[05:19:06] Good morning folks o/
[06:52:36] (CR) Ilias Sarantopoulos: "Nice work!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031590 (https://phabricator.wikimedia.org/T363506) (owner: Kevin Bazira)
[07:01:59] * isaranto afk, be back in an hour
[07:10:17] (PS4) Kevin Bazira: logo-detection: process image objects instead of image URLs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031590 (https://phabricator.wikimedia.org/T363506)
[07:15:35] (CR) Kevin Bazira: "sure, the functionality to download images from URLs has been retained." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031590 (https://phabricator.wikimedia.org/T363506) (owner: Kevin Bazira)
[08:07:12] * isaranto back!
[08:54:38] morning!
[09:02:02] morning Aiko!
[10:50:18] * klausman lunch
[11:08:25] * isaranto lunch + dentist
[12:23:07] * isaranto back!
[12:39:15] o/
[12:41:31] o/ elukey Buon pomeriggio, come stai? (good afternoon, how are you?)
[12:59:55] molto bene, grazie (very well, thanks)
[13:02:10] I started a daily goal of 15 minutes of learning Italian on Duolingo. let's see how that goes
[13:03:16] I am learning Mandarin at the moment :D
[13:04:59] super!
[13:05:32] I am surely not going to make it, it's sooo difficult, but I am learning a ton of cultural context that I love
[13:05:37] the tones are really tough
[13:07:29] elukey: 太棒了!! (awesome!!) :D
[13:11:08] Machine-Learning-Team, serviceops, Patch-For-Review: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638#9804401 (JMeybohm)
[13:23:42] aiko: I am struggling with pinyin, I cannot really read it yet :D
[13:23:45] I wish I could!
[13:23:52] but I have to use a translator :D
[13:30:27] natematias: hello! there is some information about the evaluation of the model on the model card page -> https://meta.wikimedia.org/wiki/Machine_learning_models/Production/English_Wikipedia_goodfaith_edit
[13:30:27] evaluation is reported on a test split after training was done. These are available on the model cards for all ORES models. However, there is no evaluation on real-time data available (that is, if you want to evaluate revisions between Aug 2019 and Feb 2020).
[13:31:03] do the model cards help or do you need sth else?
[14:00:08] (CR) Ilias Sarantopoulos: [C:+1] "Done" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031590 (https://phabricator.wikimedia.org/T363506) (owner: Kevin Bazira)
[14:20:01] https://github.com/opencontainers/runc/commit/81707abd33d2ddebcd8ceeb08dfc01bf86d8badd
[14:21:33] https://github.com/opencontainers/runc/commit/efb8552b05431520d66ecd970628b35126039629
[14:21:51] "The test verifies if the device file can be queried using 'access(dev_name, F_OK)' when the permissions are set to 'rw'. The call does not explicitly trigger read or write access but should succeed."
[14:22:27] ok, so now the last bit is to verify whether or not our runc version contains https://github.com/opencontainers/runc/commit/81707abd33d2ddebcd8ceeb08dfc01bf86d8badd
[14:22:32] if not, then we have found the issue
[14:23:00] wow, great work digging that up!
[14:23:34] again, credits to Janis, since I didn't know that runc was responsible for adding the eBPF rule/program that checks dev permissions
[14:23:35] I hope we didn't include it! (hoping that this is the better outcome)
[14:23:42] exactly, yes!
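The runc test quoted at [14:21:51] boils down to an existence check on a device node: access(2) with F_OK opens no file descriptor and requests no read/write mode, so the device cgroup's eBPF filter should let it through whenever the device is listed with 'rw'. A minimal sketch of that probe from inside a container, using Python's os.access (which wraps access(2)); the /dev/kfd path is an assumption here, standing in for whichever GPU device node the cgroup rules cover:

```python
import os

# access(dev_name, F_OK) is a pure existence check: no file descriptor is
# opened and no read/write access mode is requested, so it should succeed
# whenever the device cgroup grants 'rw' on the device. On runc versions
# without the fix linked above, the generated eBPF device filter rejected
# this zero-access-mode query, which is the failure mode being discussed.
dev_name = "/dev/kfd"  # assumed ROCm/AMD GPU device node

if os.access(dev_name, os.F_OK):
    print(f"{dev_name} is queryable: the device filter allows access(2)")
else:
    print(f"{dev_name} query denied or missing: pre-fix runc behaviour")
```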
[14:23:49] TIL runc
[14:24:50] and of course salsa.debian.org is under maintenance now :D
[14:25:19] anyway, we have runc version 1.0.0~rc93+ds1-5+deb11u3
[14:26:03] and the first tag that I see for the commit in https://github.com/opencontainers/runc/commit/81707abd33d2ddebcd8ceeb08dfc01bf86d8badd is 1.0.0~rc94
[14:26:43] so yes, this is it :)
[14:26:49] the bookworm version is fixed
[14:28:11] haha great work Janis and Luca!
[14:32:06] I don't see runc in bullseye-backports https://packages.debian.org/search?searchon=sourcenames&keywords=runc
[14:32:31] so ideally we should upgrade to Bookworm, but afaics it is not yet supported by our k8s infra
[14:34:03] moving to bookworm would also allow us to test the new worker setup without the rocm packages
[14:36:08] mmm wait, I see the kubelet package being present in bookworm
[14:36:55] this might be a huge relief
[14:37:03] what shall we upgrade to bookworm?
[14:37:18] ml-staging2001 basically
[14:37:37] but first I'd need to refactor puppet to avoid deploying the rocm packages
[14:37:41] ack
[14:45:09] Machine-Learning-Team: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166 (isarantopoulos) NEW
[14:45:49] Machine-Learning-Team: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9805114 (isarantopoulos)
[14:51:15] TIL runc too!!
[14:53:10] (CR) Kevin Bazira: [C:+2] "thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031590 (https://phabricator.wikimedia.org/T363506) (owner: Kevin Bazira)
[14:53:55] (Merged) jenkins-bot: logo-detection: process image objects instead of image URLs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1031590 (https://phabricator.wikimedia.org/T363506) (owner: Kevin Bazira)
[15:12:44] FIRING: LiftWingServiceErrorRate: ...
[15:12:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:13:31] nooo
[15:14:01] ah viwiki again..
[15:14:15] probably that IP
[15:31:26] Machine-Learning-Team, Patch-For-Review: Test if we can avoid ROCm debian packages on k8s nodes - https://phabricator.wikimedia.org/T363191#9805400 (elukey) In order to solve this task and T362984 we should upgrade to Bookworm, but we'd be the first ones to test it. So far: * amd-k8s-device-plugin was...
[15:35:47] I made a recommendation to bump the number of replicas until we deal with this issue https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1032517
[15:36:12] isaranto: are we already running with 4 instances?
[15:36:13] the same as we did for ruwiki-damaging, but this time for viwiki-reverted
[15:37:18] yes sure, but maxReplicas is 4, so I am wondering if scaling already took care of bumping the instances to 4
[15:37:22] and if we are saturating the CPU
[15:37:32] if not, bumping to 6 doesn't make much sense
[15:37:44] RESOLVED: LiftWingServiceErrorRate: ...
[15:37:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-reverted&var-backend=viwiki-reverted-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
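As a side note, the version reasoning above ("we have 1.0.0~rc93+ds1-5+deb11u3, the first tag containing the fix is 1.0.0~rc94, so yes this is it") can be verified mechanically with Debian's own ordering rules, where '~' sorts before everything else. A minimal sketch, assuming the python3-apt bindings are available:

```python
import apt_pkg  # provided by the python3-apt package (an assumption here)

apt_pkg.init()

installed = "1.0.0~rc93+ds1-5+deb11u3"  # runc on the bullseye ML hosts
first_fixed = "1.0.0~rc94"              # first version carrying the fix

# version_compare returns a value < 0 when the first argument sorts before
# the second under Debian version ordering, i.e. the installed runc is older.
if apt_pkg.version_compare(installed, first_fixed) < 0:
    print("installed runc predates the fix -> the bug is present")
else:
    print("installed runc already contains the fix")
```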
[15:42:09] Lift-Wing, Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9805465 (elukey) Finally we found the issue, see https://github.com/ROCm/k8s-device-plugin/issues/65#issuecomment-2115414637 The only option seems to be to upgrade `ml-staging2001` to Bookw...
[15:42:39] elukey: you are right, I see that autoscaling kicked in, we got 4 replicas and the alert was resolved
[15:42:52] isaranto: so from https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=revscoring-editquality-reverted&viewPanel=24&from=now-24h&to=now it seems that we are already running with 4 replicas
[15:43:35] and yeah
[15:43:35] https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-reverted&var-pod=viwiki-reverted-predictor-default-00019-deployment-bcd64fbqdtbq&var-pod=viwiki-reverted-predictor-default-00019-deployment-bcd64fbw6svr&var-pod=viwiki-reverted-predictor-default-00019-deployment-bcd64fbwd25v&var-pod=viwiki-reverted-predictor-default-00019-deployment-bcd64fbxqtnw&var-container=All shows cpu usage
[15:43:43] nevermind my previous comment, indeed the pods have been there for 7 days
[15:44:09] go back 6 hours in the last grafana link
[15:44:16] you can see the increase in cpu usage
[15:44:26] pretty sure we can track it down to a single client
[15:47:50] yes, the CPU spikes match exactly the increased latencies in the isvc https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-reverted&var-component=All&var-model_name=viwiki-reverted&from=now-12h&to=now&viewPanel=33
[15:50:05] https://logstash.wikimedia.org/goto/f853ac2e71e93489d27c15451b3c6c0c
[15:50:14] this is the traffic as seen by the viwiki-reverted pod
[15:51:51] I see one IPv6 address hitting us
[15:52:36] isaranto: I'd say that we could raise the max instances for all reverted to 6
[15:52:52] so other isvcs will benefit from the new autoscaling settings
[15:53:07] I suspect we'd get the same issue if the others start seeing traffic
[15:58:33] I'll do that first thing in the morning then
[15:58:43] let's do it now if you have 5 mins
[15:58:47] sure
[15:58:55] because I think we may see another alert otherwise
[15:58:58] for all? or just viwiki?
[15:59:29] I'd do 6 as max for all, and bump the max for viwiki to 8 for the moment
[16:00:50] ack!
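The "single client" conclusion a few messages up came from the Logstash view of the pod's access logs; the aggregation behind it is essentially a per-client request count. A rough sketch of the same check done locally in Python, assuming JSON log lines on stdin with a client_ip field (the field name is a guess, real log schemas vary):

```python
import json
import sys
from collections import Counter

# Count requests per client address in access-log lines read from stdin,
# to spot a single heavy client like the IPv6 address mentioned above.
counts = Counter()
for line in sys.stdin:
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip non-JSON lines
    counts[event.get("client_ip", "unknown")] += 1

# Print the top five clients by request volume.
for ip, n in counts.most_common(5):
    print(f"{n:8d}  {ip}")
```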
[16:00:52] on it
[16:06:19] just give me a sec, I'm also responding to some messages about an issue with the automoderator
[16:07:45] I can do it if you want
[16:07:50] so we can split
[16:12:00] it is ready https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1032517 but CI doesn't like me
[16:12:12] jenkins fails on all patches with the same message at the moment
[16:12:24] https://integration.wikimedia.org/ci/job/helm-lint/17448/console
[16:13:32] same thing for me https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1032531
[16:13:38] sigh
[16:14:07] yeah, I saw it in other patches as well :(
[16:15:29] pinged releng on #operations, they are working on it
[16:16:59] isaranto: I manually bumped the max viwiki replicas to 8, so we can patch it tomorrow without staying too late
[16:17:01] thanks! I saw some references in #releng and here https://phabricator.wikimedia.org/T282893
[16:17:42] Grazie! (thanks!)
[16:18:00] np! Let's review it tomorrow :)
[16:18:08] going afk for today, o/
[16:18:14] have a nice rest of the day folks
[16:18:54] I'm logging off as well, cu folks!
[16:33:18] night all!
[16:50:34] artificial-intelligence, Machine-Learning-Team, articlequality-modeling: Articlequality model for nlwiki doesn't seem to track images correctly. - https://phabricator.wikimedia.org/T304973#9806023 (Aklapper) a:Halfak→None @Halfak: Removing task assignee as this open task has been assigned for...
[16:51:53] artificial-intelligence, Machine-Learning-Team, Edit-Review-Improvements-RC-Page, editquality-modeling, and 3 others: Enable ORES in RecentChanges for Hindi Wikipedia - https://phabricator.wikimedia.org/T303293#9806025 (Aklapper) a:Halfak→None @Halfak: Removing task assignee as this open...
[16:54:55] Lift-Wing: Test LiftWing API/Predictions from Hadoop - https://phabricator.wikimedia.org/T304425#9806096 (Aklapper) a:gmodena→None @gmodena: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15. Please assign this...
[17:12:11] Machine-Learning-Team, Add-Link, Growth-Scaling, Growth-Team: Establish processes for running the dataset pipeline - https://phabricator.wikimedia.org/T276438#9806304 (Aklapper) a:kevinbazira→None @kevinbazira: Removing task assignee as this open task has been assigned for more than two y...
[18:40:01] isaranto: thanks for the links; it's great to see per-language results! To clarify, I wasn't looking so much for realtime information, but just wanting to be sure that any performance data actually applied to the models in use at the time, so this is just what I need.
[18:45:55] One last question: what's the canonical reference for the model cards? I see that there's:
[18:45:58] - the page on Meta: https://meta.wikimedia.org/wiki/Machine_learning_models
[18:46:05] - the data in Gitlab: https://gitlab.wikimedia.org/htriedman/ores-data/-/tree/main/model_info?ref_type=heads
[18:46:15] What is the canonical reference? Thanks!
[19:33:38] Machine-Learning-Team, Patch-For-Review: Deploy RR-language-agnostic batch version to prod - https://phabricator.wikimedia.org/T358744#9806775 (CodeReviewBot) aikochou opened https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/40 feat(revertrisk): add support for batch predi...
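For anyone following the model card thread above: the hosted models themselves can be queried over HTTP. A hedged example of scoring a revision with the enwiki goodfaith model through the public Lift Wing endpoint on api.wikimedia.org (the URL pattern follows the published Lift Wing API documentation; the rev_id is an arbitrary example value):

```python
import requests

# Score a single revision with the hosted enwiki-goodfaith model. The
# model name follows the <wiki>-<model> convention seen elsewhere in
# this log (viwiki-reverted, ruwiki-damaging); rev_id is a placeholder.
url = "https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-goodfaith:predict"
resp = requests.post(url, json={"rev_id": 12345}, timeout=30)
resp.raise_for_status()
print(resp.json())  # probabilities for the goodfaith prediction
```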