[07:00:53] good morning
[07:03:51] good morning!
[07:05:11] Guten Tag!
[07:13:35] dzien dobry
[07:42:20] 06Machine-Learning-Team: Deploy peacock/tone check model to production - https://phabricator.wikimedia.org/T394779#10842698 (10isarantopoulos)
[07:42:52] elukey: o/ we plan to roll out the SLO for the peacock check model (to be renamed to tone check) some time next week https://phabricator.wikimedia.org/T390706
[07:43:05] we plan to lead this ourselves and ask for help wherever needed
[07:44:03] georgekyz: o/ shall we create a new namespace for the model in production?
[07:44:13] isaranto: that's great! I'll add some info to the task, but the first step is to work on https://wikitech.wikimedia.org/wiki/SLO/Template_instructions
[07:44:18] I was thinking that edit-check would be enough
[07:44:22] wdyt?
[07:45:38] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck: Create SLO dashboard for tone check model - https://phabricator.wikimedia.org/T390706#10842712 (10isarantopoulos)
[07:45:39] Yes I've started already. Should I remove it from experimental as well?
[07:45:55] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck: Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#10842715 (10isarantopoulos)
[07:46:00] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck: Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#10842717 (10elukey) The first step is to read and create a draft of https://wikitech.wikimedia.org/wiki/SLO/Template_instructions. I am available to have a meetin...
[07:46:22] let's leave it there and we can remove it after we do the switch in the API Gateway
[07:46:30] alright perfect
[07:47:38] georgekyz: you can find examples of how to create a namespace in past patches in admin-ng (lemme pull one for ya). we'd need an SRE to deploy that and create the namespace
[07:49:48] here it is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1072193
[07:49:57] I'm adding it in the task as a reference
[07:50:30] 06Machine-Learning-Team: Deploy peacock/tone check model to production - https://phabricator.wikimedia.org/T394779#10842721 (10isarantopoulos)
[07:50:39] isaranto: thank you
[08:27:23] morning folks :)
[08:31:59] おはよう!
[08:36:29] 06Machine-Learning-Team, 13Patch-For-Review: Deploy peacock/tone check model to production - https://phabricator.wikimedia.org/T394779#10842968 (10gkyziridis)
[08:41:57] elukey: I've updated the patch with s3cmd and we can go ahead with the s3 credential in helm.
[08:46:09] Should we deploy the latest edit-check image on prod (the one based on `amd-pytorch23`, which we tested on cpu/gpu)? Or the exact same version which runs on staging (based on `amd-pytorch25`, running on cpu but not on gpu)?
[08:49:09] kart_: ack! klausman do you have time for the s3 cred rollout for machine translation?
[08:49:34] in theory it should be sufficient to add the right AWS_ env variables to hieradata/role/common/deployment_server/kubernetes.yaml for mint
[08:49:53] can do that, sure
[08:49:53] and then upon next mint deployment, we should see the right values popping up
[08:51:56] edit-check ns patch ready, review when you have time folks ---> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1148803
[08:52:44] wait, where are the envs defined in that file?
[08:55:02] try to look for AWS_SECRET_ACCESS_KEY
[08:56:32] ITYM charts/kserve/templates/kserve.yaml in deployment charts, not the puppet hieradata file :)
[08:56:56] Or I am holding grep wrong
[08:58:10] so in this case mint is deployed like ores-legacy (so outside the kserve machinery, sadly), and on Wikikube
[08:58:40] for private env variables, the modules provide a way to pick them up from some helmfile config
[08:59:03] since it cannot be public, we use puppet to deploy some base helmfile configs on deploy1003, which you pick up when you helmfile/helm deploy
[08:59:15] Ah, I see
[08:59:45] Similar to the secret keys for MTs in cxserver. They are in the private puppet repo.
[08:59:50] yes it is not 100% straightforward but it is the best compromise
[09:00:32] in this case, in theory adding the AWS_ user/pass keys as private env vars should allow every pod to use s3cmd without issues
[09:01:21] 06Machine-Learning-Team, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843098 (10Urbanecm_WMF) >>! In T393474#10834359, @Michael wrote: >>>! In T393474#10833889, @OKarakaya-WMF wrote: >> Hello @Michael , @kostajh , @Tg...
[09:13:45] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843125 (10Michael) @Urbanecm_WMF Yes, that it explains it well. Thank you!
[09:16:21] klausman: o/ yesterday, we completed reviews on the vllm patch but couldn't merge it because the build is resource-intensive: https://gerrit.wikimedia.org/r/1146891
[09:16:21] Luca advised that we'll build and push this image to the docker registry using ml-lab. we created this task as a next step: https://phabricator.wikimedia.org/T394778
[09:16:21] please have a look whenever you get a minute. thanks!
[09:22:51] kevinbazira: ack
[09:23:40] elukey: I am still utterly puzzled by hieradata/.../kubernetes.yaml; I don't see anything related to secrets, environments, or similar in there.
[09:27:54] so if you check the AWS_ variables, they are under config->private
[09:28:57] You mean in the private repo?
[09:29:04] I can find them there, sure.
[09:29:25] yep yep (sorry I was getting info)
[09:29:41] if you pick the tegola use case: on deploy1003, you'll see /etc/helmfile-defaults/private/main_services/tegola-vector-tiles/eqiad.yaml
[09:29:47] that thing is rendered by puppet
[09:30:02] and it will be available when deploying, it is included in helmfiles
[09:30:28] config->private comes from the "modules" dir in deployment-charts, that mint uses (brb meeting)
[09:30:54] ack, will keep digging
[09:34:27] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843205 (10Michael) >>! In T393474#10838033, @OKarakaya-WMF wrote: > # Prod vs New Pipeline > > As we are likely to proceed with eith...
[09:49:16] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843287 (10Michael) >>! In T393474#10838069, @OKarakaya-WMF wrote: > Got results for top 10 languages with the new pipeline. > New pip...
[10:05:09] * klausman lunch
[10:12:30] back!
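(To make the mechanism described around 08:58-09:30 a bit more concrete: a minimal sketch of what the puppet-rendered private values file for mint might look like, assuming machinetranslation follows the same config->private convention as the tegola example and the eqiad environment. The AWS_ACCESS_KEY_ID key name and the placeholder values are assumptions for illustration; only AWS_SECRET_ACCESS_KEY is mentioned in the channel.)

    # /etc/helmfile-defaults/private/main_services/machinetranslation/eqiad.yaml
    # Rendered on deploy1003 by puppet from the private repo; picked up on the next
    # helmfile/helm deploy of mint, so every pod can run s3cmd with these credentials.
    config:
      private:
        AWS_ACCESS_KEY_ID: "<access key id from the private puppet repo>"   # assumed key name
        AWS_SECRET_ACCESS_KEY: "<secret key from the private puppet repo>"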
[10:16:17] klausman: in deployment-charts, you can check charts/machinetranslation/templates/vendor/app/generic_1.0.3.tpl, at line 35 you have
[10:16:23] {{- range $k, $v := .Values.config.private }}
[10:17:03] that is contained in {{- define "app.generic.container" }}, a define that is included for every container in the default app scaffolding from serviceops
[10:18:21] the .Values.config.private is looked up from various sources, including /etc/helmfile-defaults/private/main_services/tegola-vector-tiles/eqiad.yaml
[10:18:56] why that file? Check in deployment-charts helmfile.d/services/machinetranslation/helmfile.yaml, line 51
[10:19:07] - "/etc/helmfile-defaults/private/main_services/machinetranslation/{{ .Environment.Name }}.yaml" # prod-specific secrets, controlled by SRE
[10:19:28] so the trick is to have puppet render the private passwords to --^
[10:19:43] and on deploy1003, a helmfile diff for machinetranslation will pick up the new values
[10:25:08] the config private bit is nice since the values for the env variables are picked from a secret, so they will not be visible with a simple kubectl describe pod
[10:25:23] (there is also config public for "regular" env variables)
[10:43:51] --
[10:44:12] for some weird reason I have partman issues when reimaging, just downtimed, will work on it later on
[11:08:42] can I merge this folks?: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1148803
[11:31:32] isaranto: aiko: Should we deploy the latest edit-check image on prod (the one based on `amd-pytorch23`, which we tested on cpu/gpu)? Or the exact same image-version which runs on staging (based on `amd-pytorch25`, running on cpu but not on gpu)? What are your thoughts?
[11:34:37] georgekyz: I'd suggest using the one based on pytorch23 if it works well on both cpu and gpu. Then enabling the gpu is one deployment away. I feel that otherwise it is kind of risky. We'll need to upgrade ofc in the future -- we can focus on that when we have the training pipeline. How does that sound?
[11:35:00] elukey: thanks for the detailed explanation, I'll cook something up once I grok it.
[11:41:04] georgekyz: I vote for pytorch23 that works on both cpu and gpu
[11:41:09] georgekyz: even if you merge the namespace patch you can't deploy. so let's wait to get a review & deployment from Tobias when he has time (his hands are full atm)
[11:42:25] isaranto: aiko: I totally agree. Thnx for the response.
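(For reference on the 10:16-10:25 walkthrough above: a rough sketch of how a loop like the one quoted from generic_1.0.3.tpl typically turns each config.private entry into a container environment variable backed by a Kubernetes Secret, which is why the values do not show up in kubectl describe pod. The secret name and surrounding structure are assumptions for illustration, not the actual WMF template.)

    # Sketch only: each key under .Values.config.private becomes an env var whose
    # value is read from a Secret rather than being inlined into the pod spec.
    env:
      {{- range $k, $v := .Values.config.private }}
      - name: {{ $k }}
        valueFrom:
          secretKeyRef:
            name: {{ $.Release.Name }}-secret-config   # assumed secret name
            key: {{ $k }}
      {{- end }}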
[11:45:24] (03PS7) 10Kosta Harlan: [WIP] Add AbuseFilter variable for revertrisk score [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1051837 (https://phabricator.wikimedia.org/T364705)
[11:55:48] 06Machine-Learning-Team, 10Editing-team (Tracking): Peacock detection model GPU deployment returns inconsistent results - https://phabricator.wikimedia.org/T393154#10843806 (10gkyziridis) 05Open→03Resolved
[11:56:14] 06Machine-Learning-Team, 13Patch-For-Review: Deploy peacock/tone check model to production - https://phabricator.wikimedia.org/T394779#10843809 (10gkyziridis) a:03gkyziridis
[11:58:00] (03CR) 10CI reject: [V:04-1] [WIP] Add AbuseFilter variable for revertrisk score [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1051837 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan)
[12:10:49] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843831 (10OKarakaya-WMF) Hey @Michael , Thank you very much for the comments and sharing the link to issues. It's great to see you h...
[12:30:16] 06Machine-Learning-Team: Investigate revertrisk lanugage agnostic errors - https://phabricator.wikimedia.org/T394910 (10isarantopoulos) 03NEW
[12:34:44] 06Machine-Learning-Team: Investigate revertrisk lanugage agnostic errors - https://phabricator.wikimedia.org/T394910#10843909 (10isarantopoulos) Running the following query on superset to check what has happened in the steaming data for these revisions ` SELECT page.page_title, wiki_id, revision.rev_id,...
[12:34:55] fyi --^ let's discuss in the team meeting
[13:30:30] 06Machine-Learning-Team: Investigate revertrisk lanugage agnostic errors - https://phabricator.wikimedia.org/T394910#10844229 (10isarantopoulos) a:03achou
[13:37:57] 06Machine-Learning-Team: Investigate revertrisk language agnostic errors - https://phabricator.wikimedia.org/T394910#10844240 (10Aklapper)
[13:49:23] ml-serve1001 back in service
[13:49:57] ty!
[13:56:10] 06Machine-Learning-Team: Investigate null scores being returned by revertrisk language agnostic - https://phabricator.wikimedia.org/T394910#10844287 (10SSalgaonkar-WMF)
[13:56:13] 06Machine-Learning-Team, 13Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10844289 (10elukey)
[14:20:08] 06Machine-Learning-Team, 13Patch-For-Review: Simplify pre-commit hooks within inference-services repository. - https://phabricator.wikimedia.org/T393865#10844411 (10BWojtowicz-WMF) Following up on the message above: There are actually more models running on Python 3.9, which I did not notice initially. This i...
[14:28:04] ^ I've listed all models using bullseye as base image here https://phabricator.wikimedia.org/T393865#10844411
[14:39:42] Dziękuję!
[14:39:48] :D
[14:59:39] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10844694 (10OKarakaya-WMF) >>! In T393474#10843098, @Urbanecm_WMF wrote: >> [...] Awesome summary! Thank you very much @Urbanecm 💟
[16:06:03] * isaranto afk!
[16:25:15] 06Machine-Learning-Team, 10MediaWiki-Recent-changes, 06Moderator-Tools-Team: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10845099 (10Kgraessle) >>! In T392148#10838253, @gkyziridis wrot...
[17:37:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:37:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[17:37:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:12:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[18:12:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[18:12:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:24:30] (03PS1) 10Sbisson: Make SearchRecommender inherit from BaseRecommender [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1148932