[07:00:53] good morning
[07:03:51] good morning!
[07:05:11] Guten Tag!
[07:13:35] dzien dobry
[07:42:20] 06Machine-Learning-Team: Deploy peacock/tone check model to production - https://phabricator.wikimedia.org/T394779#10842698 (10isarantopoulos)
[07:42:52] elukey: o/ we plan to roll out the SLO for the peacock check model (to be renamed to tone check) some time next week https://phabricator.wikimedia.org/T390706
[07:43:05] we plan to lead this ourselves and ask for help wherever needed
[07:44:03] georgekyz: o/ shall we create a new namespace for the model in production?
[07:44:13] isaranto: that's great! I'll add some info to the task, but the first step is to work on https://wikitech.wikimedia.org/wiki/SLO/Template_instructions
[07:44:18] I was thinking that edit-check would be enough
[07:44:22] wdyt?
[07:45:38] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck: Create SLO dashboard for tone check model - https://phabricator.wikimedia.org/T390706#10842712 (10isarantopoulos)
[07:45:39] Yes I've started already. Should I remove it from experimental as well?
[07:45:55] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck: Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#10842715 (10isarantopoulos)
[07:46:00] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck: Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#10842717 (10elukey) The first step is to read and create a draft of https://wikitech.wikimedia.org/wiki/SLO/Template_instructions. I am available to have a meetin...
[07:46:22] let's leave it there and we can remove it after we do the switch in the API Gateway
[07:46:30] alright perfect
[07:47:38] georgekyz: you can find examples of how to create a namespace in past patches in admin-ng (lemme pull one for ya). we'd need an SRE to deploy that and create the namespace
[07:49:48] here it is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1072193
[07:49:57] I'm adding it in the task as a reference
[07:50:30] 06Machine-Learning-Team: Deploy peacock/tone check model to production - https://phabricator.wikimedia.org/T394779#10842721 (10isarantopoulos)
[07:50:39] isaranto: thank you
[08:27:23] morning folks :)
[08:31:59] おはよう!
[08:36:29] 06Machine-Learning-Team, 13Patch-For-Review: Deploy peacock/tone check model to production - https://phabricator.wikimedia.org/T394779#10842968 (10gkyziridis)
[08:41:57] elukey: I've updated the patch with s3cmd and we can go ahead with the s3 credential in helm.
[08:46:09] Should we deploy the latest edit-check image on prod (the one based on `amd-pytorch23`, which we tested on cpu/gpu)? Or the exact same version which runs on staging (based on `amd-pytorch25`, running on cpu but not on gpu)?
[08:49:09] kart_: ack! klausman do you have time for the s3 cred rollout for machine translation?
[08:49:34] in theory it should be sufficient to add the right AWS_ env variables to hieradata/role/common/deployment_server/kubernetes.yaml for mint
[08:49:53] can do that, sure
[08:49:53] and then upon next mint deployment, we should see the right values popping up
[08:51:56] edit-check ns patch ready, review when you have time folks ---> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1148803
[08:52:44] wait, where are the envs defined in that file?
[08:55:02] try to look for AWS_SECRET_ACCESS_KEY
[08:56:32] ITYM charts/kserve/templates/kserve.yaml in deployment charts, not the puppet hieradata file :)
[08:56:56] Or I am holding grep wrong
[08:58:10] so in this case mint is deployed like ores-legacy (so outside the kserve machinery, sadly), and on Wikikube
[08:58:40] for private env variables, the modules provide a way to pick them up from some helmfile config
[08:59:03] since it cannot be public, we use puppet to deploy some base helmfile configs on deploy1003, which you pick up when you helmfile/helm deploy
[08:59:15] Ah, I see
[08:59:45] Similar to the secret keys for MTs in cxserver. They are in the private puppet repo.
[08:59:50] yes it is not 100% straightforward but it is the best compromise
[09:00:32] in this case, in theory adding the AWS_ user/pass keys as private env vars should allow every pod to use s3cmd without issues
[09:01:21] 06Machine-Learning-Team, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843098 (10Urbanecm_WMF) >>! In T393474#10834359, @Michael wrote: >>>! In T393474#10833889, @OKarakaya-WMF wrote: >> Hello @Michael , @kostajh , @Tg...
[09:13:45] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843125 (10Michael) @Urbanecm_WMF Yes, that it explains it well. Thank you!
[09:16:21] klausman: o/ yesterday, we completed reviews on the vllm patch but couldn't merge it because the build is resource-intensive: https://gerrit.wikimedia.org/r/1146891
[09:16:21] Luca advised that we'll build and push this image to the docker registry using ml-lab. we created this task as a next step: https://phabricator.wikimedia.org/T394778
[09:16:21] please have a look whenever you get a minute. thanks!
[09:22:51] kevinbazira: ack
[09:23:40] elukey: I am still utterly puzzled by hieradata/.../kubernetes.yaml; I don't see anything related to secrets, environments, or similar in there.
[09:27:54] so if you check the AWS_ variables, they are under config->private
[09:28:57] You mean in the private repo?
[09:29:04] I can find them there, sure.
[09:29:25] yep yep (sorry I was getting info)
[09:29:41] if you pick the tegola use case: on deploy1003, you'll see /etc/helmfile-defaults/private/main_services/tegola-vector-tiles/eqiad.yaml
[09:29:47] that thing is rendered by puppet
[09:30:02] and it will be available when deploying, it is included in helmfiles
[09:30:28] config->private comes from the "modules" dir in deployment-charts, that mint uses (brb meeting)
[09:30:54] ack, will keep digging
[09:34:27] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843205 (10Michael) >>! In T393474#10838033, @OKarakaya-WMF wrote: > # Prod vs New Pipeline > > As we are likely to proceed with eith...
[09:49:16] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843287 (10Michael) >>! In T393474#10838069, @OKarakaya-WMF wrote: > Got results for top 10 languages with the new pipeline. > New pip...
[10:05:09] * klausman lunch
[10:12:30] back!
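(To make the mechanism described around 08:58-09:30 a bit more concrete: a minimal sketch of what the puppet-rendered private values file for mint might look like, assuming machinetranslation follows the same config->private convention as the tegola example and the eqiad environment. The AWS_ACCESS_KEY_ID key name and the placeholder values are assumptions for illustration; only AWS_SECRET_ACCESS_KEY is mentioned in the channel.)

    # /etc/helmfile-defaults/private/main_services/machinetranslation/eqiad.yaml
    # Rendered on deploy1003 by puppet from the private repo; picked up on the next
    # helmfile/helm deploy of mint, so every pod can run s3cmd with these credentials.
    config:
      private:
        AWS_ACCESS_KEY_ID: "<access key id from the private puppet repo>"   # assumed key name
        AWS_SECRET_ACCESS_KEY: "<secret key from the private puppet repo>"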
[10:16:17] klausman: in deployment-charts, you can check charts/machinetranslation/templates/vendor/app/generic_1.0.3.tpl, at line 35 you have
[10:16:23] {{- range $k, $v := .Values.config.private }}
[10:17:03] that is contained in {{- define "app.generic.container" }}, a define that is included for every container in the default app scaffolding from serviceops
[10:18:21] the .Values.config.private is looked up from various sources, including /etc/helmfile-defaults/private/main_services/tegola-vector-tiles/eqiad.yaml
[10:18:56] why that file? Check in deployment-charts helmfile.d/services/machinetranslation/helmfile.yaml, line 51
[10:19:07] - "/etc/helmfile-defaults/private/main_services/machinetranslation/{{ .Environment.Name }}.yaml" # prod-specific secrets, controlled by SRE
[10:19:28] so the trick is to have puppet render the private passwords to --^
[10:19:43] and on deploy1003, a helmfile diff for machinetranslation will pick up the new values
[10:25:08] the config private bit is nice since the values for the env variables are picked from a secret, so they will not be visible with a simple kubectl describe pod
[10:25:23] (there is also config public for "regular" env variables)
[10:43:51] --
[10:44:12] for some weird reason I have partman issues when reimaging, just downtimed, will work on it later on
[11:08:42] can I merge this folks?: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1148803
[11:31:32] isaranto: aiko: Should we deploy the latest edit-check image on prod (the one based on `amd-pytorch23`, which we tested on cpu/gpu)? Or the exact same image-version which runs on staging (based on `amd-pytorch25`, running on cpu but not on gpu)? What are your thoughts?
[11:34:37] georgekyz: I'd suggest using the one based on pytorch23 if it works well on both cpu and gpu. Then enabling the gpu is one deployment away. I feel that otherwise it is kind of risky. We'll need to upgrade ofc in the future -- we can focus on that when we have the training pipeline. How does that sound?
[11:35:00] elukey: thanks for the detailed explanation, I'll cook something up once I grok it.
[11:41:04] georgekyz: I vote for pytorch23 that works on both cpu and gpu
[11:41:09] georgekyz: even if you merge the namespace patch you can't deploy. so let's wait to get a review & deployment from Tobias when he has time (his hands are full atm)
[11:42:25] isaranto: aiko: I totally agree. Thnx for the response.
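(For reference on the 10:16-10:25 walkthrough above: a rough sketch of how a loop like the one quoted from generic_1.0.3.tpl typically turns each config.private entry into a container environment variable backed by a Kubernetes Secret, which is why the values do not show up in kubectl describe pod. The secret name and surrounding structure are assumptions for illustration, not the actual WMF template.)

    # Sketch only: each key under .Values.config.private becomes an env var whose
    # value is read from a Secret rather than being inlined into the pod spec.
    env:
      {{- range $k, $v := .Values.config.private }}
      - name: {{ $k }}
        valueFrom:
          secretKeyRef:
            name: {{ $.Release.Name }}-secret-config   # assumed secret name
            key: {{ $k }}
      {{- end }}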
[11:45:24] (03PS7) 10Kosta Harlan: [WIP] Add AbuseFilter variable for revertrisk score [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1051837 (https://phabricator.wikimedia.org/T364705)
[11:55:48] 06Machine-Learning-Team, 10Editing-team (Tracking): Peacock detection model GPU deployment returns inconsistent results - https://phabricator.wikimedia.org/T393154#10843806 (10gkyziridis) 05Open→03Resolved
[11:56:14] 06Machine-Learning-Team, 13Patch-For-Review: Deploy peacock/tone check model to production - https://phabricator.wikimedia.org/T394779#10843809 (10gkyziridis) a:03gkyziridis
[11:58:00] (03CR) 10CI reject: [V:04-1] [WIP] Add AbuseFilter variable for revertrisk score [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1051837 (https://phabricator.wikimedia.org/T364705) (owner: 10Kosta Harlan)
[12:10:49] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10843831 (10OKarakaya-WMF) Hey @Michael , Thank you very much for the comments and sharing the link to issues. It's great to see you h...
[12:30:16] 06Machine-Learning-Team: Investigate revertrisk lanugage agnostic errors - https://phabricator.wikimedia.org/T394910 (10isarantopoulos) 03NEW
[12:34:44] 06Machine-Learning-Team: Investigate revertrisk lanugage agnostic errors - https://phabricator.wikimedia.org/T394910#10843909 (10isarantopoulos) Running the following query on superset to check what has happened in the steaming data for these revisions ` SELECT page.page_title, wiki_id, revision.rev_id,...
[12:34:55] fyi --^ let's discuss in the team meeting
[13:30:30] 06Machine-Learning-Team: Investigate revertrisk lanugage agnostic errors - https://phabricator.wikimedia.org/T394910#10844229 (10isarantopoulos) a:03achou
[13:37:57] 06Machine-Learning-Team: Investigate revertrisk language agnostic errors - https://phabricator.wikimedia.org/T394910#10844240 (10Aklapper)
[13:49:23] ml-serve1001 back in service
[13:49:57] ty!
[13:56:10] 06Machine-Learning-Team: Investigate null scores being returned by revertrisk language agnostic - https://phabricator.wikimedia.org/T394910#10844287 (10SSalgaonkar-WMF)
[13:56:13] 06Machine-Learning-Team, 13Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10844289 (10elukey)
[14:20:08] 06Machine-Learning-Team, 13Patch-For-Review: Simplify pre-commit hooks within inference-services repository. - https://phabricator.wikimedia.org/T393865#10844411 (10BWojtowicz-WMF) Following up on the message above: There are actually more models running on Python 3.9, which I did not notice initially. This i...
[14:28:04] ^ I've listed all models using bullseye as base image here https://phabricator.wikimedia.org/T393865#10844411
[14:39:42] Dziękuję!
[14:39:48] :D
[14:59:39] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 05Goal: Q4 24-25 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10844694 (10OKarakaya-WMF) >>! In T393474#10843098, @Urbanecm_WMF wrote: >> [...] Awesome summary! Thank you very much @Urbanecm 💟
[16:06:03] * isaranto afk!
[16:25:15] 06Machine-Learning-Team, 10MediaWiki-Recent-changes, 06Moderator-Tools-Team: Run analysis to retrieve thresholds for high impact wikis to deploy recent changes revert risk language agnostic filters to - https://phabricator.wikimedia.org/T392148#10845099 (10Kgraessle) >>! In T392148#10838253, @gkyziridis wrot...
[17:37:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:37:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[17:37:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:12:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[18:12:49] Deployment reference-need-predictor-00010-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00010-deployment - ...
[18:12:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:24:30] (03PS1) 10Sbisson: Make SearchRecommender inherit from BaseRecommender [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1148932