[06:34:30] good morning
[06:52:18] morning! :)
[07:03:34] hello!
[08:00:09] Mornin
[08:00:24] elukey: I'll do 1011 in a minute
[08:13:22] Machine-Learning-Team: Article Summary Generation and Evaluation Pipeline using vLLM image - https://phabricator.wikimedia.org/T395246#10858281 (kevinbazira) Building on top of the Research team's [[ https://docs.google.com/document/d/1Ukw8Rw46rKyttAviwLsZuLsIlheuTeb1MXnePdiyvLs/edit?tab=t.0 | work ]] that r...
[08:13:45] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10858283 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1003 for host ml-serve1011.eqiad.wmnet with OS bookworm
[08:14:49] morning morning o/
[08:14:49] isaranto: as we discussed yesterday, I worked on an initial pipeline for generating and evaluating article-summaries using the vllm image on ml-lab1002: https://phabricator.wikimedia.org/T395246#10858281
[08:14:49] going to continue working on decoupling model loading and inference as we do in the ML isvcs
[08:23:28] o/ kevinbazira going to take a look in a bit. thanks for the update!
[08:36:56] okok
[08:37:51] kevinbazira: I read the update, sounds wonderful! may I suggest sth? I don't know how the code is structured at the moment but as you mentioned decoupling model loading from inference makes perfect sense. we want the ability to switch between the following: [model loaded locally, request to vllm docker container running locally, Lift Wing service (in the future) or any external api] so having this in a separate function would
[08:37:52] be great
[08:39:08] yep, your suggestion is good. that's what I'm aiming for.
[09:00:41] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10858480 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1003 for host ml-serve1011.eqiad.wmnet with OS bookworm completed:...
[09:02:34] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10858486 (klausman)
[09:03:30] awesome. thank you for sharing early!
[09:09:19] klausman: o/ ack (forgot to do it earlier)
[09:09:24] I'll try to do 1005 later on
[09:12:55] Roger. I also have https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151134 ready for 1010. I self-+2'd the 1011 change since I had already started the reimage cookbook and didn't want to let it time-out
[09:13:40] yes perfect makes sense!
[09:14:26] at this rate we can probably target the PSS migration to early next week
[09:15:46] Yeah, that makes sense. Note that I am out Thursday (public holiday, Giorno dell'Ascensione)
[09:23:14] I'll be out on Tue, but I can finish the work on Thu if needed
[09:23:25] reimages are relatively low effort
[09:23:53] for the PSS migration we'll probably need 2/3 hours, nothing big
[09:36:19] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10858618 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1003 for host ml-serve1010.eqiad.wmnet with OS bookworm
[10:09:13] Lift-Wing, Machine-Learning-Team, EditCheck: Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#10858795 (gkyziridis) a:gkyziridis
[10:27:48] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10858818 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1003 for host ml-serve1010.eqiad.wmnet with OS bookworm completed:...
[10:29:07] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10858819 (klausman)
[10:29:39] * klausman lunch
[10:35:15] depooling and working on ml-serve1005
[11:17:34] going afk for lunch, ml-serve1005 should be ready right after that
[11:59:33] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10859015 (Kgraessle) a:Kgraessle
[12:02:30] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize peacock detection model - https://phabricator.wikimedia.org/T391940#10859027 (gkyziridis)
[12:19:50] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize peacock detection model - https://phabricator.wikimedia.org/T391940#10859137 (gkyziridis) ==== Updates: ===== Introducing ToneCheck - Request schema changed from `peacock` to `tone`: - `"instances": [{"lang": "en", "check_type": "tone", "o...
[12:46:59] ml-serve1005 back in service, proceeding with 1006
[12:50:45] I'll start 1009 in a moment
[12:58:54] Machine-Learning-Team, Goal: Q4 24-25 Goal: Productionize peacock detection model - https://phabricator.wikimedia.org/T391940#10859253 (achou) Update: - HIL model evaluation for French, Spanish, Japanese, Portuguese, English is in progress - Restructure the data generation notebook to a [[ https://gitlab...
[13:00:06] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10859267 (gkyziridis) ==== Next steps: 1. Highlighting UI issue investigation: @Kgraessle 2. QA on classification is...
[13:02:08] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10859286 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1003 for host ml-serve1009.eqiad.wmnet with OS bookworm
[13:03:32] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10859294 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1003 for host ml-serve1009.eqiad.wmnet with OS bookworm executed wi...
[13:05:22] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10859297 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1003 for host ml-serve1009.eqiad.wmnet with OS bookworm
[13:27:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:27:49] Deployment article-descriptions-predictor-default-00015-deployment in article-descriptions at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[13:27:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=article-descriptions&var-deployment=article-descriptions-predictor-default-00015-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:29:30] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10859372 (gkyziridis) >>! In T395256#10857325, @isarantopoulos wrote: > 1. highlighting indeed didn't work either on...
[13:32:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[13:32:49] Deployment article-descriptions-predictor-default-00015-deployment in article-descriptions at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[13:32:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=article-descriptions&var-deployment=article-descriptions-predictor-default-00015-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:43:13] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10859411 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1003 for host ml-serve1009.eqiad.wmnet with OS bookworm completed: - ml-serve1009 (**PASS...
[13:43:49] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10859413 (klausman)
[13:55:40] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10859462 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1003 for host ml-serve1008.eqiad.wmnet with OS bookworm
[14:59:10] ml-serve1006 done!
[14:59:40] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10859763 (elukey)
[15:09:59] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10859807 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1003 for host ml-serve1008.eqiad.wmnet with OS bookworm completed: - ml-serve1008 (**PASS...
[15:13:13] elukey: 8 also done. I'll send a final patch for 2007 and cleanup
[15:13:18] er 1007
[15:18:26] ml-serve1008.eqiad.wmnet
[15:18:32] oops, mispaste
[15:18:34] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151237
[15:33:04] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10859882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1003 for host ml-serve1007.eqiad.wmnet with OS bookworm
[15:42:25] really nice!
[15:44:46] it's always satisfying to drop a bunch of special cases :)
[15:59:19] definitely :)
[15:59:27] I think that tomorrow we can start talking about https://phabricator.wikimedia.org/T369493#10792884
[16:14:38] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10860187 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1003 for host ml-serve1007.eqiad.wmnet with OS bookworm completed: - ml-serve1007 (**PASS...
[16:15:58] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10860191 (DMburugu) p:Triage→High
[16:22:32] Machine-Learning-Team: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10860242 (klausman)
[17:06:44] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10860513 (Kgraessle) @isarantopoulos @gkyziridis Hello, I'm just following up on the highlighting issue after chat...
[17:09:43] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10860534 (Kgraessle)
[17:35:39] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10860746 (isarantopoulos) After a conversation on slack we have scheduled the deployment for UTC morning backport wi...
[19:04:09] Machine-Learning-Team, Goal: 2024-25 Q4: Users can "pip install liftwing" and access 20% of models - https://phabricator.wikimedia.org/T359140#10861145 (Aklapper)
[19:04:47] Machine-Learning-Team, Goal: 2024-25 Q4: Users can "pip install liftwing" and access 20% of models - https://phabricator.wikimedia.org/T359140#10861146 (Aklapper) a:Mercelisvaughan→None @calbon: Unassigning inactive task assignee who seems to have been a WMF contractor though did not link their P...