[00:11:18] (PS2) Nik Gkountas: refactor recommenders size filtering [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211850
[00:11:56] (CR) CI reject: [V:-1] refactor recommenders size filtering [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211850 (owner: Nik Gkountas)
[05:31:56] (PS1) Kevin Bazira: llm: use fa2 that supports both MI200 and MI300X GPUs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211946 (https://phabricator.wikimedia.org/T410906)
[06:47:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:47:49] Deployment reference-need-predictor-00012-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00012-deployment - ...
[06:47:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:52:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[06:52:49] Deployment reference-need-predictor-00012-deployment in revision-models at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-deployment=reference-need-predictor-00012-deployment - ...
[06:52:49] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:10:38] (CR) Dpogorzelski: [C:+1] llm: use fa2 that supports both MI200 and MI300X GPUs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211946 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[08:11:04] (CR) Kevin Bazira: [C:+2] llm: use fa2 that supports both MI200 and MI300X GPUs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211946 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[08:11:34] (Merged) jenkins-bot: llm: use fa2 that supports both MI200 and MI300X GPUs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1211946 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[08:54:24] (CR) Nik Gkountas: "recheck" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211850 (owner: Nik Gkountas)
[08:57:44] FIRING: LiftWingServiceErrorRate: ...
[08:57:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:02:44] RESOLVED: LiftWingServiceErrorRate: ...
[09:02:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:16:04] Machine-Learning-Team, Essential-Work, Patch-For-Review: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11412126 (kevinbazira) In P85813, we built a flash-attention2 wheel that supports both gfx90a and gfx942 ROCm targets. Now the llm model-server infe...
[09:16:32] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11412129 (DPogorzelski-WMF) I would like to resume this discussion and take a practical stab at making the ml-lab1001.eqiad.wmnet an official build machine for the ML team. Let's start...
[09:45:34] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11412197 (kevinbazira) >>! In T394778#11412129, @DPogorzelski-WMF wrote: > I would like to resume this discussion and take a practical stab at making the ml-lab1001.eqiad.wmnet an offic...
[10:08:51] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11412243 (elukey) @DPogorzelski-WMF I think the plan is good, I have only a few further questions: * IIUC ml-lab1001 will become a [[ https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_...
[10:31:55] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, PersonalDashboard, and 3 others: Enable revertrisk filters in thwiki - https://phabricator.wikimedia.org/T409438#11412328 (gkyziridis) ==== Update ==== I configure all the rr thresholds for all the wikis and enabled the m...
[10:32:55] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team, PersonalDashboard, Patch-For-Review: AI/ML Infrastructure Request: Assistance in Rolling out Revert Risk to wikis that don't have damaging/goodfaith models - https://phabricator.wikimedia.org/T408607#11412332 (gkyziridi...
[10:53:44] FIRING: LiftWingServiceErrorRate: ...
[10:53:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[10:58:44] RESOLVED: LiftWingServiceErrorRate: ...
[10:58:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:47:28] (PS1) Nik Gkountas: collection recommender: split into two classes [research/recommendation-api] - https://gerrit.wikimedia.org/r/1212135
[12:48:51] (CR) CI reject: [V:-1] collection recommender: split into two classes [research/recommendation-api] - https://gerrit.wikimedia.org/r/1212135 (owner: Nik Gkountas)
[12:53:22] (PS2) Nik Gkountas: collection recommender: split into two classes [research/recommendation-api] - https://gerrit.wikimedia.org/r/1212135
[13:25:17] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11412942 (DPogorzelski-WMF) * let's start with having the machine wiped and configured for ML team access, docker-pkg installed and the host whitelisted to push to the WMF registry, we c...
[13:30:40] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11412974 (DPogorzelski-WMF) * we can also block access to major public registries in the http_proxy or via iptables on the host: # Allow your internal registry iptables -A OUTPUT -d you...
[14:26:44] FIRING: LiftWingServiceErrorRate: ...
[14:26:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:41:44] RESOLVED: LiftWingServiceErrorRate: ...
[14:41:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:57:59] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11413322 (elukey) +1 on all, seems a good plan, not sure if a higher level approach for blocking registries compared to iptables is available, but that is something that can be investig...
[15:07:20] George and I in campfire meeting ... will folks be joining?
[15:10:49] 10 mins in ... we are leaving the meeting o/
[15:11:29] o/ info on slack, since there’s no topic for today's campfire meeting, it's cancelled
[15:18:54] ack. ty!
[15:34:50] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11413462 (DPogorzelski-WMF) cool, i'll shoot a message in IRC to the sig regarding "You create a new production-images-like repository in gerrit and commit to it only ML Dockerfiles, wi...
[15:52:32] (PS1) Kevin Bazira: llm: trigger image build after fixing dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1212176 (https://phabricator.wikimedia.org/T410906)
[15:53:51] (CR) Kevin Bazira: [C:+2] llm: trigger image build after fixing dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1212176 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[15:54:20] (Merged) jenkins-bot: llm: trigger image build after fixing dependencies [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1212176 (https://phabricator.wikimedia.org/T410906) (owner: Kevin Bazira)
[18:30:03] (CR) Sbisson: "Makes sense." [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211850 (owner: Nik Gkountas)
[19:15:47] (CR) Sbisson: [C:+2] add support for pagination for single page collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206409 (https://phabricator.wikimedia.org/T384485) (owner: Nik Gkountas)
[19:16:23] (Merged) jenkins-bot: add support for pagination for single page collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1206409 (https://phabricator.wikimedia.org/T384485) (owner: Nik Gkountas)
[19:46:11] (PS4) Sbisson: New endpoint to check if articles are part of a collection [research/recommendation-api] - https://gerrit.wikimedia.org/r/1211726 (https://phabricator.wikimedia.org/T408844)