[05:15:21] Machine-Learning-Team, Essential-Work: Update Aya LLM model-server to run on LiftWing GPUs - https://phabricator.wikimedia.org/T410906#11418131 (kevinbazira)
[05:15:23] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11418132 (kevinbazira)
[05:26:59] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11418149 (kevinbazira) In T410906#11415517, we successfully tested the [llm model-server](https://github.com/wikimedia/machinelearning-liftwing-inference-services/tree/06aa6a4fbc36bcdc2374...
[08:30:28] Machine-Learning-Team: Setup & experiments for MI300x GPUs used for LiftWing - https://phabricator.wikimedia.org/T403599#11418322 (elukey) This is a great milestone! Thanks a lot for the work Kevin :) After the last chat on Slack I'd do another quick/little test to see if the AMD GPU plugin works as expecte...
[09:52:03] dpogorzelski, klausman o/ I am working with netops to remove ml-serve1013 from the analytics vlan, like we did with 1012, I am going to turn it off etc..
[10:06:23] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11418705 (elukey) To keep archives happy - the ml-serve1012 and 1013 hosts have been removed from the analytics vlan.
[10:19:44] I am reimaging now, so ml-serve1013 will be ready to be added to k8s as well
[10:38:40] ty!
[10:51:02] ml-serve1013 doesn't seem to respond well on the reimage, and the mgmt console doesn't show much. Tried to reprovision (no diff), and not powercycling
[10:59:01] ah ok interesting, I had to power it down for Arzhel to fix ip addresses in netbox etc.. but the reimage powercycle and the powercycle seem not to have powered it up
[10:59:16] I connected to the BMC's webUI and powered it on
[10:59:38] ack. I have found the webui to be a bit more reliable than the SSH interface in the past
[11:00:51] but even the boot state via Redfish API was ok with the powercycle
[11:01:11] anyway, this is a very custom supermicro host, the combination of the two leads to these situations :D
[11:15:26] a little better now, but it is stuck in initialization of hw
[11:15:29] lovely
[11:29:04] nope, no luck, will try again later on
[11:37:26] (it is just horribly slow to reboot sometimes)
[12:01:52] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11419023 (akosiaris) Thanks for the nice discussion everyone. Overall, I think with the suggestion of building images on a dedicated ML machine and with the precautions discussed, we ar...
[12:27:46] (CR) Nik Gkountas: [C:+2] Support checking collection membership by language and titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1212678 (https://phabricator.wikimedia.org/T408845) (owner: Sbisson)
[12:29:34] (Merged) jenkins-bot: Support checking collection membership by language and titles [research/recommendation-api] - https://gerrit.wikimedia.org/r/1212678 (https://phabricator.wikimedia.org/T408845) (owner: Sbisson)
[12:59:19] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11419191 (achou) Open→Resolved a:achou Thanks for everyone's help. This task is resolved. :)
[12:59:57] Machine-Learning-Team: Configure Lift Wing isvc Integration with Cassandra - https://phabricator.wikimedia.org/T409414#11419197 (achou) a:achou→DPogorzelski-WMF
[13:04:31] Machine-Learning-Team: Build and push images to the docker registry from ml-lab - https://phabricator.wikimedia.org/T394778#11419204 (achou) a:DPogorzelski-WMF
[13:26:36] ml-serve1013 is ready to become a new k8s worker :)
[13:31:14] Machine-Learning-Team, Data-Engineering, serviceops: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11419269 (achou)
[13:31:44] Machine-Learning-Team, Data-Engineering, serviceops: Enable ChangeProp to consume mediawiki.page_content_change.v1 - https://phabricator.wikimedia.org/T409469#11419273 (achou) Open→Declined
[15:58:42] sorry to be late but. ack. i almost never visit irc :)
[16:51:29] Machine-Learning-Team, Essential-Work: Upgrade AMD GPU + torch version of ML Labs machines - https://phabricator.wikimedia.org/T410663#11420185 (achou)
[21:37:44] FIRING: LiftWingServiceErrorRate: ...
[21:37:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[21:57:44] RESOLVED: LiftWingServiceErrorRate: ...
[21:57:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate