[02:20:16] (CR) Abijeet Patro: Smoke test to check deployments (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355 (owner: Sbisson)
[06:44:44] FIRING: LiftWingServiceErrorRate: ...
[06:44:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-draftquality&var-backend=ptwiki-draftquality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[06:49:44] RESOLVED: LiftWingServiceErrorRate: ...
[06:49:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-draftquality&var-backend=ptwiki-draftquality-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[07:52:02] hello o/
[09:47:10] o/
[09:50:04] \o
[10:14:12] I checked the above alert and we weren't able to reach mediawiki api https://logstash.wikimedia.org/goto/82cc1569e2fb8f557c2954e976c15f6e
[10:14:36] I was able to make the same requests so there was no issue with the specific revision
[10:48:04] Lift-Wing, Machine-Learning-Team: Build and Publish ROCm-Compatible Python Packages - https://phabricator.wikimedia.org/T381859#10405584 (isarantopoulos)
[13:27:05] (CR) Eamedina: [C:+1] Smoke test to check deployments [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355 (owner: Sbisson)
[13:42:35] (PS4) Sbisson: Smoke test to check deployments [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355
[13:42:43] (CR) Sbisson: Smoke test to check deployments (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355 (owner: Sbisson)
[13:44:05] (CR) CI reject: [V:-1] Smoke test to check deployments [research/recommendation-api] - https://gerrit.wikimedia.org/r/1103355 (owner: Sbisson)
[14:11:57] (PS1) Ilias Sarantopoulos: llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648
[14:14:42] (PS2) Ilias Sarantopoulos: llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648
[14:15:47] kevinbazira: o/ the intention of the above patch is to try the bitsandbytes package you had built on ml-lab
[14:15:56] iirc we never tested it on lift wing
[14:16:49] isaranto: o/ you mean the flash-attention package?
[14:17:23] yes of course, wrong mention :D
[14:17:57] it's ok, you need to rest :)
[14:18:26] haha
[14:18:44] flashandbytes
[14:18:44] I did test the flash-attention package on ml-lab. let me +1 so we can test it on LW
[14:18:57] (CR) Kevin Bazira: [C:+1] llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648 (owner: Ilias Sarantopoulos)
[14:19:11] thanks, will give it a try on experimental on ml-staging
[14:19:19] okok
[14:25:17] (CR) CI reject: [V:-1] llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648 (owner: Ilias Sarantopoulos)
[14:28:49] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648 (owner: Ilias Sarantopoulos)
[14:48:51] Deploying rec-api..
[14:50:35] ack!
[14:52:32] Machine-Learning-Team, Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review: Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - https://phabricator.wikimedia.org/T380279#10406273 (BTullis) I have performed the first test mount of the cephfs `home` directory t...
[14:52:58] Machine-Learning-Team, Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review: Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - https://phabricator.wikimedia.org/T380279#10406278 (BTullis)
[14:54:37] (PS3) Ilias Sarantopoulos: llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648
[14:59:27] And, done :)
[15:00:13] nice!
[15:12:04] (CR) CI reject: [V:-1] llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648 (owner: Ilias Sarantopoulos)
[15:52:16] (CR) Eamedina: [C:+2] remove support for default collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102356 (https://phabricator.wikimedia.org/T374597) (owner: Nik Gkountas)
[15:52:57] (Merged) jenkins-bot: remove support for default collections [research/recommendation-api] - https://gerrit.wikimedia.org/r/1102356 (https://phabricator.wikimedia.org/T374597) (owner: Nik Gkountas)
[16:05:01] my attempt seems to fail to install it. I'm rebuilding it in an env with torch 2.5.1
[18:32:50] (PS4) Ilias Sarantopoulos: llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648
[18:33:48] (CR) CI reject: [V:-1] llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648 (owner: Ilias Sarantopoulos)
[18:36:22] (PS5) Ilias Sarantopoulos: llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648
[18:42:10] (CR) Ilias Sarantopoulos: [C:+2] llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648 (owner: Ilias Sarantopoulos)
[18:42:55] (Merged) jenkins-bot: llm: update flashattention2 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104648 (owner: Ilias Sarantopoulos)
[19:17:08] (PS1) Ilias Sarantopoulos: llm(fix): transfer inputs on correct device [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104722
[19:18:55] (CR) Ilias Sarantopoulos: [C:+2] llm(fix): transfer inputs on correct device [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104722 (owner: Ilias Sarantopoulos)
[19:19:41] (Merged) jenkins-bot: llm(fix): transfer inputs on correct device [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1104722 (owner: Ilias Sarantopoulos)
[19:35:44] * isaranto afk!
[19:39:50] (PS1) Nik Gkountas: Page collections: Remove the exhausted iterator properly [research/recommendation-api] - https://gerrit.wikimedia.org/r/1104724 (https://phabricator.wikimedia.org/T382278)
[20:33:08] (CR) Sbisson: [C:+2] Page collections: Remove the exhausted iterator properly [research/recommendation-api] - https://gerrit.wikimedia.org/r/1104724 (https://phabricator.wikimedia.org/T382278) (owner: Nik Gkountas)
[20:33:49] (Merged) jenkins-bot: Page collections: Remove the exhausted iterator properly [research/recommendation-api] - https://gerrit.wikimedia.org/r/1104724 (https://phabricator.wikimedia.org/T382278) (owner: Nik Gkountas)