[05:09:32] Machine-Learning-Team, Patch-For-Review: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#10351498 (kevinbazira)
[07:05:53] (PS1) Santhosh: Make sure application is started with initialized cache [research/recommendation-api] - https://gerrit.wikimedia.org/r/1097181 (https://phabricator.wikimedia.org/T380699)
[07:11:06] (PS2) Santhosh: Make sure application is started with initialized cache [research/recommendation-api] - https://gerrit.wikimedia.org/r/1097181 (https://phabricator.wikimedia.org/T380699)
[07:17:59] (PS1) Kevin Bazira: test: update huggingface predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1097185 (https://phabricator.wikimedia.org/T360120)
[08:26:25] buongiorno!
[08:37:22] kalimera :)
[08:44:54] :D
[09:01:43] I played a bit with the new alpha release of bitsandbytes that supports ROCm for quantization https://github.com/bitsandbytes-foundation/bitsandbytes?tab=readme-ov-file#bitsandbytes-multi-backend-alpha-release-is-out
[09:03:46] was able to load models but was getting errors during inference (at least for aya-expanse-32B). I'll test again today and document things on the related task https://phabricator.wikimedia.org/T377848
[09:05:59] Lift-Wing, Machine-Learning-Team: [LLM] Allow loading model weights as int8 with HF - https://phabricator.wikimedia.org/T377848#10351691 (isarantopoulos)
[09:10:57] (PS3) Ilias Sarantopoulos: Revert "llm: remove model-server" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092873
[11:38:13] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092873 (owner: Ilias Sarantopoulos)
[11:39:07] (CR) Ilias Sarantopoulos: [C:+1] test: update huggingface predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1097185 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:39:16] (CR) Ilias Sarantopoulos: [C:+1] test: update articlequality test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1094286 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:47:50] * klausman lunch
[11:55:10] * isaranto also lunch!
[12:09:09] (CR) Ilias Sarantopoulos: [C:+2] Revert "llm: remove model-server" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092873 (owner: Ilias Sarantopoulos)
[12:20:08] (Merged) jenkins-bot: Revert "llm: remove model-server" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092873 (owner: Ilias Sarantopoulos)
[12:51:40] (PS2) Kevin Bazira: test: update articlequality test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1094286 (https://phabricator.wikimedia.org/T360120)
[12:53:03] (CR) Kevin Bazira: [C:+2] test: update articlequality test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1094286 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[12:53:48] (Merged) jenkins-bot: test: update articlequality test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1094286 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[12:58:20] (PS2) Kevin Bazira: test: update huggingface predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1097185 (https://phabricator.wikimedia.org/T360120)
[12:59:34] (CR) Kevin Bazira: [C:+2] test: update huggingface predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1097185 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[13:00:19] (Merged) jenkins-bot: test: update huggingface predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1097185 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[13:26:11] Machine-Learning-Team, Data-Platform, Data-Platform-SRE: Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10352908 (Gehel) p:Triage→Medium
[13:32:41] Machine-Learning-Team, Data-Platform, Data-Platform-SRE (2024.11.09 - 2024.11.29): Create an Airflow instance for ML - https://phabricator.wikimedia.org/T380258#10352939 (Gehel)
[14:01:32] pip install --upgrade pip
[14:01:34] lol
[14:01:48] wrong window - no wonder it didn't work :P
[14:33:08] Machine-Learning-Team, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, and 3 others: Update knative-serving+net-istio to v1.12.x on ML clusters - https://phabricator.wikimedia.org/T380723#10353244 (klausman) V1.16.0 requires Istio 1.20, so I have backed down the version to the v.1.12...
[14:35:05] Machine-Learning-Team, collaboration-services, Data-Platform-SRE, Prod-Kubernetes, and 3 others: Update knative-serving+net-istio to v1.12.x on ML clusters - https://phabricator.wikimedia.org/T380723#10353232 (klausman)
[14:43:16] lollll
[16:00:22] (CR) Ilias Sarantopoulos: [C:+1] "LGTM" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092759 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[16:00:25] (PS2) Kevin Bazira: test: update outlink transformer test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1092759 (https://phabricator.wikimedia.org/T360120)
[16:29:45] hello
[16:30:04] Heyo Chris
[16:30:08] It is thanksgiving week, but I'm around mostly for hiring
[16:54:53] hey Chris o/
[18:00:46] going afk folks, have a nice evening/rest of day!
[20:23:58] (CR) Nik Gkountas: [C:+2] Make sure application is started with initialized cache [research/recommendation-api] - https://gerrit.wikimedia.org/r/1097181 (https://phabricator.wikimedia.org/T380699) (owner: Santhosh)
[20:24:40] (Merged) jenkins-bot: Make sure application is started with initialized cache [research/recommendation-api] - https://gerrit.wikimedia.org/r/1097181 (https://phabricator.wikimedia.org/T380699) (owner: Santhosh)
[21:12:44] FIRING: LiftWingServiceErrorRate: ...
[21:12:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=readability&var-backend=readability-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[21:42:44] RESOLVED: LiftWingServiceErrorRate: ...
[21:42:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=readability&var-backend=readability-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate