[00:01:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment recommendation-api-ng-main in recommendation-api-ng at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[04:01:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment recommendation-api-ng-main in recommendation-api-ng at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[05:28:59] Lift-Wing, Machine-Learning-Team, Patch-For-Review: [LLM] Use vllm for ROCm in huggingface image - https://phabricator.wikimedia.org/T370149#10360373 (achou) I tried to follow the instructions [[ https://docs.vllm.ai/en/latest/getting_started/amd-installation.html#build-from-source-rocm | here ]] and...
[05:59:44] (CR) Santhosh: [C:+2] Use sitematrix and interwiki map to properly find dbname for links [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098124 (https://phabricator.wikimedia.org/T380838) (owner: Nik Gkountas)
[06:00:08] (CR) Santhosh: [C:-1] "This patch is no longer required right?" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098090 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[06:01:16] (Merged) jenkins-bot: Use sitematrix and interwiki map to properly find dbname for links [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098124 (https://phabricator.wikimedia.org/T380838) (owner: Nik Gkountas)
[06:34:41] (PS6) Santhosh: Cache update: skip iw links already discovered through wikidata [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098115 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[06:35:39] (CR) Santhosh: "Rebased and moved the dbname calculation close to the API call." [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098115 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[06:35:43] (CR) Santhosh: [C:+2] Cache update: skip iw links already discovered through wikidata [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098115 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[06:36:01] (PS2) Sbisson: Let the page collection cache return [] when empty [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098139 (https://phabricator.wikimedia.org/T380838)
[06:36:05] (CR) Santhosh: [C:+2] Let the page collection cache return [] when empty [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098139 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[06:36:34] (Merged) jenkins-bot: Cache update: skip iw links already discovered through wikidata [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098115 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[06:37:36] (PS3) Sbisson: Filter out articles in other NS for IW links [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098130
[06:37:44] (Merged) jenkins-bot: Let the page collection cache return [] when empty [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098139 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[06:40:22] (CR) Santhosh: [C:+2] Filter out articles in other NS for IW links [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098130 (owner: Sbisson)
[06:41:13] (Merged) jenkins-bot: Filter out articles in other NS for IW links [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098130 (owner: Sbisson)
[06:56:11] (PS1) Santhosh: get_articles_by_titles: Skip if dbname does not exist [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098412 (https://phabricator.wikimedia.org/T380838)
[07:00:12] (Merged) jenkins-bot: get_articles_by_titles: Skip if dbname does not exist [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098412 (https://phabricator.wikimedia.org/T380838) (owner: Santhosh)
[07:17:50] Machine-Learning-Team, Patch-For-Review: Run unit tests for the inference-services repo in CI - https://phabricator.wikimedia.org/T360120#10360463 (kevinbazira)
[07:19:44] (PS1) Kevin Bazira: test: update nsfw predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098420 (https://phabricator.wikimedia.org/T360120)
[08:01:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment recommendation-api-ng-main in recommendation-api-ng at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:02:15] (PS1) Kevin Bazira: test: update ores-legacy test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098464 (https://phabricator.wikimedia.org/T360120)
[08:23:29] hello!
[08:55:24] (CR) Nikerabbit: Filter out articles in other NS for IW links (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098130 (owner: Sbisson)
[09:50:14] (CR) Ilias Sarantopoulos: [C:+2] test: update nsfw predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098420 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[09:50:44] (CR) Ilias Sarantopoulos: [V:+2 C:+2] test: update nsfw predictor test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098420 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[09:52:35] is anybody else having issues with gerrit or is it just me?
[09:53:17] o/ gerrit seems fine to me!
[09:55:43] ack. it isn't playing nice with me, taking too long to load
[09:56:22] can't even pull changes
[09:56:31] anyway thanks for checking, I'll recheck in a bit
[09:57:21] okok ... might be transient
[09:58:28] (PS2) Kevin Bazira: test: update ores-legacy test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098464 (https://phabricator.wikimedia.org/T360120)
[10:14:55] now it works again :)
[10:15:28] (CR) Ilias Sarantopoulos: [C:+1] test: update ores-legacy test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098464 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[10:24:19] klausman: Guten tag o/ . any news on the sudo nvtop topic?
[10:27:14] (PS1) Nikerabbit: Fetcher: small fixes to comments and variable names [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098476
[10:27:30] I am hacking away at it as we speak, should be ready in a few min
[10:27:41] ok, thank youuu
[10:38:14] (CR) Kevin Bazira: [C:+2] test: update ores-legacy test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098464 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[10:38:57] (Merged) jenkins-bot: test: update ores-legacy test image to support latest ci tests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1098464 (https://phabricator.wikimedia.org/T360120) (owner: Kevin Bazira)
[11:02:43] thanks for the reviews, Ilias! going to deploy article-country on LW staging
[11:03:35] ack!
[11:09:35] article-country is up and running on staging: https://phabricator.wikimedia.org/P71212
[11:20:46] nice!
[11:54:13] * klausman lunch
[11:58:52] (CR) Nik Gkountas: [C:+2] Fetcher: small fixes to comments and variable names [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098476 (owner: Nikerabbit)
[11:59:33] (Merged) jenkins-bot: Fetcher: small fixes to comments and variable names [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098476 (owner: Nikerabbit)
[12:01:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment recommendation-api-ng-main in recommendation-api-ng at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[12:32:22] Deploying rec-api in a few minutes, hopefully it will solve some issues we're facing including unavailable replicas. Fingers crossed!
[12:36:03] ack!
[12:36:18] I'm going afk for lunch but will be around in case you need something
[12:39:28] Thanks!
[12:47:53] Looks like deployment is stuck in the staging. Still not finished with staging :/
[12:50:41] and it failed..
[13:08:25] On it ^^ Need to update LANGUAGE_PAIRS_API
[13:21:32] Another attempt, another failure :/
[13:34:16] kart_: o/ how did it fail? were you able to check the pod logs?
[13:36:25] Yes. Pod logs were useful. We found the issue.
[13:37:19] `File "/app/recommendation/main.py", line 7, in <module>
[13:37:19] import psutil
[13:37:19] ModuleNotFoundError: No module named 'psutil'`
[13:41:16] ah okok super
[13:41:45] (Abandoned) Sbisson: Minimal fix for zh-min-nan site name [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098090 (https://phabricator.wikimedia.org/T380838) (owner: Sbisson)
[13:49:44] FIRING: LiftWingServiceErrorRate: ...
[13:49:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:54:44] RESOLVED: LiftWingServiceErrorRate: ...
[13:54:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:00:41] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10361552 (isarantopoulos)
[14:02:16] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10361556 (isarantopoulos)
[14:13:10] (PS1) Nik Gkountas: Update LANGUAGE_PAIRS_API configuration parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098515
[14:14:15] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10361621 (MunizaA)
[14:15:09] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10361624 (isarantopoulos) I also tried to build flash attention on ml-lab and came to a similar conclusion: `hipcc -v` fails with the following error: ` Can't exec "/opt/roc...
[14:16:57] (PS1) Nik Gkountas: Include psutil in dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098520
[14:17:56] (CR) CI reject: [V:-1] Include psutil in dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098520 (owner: Nik Gkountas)
[14:21:22] (CR) Sbisson: [C:+2] Update LANGUAGE_PAIRS_API configuration parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098515 (owner: Nik Gkountas)
[14:22:00] (Merged) jenkins-bot: Update LANGUAGE_PAIRS_API configuration parameter [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098515 (owner: Nik Gkountas)
[14:24:23] (PS2) Nik Gkountas: Include psutil in dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098520
[14:28:31] (CR) Sbisson: [C:+2] Include psutil in dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098520 (owner: Nik Gkountas)
[14:29:10] (Merged) jenkins-bot: Include psutil in dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098520 (owner: Nik Gkountas)
[14:48:12] Deploy rec-api -- let's see how it goes now!
[14:49:44] I'm switching ml-etcd1003 to move it to a new ganeti node, latencies go up for a bit
[14:50:28] ack for both!
[14:52:01] Lift-Wing, Machine-Learning-Team, Patch-For-Review: [LLM] Use vllm for ROCm in huggingface image - https://phabricator.wikimedia.org/T370149#10361833 (isarantopoulos)
[14:53:24] Staging seems happy for rec-api..
[14:56:44] FIRING: LiftWingServiceErrorRate: ...
[14:56:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:01:44] RESOLVED: LiftWingServiceErrorRate: ...
[15:01:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=recommendation-api-ng&var-backend=recommendation-api-ng-main.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:01:49] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment recommendation-api-ng-main in recommendation-api-ng at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[15:02:06] Lift-Wing, Machine-Learning-Team: [LLM] Allow loading model weights as int8 with HF - https://phabricator.wikimedia.org/T377848#10361882 (isarantopoulos)
[15:03:39] Seems unavailable replicas -- resolved!
[15:05:22] ml-etcd1003 is back to normal
[15:06:04] merci!
[15:34:50] (CR) Nikerabbit: "v1 is deprecated and to be removed soon. Can you migrate to v2?" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098515 (owner: Nik Gkountas)
[15:46:50] https://logstash.wikimedia.org/app/dashboards#/view/d6685be0-ab33-11ef-9f7a-bbecdce3b972?_g=h@42b0d52&_a=h@dc1fc86 -- rec-api logs seem empty - checking what's going on.
[16:17:58] isaranto: I've already done the simple symlink bit on 1001, in case you want to test before we meet tomorrow
[16:22:47] (PS1) Sbisson: Use cx-server language pairs API v2 [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098555
[16:40:09] thanks! symlinking worked but still get issues for hipcc. I'll update the task and we can chat tomorrow
[16:44:14] :+1:
[16:45:08] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10362517 (MunizaA) >>! In T371344#10361276, @MunizaA wrote: > > Also note that the output says `HIP version : 5.2.21153-0` but I would've expected it to be something like `...
[16:50:05] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10362534 (isarantopoulos) Tobias has added symlinks pointing to clang-17 (@klausman I see that you have added symlinks for clang-14 not clang-15) and we get a different error...
[16:53:09] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10362571 (klausman) >>! In T371344#10362533, @isarantopoulos wrote: > Tobias has added symlinks pointing to clang-17 (@klausman I see that you have added symlinks for clang-1...
[17:05:11] kart_: I found this dashboard that works https://logstash.wikimedia.org/goto/95a46285f2b829d7e12a48ea59ff19f8
[17:05:39] the Apps logs ECS one isn't working (I don't know who maintains it)
[17:05:50] at least I can see all the logs in the new one
[17:10:18] Yes. I created similar dashboard, will update some visualizations tomorrow to fix some data.
[17:10:44] And, then - will add it in the Dashboard panel, so we can find it easily!
[17:17:54] thanks!
[17:18:07] going afk folks, cu tomorrow!
[18:01:24] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10362864 (MunizaA) It looks you can override just invocations of `nvcc` or `hipcc` without overriding invocations of g++ or clang++ when building extensions (which is what `C...
[18:58:16] (PS1) Sbisson: Extra logging for cache debugging [research/recommendation-api] - https://gerrit.wikimedia.org/r/1098584
[19:47:30] Lift-Wing, Machine-Learning-Team, OKR-Work: Request to host article-country model on Lift Wing - https://phabricator.wikimedia.org/T371897#10363540 (Isaac) Thanks @kevinbazira! Some feedback but this is a big improvement and thanks for implementing the logic: * I get `{"error":"ValueError : min() arg...
[22:56:46] Machine-Learning-Team, Patch-For-Review: [LLM] Use Flash attention 2 for GPU inference - https://phabricator.wikimedia.org/T371344#10364099 (MunizaA) > it isn't done building yet but the build has started successfully. This finally finished running. ` (flash-env) mnz@ml-lab1001:~/scratch/flash-attn-2$ p...