[02:23:54] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:24:47] (CR) KartikMistry: [C:+2] Cache update: randomize sleep time after failure [research/recommendation-api] - https://gerrit.wikimedia.org/r/1240745 (owner: Sbisson)
[05:26:39] (Merged) jenkins-bot: Cache update: randomize sleep time after failure [research/recommendation-api] - https://gerrit.wikimedia.org/r/1240745 (owner: Sbisson)
[05:27:23] (CR) KartikMistry: [C:+2] "Done" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1240745 (owner: Sbisson)
[05:28:11] (PS1) KartikMistry: Update dependencies [research/recommendation-api] - https://gerrit.wikimedia.org/r/1245360
[06:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:59:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[07:59:49] Deployment revertrisk-multilingual-predictor-00003-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[07:59:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-00003-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:09:30] Machine-Learning-Team, Semantic Search, OKR-Work: Migrate embeddings inference service from Transformers+FA2 to vLLM - https://phabricator.wikimedia.org/T418976 (kevinbazira) NEW
[08:39:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[08:39:49] Deployment revertrisk-multilingual-predictor-00003-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[08:39:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-00003-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:19:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:19:49] Deployment revertrisk-multilingual-predictor-00003-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[09:19:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-00003-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:24:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[09:24:49] Deployment revertrisk-multilingual-predictor-00003-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[09:24:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-00003-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:59:56] (PS1) Kevin Bazira: embeddings: migrate inference backend from transformers+fa2 to vLLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1247945 (https://phabricator.wikimedia.org/T418976)
[10:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:44:06] (PS7) Bartosz Wójtowicz: article-topics: Add outlink cache adapter for outlink-topic-model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493)
[11:56:49] (CR) AikoChou: [C:+1] "LGTM! I tested it locally and it works well 😄 I have a couple questions, but they shouldn't block staging testing. Feel free to merge!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz)
[12:07:19] (CR) Bartosz Wójtowicz: [C:+2] "Thank you for the review!! I'll move to testing on staging (and testing if the build-publish works :D)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz)
[12:33:36] (CR) Bartosz Wójtowicz: [C:+2] "CI re-check" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz)
[12:46:09] (CR) Ozge: embeddings: migrate inference backend from transformers+fa2 to vLLM (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1247945 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[12:50:40] (CR) Ozge: embeddings: migrate inference backend from transformers+fa2 to vLLM (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1247945 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[13:05:05] (CR) Ozge: embeddings: migrate inference backend from transformers+fa2 to vLLM (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1247945 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[13:05:58] Machine-Learning-Team: Reduce logstash logs from machine learning infra - https://phabricator.wikimedia.org/T416384#11673023 (elukey) I think that the best course of action is to split logs by namespace: * `istio-system` - https://logstash.wikimedia.org/goto/2b96a9c732952692351003f0b5229bd7 - is likely some...
[13:19:27] (PS8) Bartosz Wójtowicz: article-topics: Add outlink cache adapter for outlink-topic-model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493)
[13:19:50] (CR) Bartosz Wójtowicz: article-topics: Add outlink cache adapter for outlink-topic-model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz)
[13:24:52] (PS9) Bartosz Wójtowicz: article-topics: Add outlink cache adapter for outlink-topic-model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493)
[13:26:09] ^ trying to trigger jenkins-bot on the patch, but it doesn't seem interested..
[13:39:56] Machine-Learning-Team, Patch-For-Review: Reduce logstash logs from machine learning infra - https://phabricator.wikimedia.org/T416384#11673202 (elukey) Tested staging and the knative traffic volume dropped: {F72503325}
[13:49:10] (PS10) Bartosz Wójtowicz: article-topics: Add outlink cache adapter for outlink-topic-model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493)
[13:50:35] bartosz: usually there will be two pipelines in .pipeline/config.yaml, one for test and build, one for publish. I only see the publish one. But I also see some models don't have two.. like the embedding and policy-violation
[13:51:31] aiko: yess, that was the intention to have only publish like other models. the pipeline also worked well, last time today at 11:54 and further patchsets changed only README
[13:53:11] I think this might be a general jenkins-bot problem, I also see similar behaviour across other patches here https://gerrit.wikimedia.org/r/dashboard/75
[13:55:11] I see. probably we just need to wait for a little
[13:55:22] yeah, this seems to be discussed in #wikimedia-releng
[13:59:21] (PS11) Bartosz Wójtowicz: article-topics: Add outlink cache adapter for outlink-topic-model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493)
[14:14:48] (CR) Kevin Bazira: embeddings: migrate inference backend from transformers+fa2 to vLLM (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1247945 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[14:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:30:28] Machine-Learning-Team, Patch-For-Review: Reduce logstash logs from machine learning infra - https://phabricator.wikimedia.org/T416384#11673483 (elukey) Knative should be good now: {F72504419} For the kserve controller we sadly cannot do much from what I can see, since we'd need https://github.com/kser...
[14:44:42] Machine-Learning-Team: Reduce logstash logs from machine learning infra - https://phabricator.wikimedia.org/T416384#11673555 (elukey) @DPogorzelski-WMF @klausman we already have kserve 0.13 in production-images, so in theory we could simply upgrade the control plane + helm chart to include the above commit a...
[14:45:54] (CR) Ozge: [C:+1] embeddings: migrate inference backend from transformers+fa2 to vLLM (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1247945 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[14:49:00] (CR) Kevin Bazira: [C:+2] embeddings: migrate inference backend from transformers+fa2 to vLLM (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1247945 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[14:50:15] (Merged) jenkins-bot: embeddings: migrate inference backend from transformers+fa2 to vLLM [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1247945 (https://phabricator.wikimedia.org/T418976) (owner: Kevin Bazira)
[15:03:12] (CR) Bartosz Wójtowicz: [C:+2] article-topics: Add outlink cache adapter for outlink-topic-model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz)
[15:06:56] (Merged) jenkins-bot: article-topics: Add outlink cache adapter for outlink-topic-model [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1245307 (https://phabricator.wikimedia.org/T418493) (owner: Bartosz Wójtowicz)
[15:35:08] Machine-Learning-Team, Semantic Search, OKR-Work: Migrate embeddings inference service from Transformers+FA2 to vLLM - https://phabricator.wikimedia.org/T418976#11673829 (kevinbazira) The embeddings model-server's inference backend has been migrated from transformers+fa2 to vLLM 0.14. I've tested the...
[17:12:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[17:12:49] Deployment revertrisk-multilingual-predictor-00003-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[17:12:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-00003-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[17:30:14] Lift-Wing, Documentation: Revscoring model cards should link to and be linked from Lift Wing API docs - https://phabricator.wikimedia.org/T419037 (TBurmeister) NEW
[17:31:48] Lift-Wing, Machine-Learning-Team, Documentation: Revscoring model cards should link to and be linked from Lift Wing API docs - https://phabricator.wikimedia.org/T419037#11674456 (TBurmeister)
[17:34:43] Machine-Learning-Team: kserve helm status is broken across ml clusters - https://phabricator.wikimedia.org/T419040 (elukey) NEW
[17:36:35] Lift-Wing, Machine-Learning-Team, Documentation: Revscoring model cards should link to and be linked from Lift Wing API docs - https://phabricator.wikimedia.org/T419037#11674479 (TBurmeister) Note: This task is related to T406369, but it's TBD whether any of that work is blocked by these model card c...
[17:57:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[17:57:49] Deployment revertrisk-multilingual-predictor-00003-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[17:57:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-00003-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[18:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[20:22:29] (PS7) Eamedina: Update section suggestion fetching to request multiple at once [research/recommendation-api] - https://gerrit.wikimedia.org/r/1217170 (owner: Nik Gkountas)
[20:40:34] (CR) Eamedina: Update section suggestion fetching to request multiple at once (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1217170 (owner: Nik Gkountas)
[21:24:34] Machine-Learning-Team, ORES, PHP 8.5 support: PHP 8.5 CI failure in ORES: "Using null as an array offset is deprecated, use an empty string instead" - https://phabricator.wikimedia.org/T419071 (Jdforrester-WMF) NEW
[21:30:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[21:30:49] Deployment revertrisk-multilingual-predictor-00003-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[21:30:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-00003-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:35:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[21:35:49] Deployment revertrisk-multilingual-predictor-00003-deployment in revertrisk at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[21:35:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revertrisk&var-deployment=revertrisk-multilingual-predictor-00003-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[22:24:13] FIRING: [3x] SLOMetricAbsent: revertrisk-la-availability - https://slo.wikimedia.org/?search=revertrisk-la-availability - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:57:10] Machine-Learning-Team, ORES, PHP 8.5 support: PHP 8.5 CI failure in ORES: "Using null as an array offset is deprecated, use an empty string instead" - https://phabricator.wikimedia.org/T419071#11675950 (Jdforrester-WMF)