[07:11:47] (PS1) Kevin Bazira: policy-violation: add configurable disable_custom_all_reduce flag to gpt model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1266857 (https://phabricator.wikimedia.org/T418350)
[07:19:56] (CR) Ozge: [C:+1] policy-violation: add configurable disable_custom_all_reduce flag to gpt model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1266857 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[07:23:54] (CR) Kevin Bazira: [C:+2] policy-violation: add configurable disable_custom_all_reduce flag to gpt model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1266857 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[07:25:05] (Merged) jenkins-bot: policy-violation: add configurable disable_custom_all_reduce flag to gpt model-server [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1266857 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira)
[07:41:15] Machine-Learning-Team, Data-Engineering, Data-Engineering-Radar, Event-Platform: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892#11780933 (gkyziridis) === Update === The [[ https://gerrit.wikimedia.org/r/plugins/g...
[07:56:49] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment revertrisk-multilingual-predictor-00004-deployment in revertrisk at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:01:44] FIRING: LiftWingServiceErrorRate: ...
[08:01:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[08:31:49] RESOLVED: [2x] KubernetesDeploymentUnavailableReplicas: Deployment revertrisk-multilingual-predictor-00004-deployment in revertrisk at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:41:44] RESOLVED: LiftWingServiceErrorRate: ...
[08:41:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:10:44] FIRING: LiftWingServiceErrorRate: ...
[09:10:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[09:35:44] RESOLVED: LiftWingServiceErrorRate: ...
[09:35:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-multilingual-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:41:44] FIRING: LiftWingServiceErrorRate: ...
[11:41:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-wikidata-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:51:44] RESOLVED: LiftWingServiceErrorRate: ...
[11:51:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-wikidata-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:25:44] FIRING: LiftWingServiceErrorRate: ...
[13:25:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-wikidata-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[13:34:56] Machine-Learning-Team, OKR-Work: Add 'iommu=pt' kernel parameter on MI300x nodes for direct GPU-to-GPU communication (PCIe P2P) - https://phabricator.wikimedia.org/T421461#11782388 (MoritzMuehlenhoff) >>! In T421461#11762894, @elukey wrote: > I did some reading and my understanding is that with `iommu=pt...
[13:40:44] RESOLVED: LiftWingServiceErrorRate: ...
[13:40:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revertrisk&var-backend=revertrisk-wikidata-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:22:58] Machine-Learning-Team, Data-Engineering, Data-Engineering-Radar, Event-Platform: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892#11782975 (gkyziridis) === Update === We reverted the changes on production becau...
[15:24:02] Machine-Learning-Team, Data-Engineering, Data-Engineering-Radar, Event-Platform: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892#11782977 (gkyziridis) a:gkyziridis
[17:35:10] Machine-Learning-Team, Data-Engineering, Data-Engineering-Radar, Event-Platform: Add Multilingual RevertRisk predictions to mediawiki.page_revert_risk_prediction_change - https://phabricator.wikimedia.org/T415892#11783778 (Ottomata) > changeprop errors Weird! These indeed look like some logging...