[09:02:48] morning!
[09:43:44] Machine-Learning-Team, Research, Temporary accounts: Implement support for temporary accounts in revertrisk models - https://phabricator.wikimedia.org/T376116 (kostajh) NEW
[09:45:34] Machine-Learning-Team, Moderator-Tools-Team, Research, Temporary accounts, Trust and Safety Product Team: RevertRisk model readiness for temporary accounts - https://phabricator.wikimedia.org/T352839#10190821 (kostajh) >>! In T352839#9897693, @MunizaA wrote: > @kostajh Liftwing is now run...
[09:47:07] Machine-Learning-Team, Research, Temporary accounts: Implement support for temporary accounts in revertrisk models - https://phabricator.wikimedia.org/T376116#10190823 (kostajh) Open→Stalled I've filed this task based on the comment from T352839#9897693. Marking as stalled, as we don't yet ha...
[10:24:17] Good morning! :)
[10:56:16] * klausman lunch
[16:29:09] (PS1) AikoChou: locust: update load testing result for reference_quality [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1077069
[17:16:44] FIRING: LiftWingServiceErrorRate: ...
[17:16:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=enwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:16:56] Hi folks! ores-legacy is issuing 504s with `ERROR LiftWing call for model damaging and rev-id <> returned 504 with message upstream request timeout`
[17:23:37] ml-team ^^
[17:26:44] RESOLVED: [2x] LiftWingServiceErrorRate: LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[17:34:15] FIRING: ORESFetchScoreJobKafkaLag: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[17:34:15] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-codfw&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[17:50:13] klausman, chrisalbon: ^^
[19:24:15] RESOLVED: ORESFetchScoreJobKafkaLag: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[19:24:15] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-codfw&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[19:24:59] Just as I got home it seems to have stopped firing. I'll investigate tomorrow.
[22:23:43] hello folks o/
[22:24:33] I see that there was a large spike in preprocessing times
[22:31:55] and this resulted in a bunch of timeouts, mostly between 16:30 and 17:30 UTC. It started to improve and by 19:00 UTC it was resolved
[22:32:01] auto-resolved
[22:32:10] https://grafana.wikimedia.org/goto/z911XSkHR?orgId=1
[22:34:00] checking the logs, I see it is the standard issue we have with revscoring preprocessing in `get_revscoring_extractor_cache`
[22:40:54] I added an update about the incident to the related task https://phabricator.wikimedia.org/T363336#10194199
[22:43:01] I suggest that if it happens again we enable MP for enwiki-damaging.
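For the 504s reported around 17:16 (ores-legacy timing out on the damaging model), one way to reproduce outside ores-legacy is to hit the model directly. A minimal sketch, assuming the public LiftWing endpoint documented on Wikitech; the internal endpoint ores-legacy actually calls is different, and the rev_id is illustrative:

```python
# Minimal probe of the enwiki-damaging model; a 504 during the incident
# window would match the "upstream request timeout" seen by ores-legacy.
# Endpoint per the public LiftWing docs on Wikitech; rev_id is illustrative.
import requests

URL = "https://api.wikimedia.org/service/lw/inference/v1/models/enwiki-damaging:predict"

resp = requests.post(URL, json={"rev_id": 1234567}, timeout=30)
print(resp.status_code)
print(resp.text)
```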
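On the last point, a minimal sketch of what enabling MP could look like, assuming "MP" means multiprocessing, i.e. moving the blocking `get_revscoring_extractor_cache` work off the event loop into a worker process. The names below are stand-ins, not the actual inference-services code:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def extract_features(rev_id: int) -> dict:
    # Hypothetical stand-in for the blocking get_revscoring_extractor_cache()
    # step; in reality this fetches revision data and builds the feature cache.
    return {"rev_id": rev_id, "features": []}


async def preprocess(executor: ProcessPoolExecutor, rev_id: int) -> dict:
    loop = asyncio.get_running_loop()
    # The blocking extraction runs in a worker process, so the event loop that
    # serves other enwiki-damaging requests is not stalled by a slow revision.
    return await loop.run_in_executor(executor, extract_features, rev_id)


async def main() -> None:
    with ProcessPoolExecutor(max_workers=2) as executor:
        print(await preprocess(executor, 1234567))


if __name__ == "__main__":
    asyncio.run(main())
```

A slow extraction then only ties up one worker instead of every request queued behind it, which is roughly how the preprocessing-time spike turned into gateway timeouts.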