[00:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[00:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[01:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[01:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[02:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[02:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[03:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[03:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[04:00:22] 06Machine-Learning-Team, 10EditCheck, 10Editing-team (Tracking): Build Tone Check Model feedback-based retraining pipeline - https://phabricator.wikimedia.org/T393103#11007512 (10ppelberg)
[04:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[04:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[04:55:07] ^--- looking
[05:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[05:17:21] problem reported in alert:
[05:17:21] ```
[05:17:21] prometheus "ops" at http://127.0.0.1:9900/ops has "kafka_burrow_partition_lag" metric but doesn't currently have series matching {group="cpjobqueue-ORESFetchScoreJob"}, such series was last present 1d1h ago
[05:17:22] ```
[05:19:43] the summary shows: `Linting problems found for ORESFetchScoreJobKafkaLag`
[05:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[06:13:49] managed to dig up the alert that's firing:
https://github.com/wikimedia/operations-alerts/blob/1304562d50c3b85c8babf92233dbb497294c4ce7/team-ml/ores_extension_test.yaml#L12
[06:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[06:16:02] (03CR) 10AikoChou: "Kevin, thank you for working on this! Great job on separating file loading and parameter checking. That's definitely more efficient." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1169195 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[06:28:26] 06Machine-Learning-Team: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683 (10kevinbazira) 03NEW
[06:35:05] I have created a phab task to fix this `AlertLintProblem` for `ORESFetchScoreJobKafkaLag`:
[06:35:05] https://phabricator.wikimedia.org/T399683
[06:35:05] elukey: isaranto: o/ I have added you to this --^ task since you edited this alert about 2 years ago and might know a fix.
[06:35:38] source: https://github.com/wikimedia/operations-alerts/blame/1304562d50c3b85c8babf92233dbb497294c4ce7/team-ml/ores_extension_test.yaml#L12
[06:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[06:46:38] hello! I'm here now
[06:47:40] good morning
[06:48:28] what a great night of alerts :D
[06:50:10] I'm taking a look and will ping you in the next 30'
[07:04:25] okok
[07:05:02] in case you missed it in the noise, here is the task: https://phabricator.wikimedia.org/T399683
[07:08:14] good morning folks
[07:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[07:19:41] 06Machine-Learning-Team: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683#11007922 (10kevinbazira)
[07:32:14] the expression causing the ores extension alert: https://github.com/wikimedia/operations-alerts/blob/1304562d50c3b85c8babf92233dbb497294c4ce7/team-ml/ores_extension.yaml#L8
[07:32:47] 06Machine-Learning-Team: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683#11007989 (10elukey) If you check [[ https://prometheus-codfw.wikimedia.org/ops/graph?g0.expr=kafka_burrow_partition_lag%7Bgroup%3D%22cpjobqueue-ORESFetchScoreJob%22%7D&g0.tab=0&g0.stacked=0&g0....
[07:32:54] hey folks, I left a comment in the task
[07:36:01] thank you for the context elukey. would the trick you mentioned `# deploy-tag: global`:
[07:36:01] https://github.com/wikimedia/operations-alerts/blob/1304562d50c3b85c8babf92233dbb497294c4ce7/team-sre/nel.yaml#L1
[07:36:01] replace both `# deploy-tag: ops` and `# deploy-site: eqiad, codfw`:
[07:36:01] https://github.com/wikimedia/operations-alerts/blob/1304562d50c3b85c8babf92233dbb497294c4ce7/team-ml/ores_extension.yaml#L1C1-L2C28
[07:37:59] it would replace deploy-site, the deploy-tag may not be correct (do we need ml instead of ops?)
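For context, the lint warning quoted at 05:17 means the per-site "ops" Prometheus that evaluates the rule still knows the `kafka_burrow_partition_lag` metric name but currently has no series matching `{group="cpjobqueue-ORESFetchScoreJob"}`; per the discussion this affects codfw. A minimal sketch of checking this per site via the Prometheus HTTP `/api/v1/query` endpoint is below; the codfw base URL mirrors the graph link in the task comment above, the eqiad URL is assumed by analogy, and network access to those hosts from wherever the script runs is also an assumption.

```python
import requests

# Per-site Prometheus "ops" instances. The codfw URL matches the graph link in the
# Phabricator comment above; the eqiad URL is assumed by analogy and may need adjusting.
PROMETHEUS_SITES = {
    "eqiad": "https://prometheus-eqiad.wikimedia.org/ops",
    "codfw": "https://prometheus-codfw.wikimedia.org/ops",
}

# The series selector the ORESFetchScoreJobKafkaLag alert expression uses.
QUERY = 'count(kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"})'

for site, base_url in PROMETHEUS_SITES.items():
    resp = requests.get(f"{base_url}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty result means the selector matches no series right now, which is
    # exactly what the alert linter is flagging.
    if result:
        print(f"{site}: {result[0]['value'][1]} matching series")
    else:
        print(f"{site}: no series match the selector")
```

This also illustrates why moving the rule to a global (Thanos) view, as suggested next, would stop the per-site lint complaint: the series only has to exist somewhere, not in each site's local Prometheus.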
[07:38:21] basically the idea is to use thanos for the metric, that has a global view, not eqiad/codfw
[07:38:31] but we should verify what happened to changeprop at around that time
[07:42:19] I don't find anything relevant in the SAL or wikimedia-operations, weird
[07:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[07:44:48] I've been checking for changeprop differences as well
[07:47:20] 06Machine-Learning-Team: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683#11008002 (10isarantopoulos) Just adding the info here from the alert for reference since the alert comes and goes : ` prometheus "ops" at http://127.0.0.1:9900/ops has "kafka_burrow_partition_...
[07:52:24] so the issue is with codfw as Luca raised and there is no metric for the `kafka_burrow_partition_lag` metric (not just for the oresJob label but for all)
[08:05:29] kevinbazira: we could ask in the wikimedia-operations channel. we can also put a silence in the alert when it occurs again with a comment that we are looking into it + phab task mention to avoid all these recurring alerts
[08:06:15] ack! where do we silence these alerts from?
[08:07:46] from https://alerts.wikimedia.org/?q=
[08:08:07] if you are available we can jump on a gmeet and I'll show you
[08:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[08:21:21] we have silenced this alert for 1 day as we investigate the issue
[08:33:27] I have also followed up in the wikimedia-operations channel in case they have any pointers.
[08:35:51] thanks Kevin!
[08:37:03] np!
[09:01:53] isaranto: patch for revertrisk simplewiki/trwiki ready -> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1170092
[09:27:24] kevinbazira: one advice - when you want some info from SRE, it is better to use #wikimedia-sre, less crowded from alarms etc..
[09:27:51] for this case I think it is fine to change the deploy tag
[09:27:53] and that's it
[09:28:18] but a ping to SRE about the missing metrics may be good if anything is ongoing
[10:02:39] elukey: thanks for the pointers. I have also followed up in #wikimedia-sre.
[10:02:40] I have pushed a patch for this issue here: https://gerrit.wikimedia.org/r/1170107
[10:02:40] please review whenever you get a minute.
[10:43:44] FIRING: LiftWingServiceErrorRate: ...
[10:43:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-need-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:07:40] (03PS2) 10Kevin Bazira: RRLA: Validate lang parameter against canonical wikis [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1169195 (https://phabricator.wikimedia.org/T399437)
[11:08:50] looking at the reference-need-predictor alert
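Regarding the one-day silence mentioned at 08:21: it was added through the https://alerts.wikimedia.org/?q= UI, but the same thing can be done against the Alertmanager v2 API. The sketch below is illustrative only; `ALERTMANAGER_API` is a placeholder because the actual API host is not stated in the log, and it assumes the endpoint is reachable and accepts unauthenticated silence creation, which may not hold in this environment.

```python
from datetime import datetime, timedelta, timezone

import requests

# Placeholder: the real Alertmanager API host for this environment is not given in the log.
ALERTMANAGER_API = "https://alertmanager.example.org/api/v2"

now = datetime.now(timezone.utc)
silence = {
    # Match only the alert being investigated.
    "matchers": [
        {"name": "alertname", "value": "AlertLintProblem", "isRegex": False},
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(days=1)).isoformat(),
    "createdBy": "ml-team",
    "comment": "Investigating ORESFetchScoreJobKafkaLag lint problem - T399683",
}

# POST /api/v2/silences returns the ID of the newly created silence.
resp = requests.post(f"{ALERTMANAGER_API}/silences", json=silence, timeout=10)
resp.raise_for_status()
print("created silence", resp.json()["silenceID"])
```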
[11:13:54] (03CR) 10Kevin Bazira: "Thank you Aiko. I have fixed the 3 issues you pointed out." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1169195 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[11:50:25] the reference-need-predictor alert was caused by a spike shown in logstash: https://logstash.wikimedia.org/goto/c558820a21decb143949f8319d3e6fbb in the minute of 10:34
[12:08:08] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11008945 (10OKarakaya-WMF)
[12:10:52] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11008962 (10OKarakaya-WMF)
[12:26:54] 06Machine-Learning-Team, 07Essential-Work, 13Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11009014 (10isarantopoulos)
[12:27:08] 06Machine-Learning-Team, 07Essential-Work, 13Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11009021 (10isarantopoulos)
[12:27:11] 10Lift-Wing, 06Machine-Learning-Team, 07Essential-Work, 13Patch-For-Review: Update revertrisk to kserve 0.15.2 - https://phabricator.wikimedia.org/T383119#11009022 (10isarantopoulos)
[12:33:53] kevinbazira: I see two interesting things: 1) in the logs that you pointed out there are reports like predict_ms: 484540.20357132, that seems a lot 2) From the grafana link in the alert it seems that a lot of connections are ending up in DC state, that should be Downstream Closes (IIRC the client giving up at some point)
[12:34:34] is it a new model? Maybe under testing?
[12:34:59] otherwise a client may be calling it with some features that cause the model to take ages to get inference done
[12:36:23] I don't see throttling in https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&from=now-3h&to=now&timezone=utc&var-datasource=000000026&var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-pod=reference-need-predictor-00012-deployment-7d54649454-2kdnd&var-container=$__all though, that is a good sign
[12:38:22] elukey: the reference-need model was added in Oct 2024: https://phabricator.wikimedia.org/T371902
[12:41:49] kevinbazira: https://logstash.wikimedia.org/goto/7eea6b1f2a67a842377d235383eaae7f
[12:42:35] so I went in the istio dashboard, filtered for ml-serve-eqiad and also for return code "0", that is the DC/client-giving-up code for istio
[12:43:21] it seems matching with what we are seeing, and it is WME related
[12:43:39] (I went in the Istio dashboard since all the traffic goes through Istio before reaching the isvcs)
[12:43:46] (and we have access logs)
[12:44:11] so I'd reach out to them and ask what they are doing
[12:45:25] thank you for looking elukey
[12:45:26] I am not sure the logstash link you shared is showing what you expected
[12:46:40] I tried it and it does, what do you see?
[12:50:10] ok got it. the user agent is: `WME/2.0 (https://enterprise.wikimedia.com/; wme_mgmt@wikimedia.org)`
[12:51:44] drilled down to the hour of 10: https://logstash.wikimedia.org/goto/efdaccbb1f9112eeb474fdb85ef502f8
[12:57:14] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11009126 (10OKarakaya-WMF)
[12:59:56] kevinbazira: you also have the pie chart with the user agents, showing which ones are affected etc..
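The `predict_ms: 484540.20357132` value reported at 12:33 is roughly eight minutes for a single prediction, which would be consistent with clients closing the connection before a response arrives (the istio "DC"/return-code-0 entries). One way to check whether specific revisions are unusually slow, as suggested further down, is to time a predict call directly. The sketch below is a rough illustration only: the `reference-need` model name, the `{"rev_id", "lang"}` payload, and the public Lift Wing endpoint path are assumptions that should be verified against the actual service contract, and the revision IDs shown are hypothetical placeholders for the rev_ids seen in the logstash entries.

```python
import time

import requests

# Assumed endpoint, model name, and payload shape; check the Lift Wing documentation
# before relying on these. Anonymous access to api.wikimedia.org is rate limited.
ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/reference-need:predict"
HEADERS = {"User-Agent": "ml-team-latency-check/0.1"}

# Hypothetical revision IDs; substitute the rev_ids observed in the slow requests.
REVISIONS = [("en", 1234567890), ("en", 1234567891)]

for lang, rev_id in REVISIONS:
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"rev_id": rev_id, "lang": lang},
        headers=HEADERS,
        timeout=600,  # generous, since predict times near 8 minutes were observed
    )
    elapsed = time.perf_counter() - start
    print(f"{lang} rev {rev_id}: HTTP {resp.status_code} in {elapsed:.1f}s")
```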
[13:00:01] WME seems to be the 99%
[13:00:48] yep it's super high 99.93%
[13:02:35] * kevinbazira joining ml tech meeting brb
[13:02:48] kevinbazira: one thing to check is if the rev-ids mentioned in the logs may lead to very long inference time
[13:03:07] just to see if something is really heavy for $reason and we have to improve the model
[13:13:44] RESOLVED: LiftWingServiceErrorRate: ...
[13:13:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-need-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:01:50] * kevinbazira back
[14:03:28] the reference-need-predictor issue seems to have resolved itself
[14:03:28] I will continue investigating the root cause
[15:39:36] 06Machine-Learning-Team: Investigate reference-need-predictor alert triggered by BrokenProcessPool error - https://phabricator.wikimedia.org/T399733 (10kevinbazira) 03NEW
[15:42:37] ^--- I have created a phab task that details the investigation of the reference-need-predictor alert: https://phabricator.wikimedia.org/T399733
[15:47:22] * kevinbazira afk
[15:51:08] 06Machine-Learning-Team: Investigate reference-need-predictor alert triggered by BrokenProcessPool error - https://phabricator.wikimedia.org/T399733#11009928 (10isarantopoulos) This seems to be the same issue described in {T387019}. In the last comment there is a mention about the broken process pool. I wonder...
[16:09:32] 06Machine-Learning-Team: Inputs for tone check model prediction - https://phabricator.wikimedia.org/T397013#11009981 (10achou) 05Open→03Resolved Resolved the ticket. Changes have been deployed!
[23:13:07] 06Machine-Learning-Team, 10Wikilabels: Make a wikilabels view to see a single task. - https://phabricator.wikimedia.org/T208239#11011454 (10Izno)
[23:13:08] 06Machine-Learning-Team, 10Wikilabels: Improve Wikilabels UI - https://phabricator.wikimedia.org/T252280#11011453 (10Izno)
[23:13:10] 06Machine-Learning-Team, 10Wikilabels: Make Wiki Labels mobile compatible - https://phabricator.wikimedia.org/T105518#11011456 (10Izno)