[00:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[00:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[01:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[01:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[02:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[02:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[03:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[03:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[04:00:22] 06Machine-Learning-Team, 10EditCheck, 10Editing-team (Tracking): Build Tone Check Model feedback-based retraining pipeline - https://phabricator.wikimedia.org/T393103#11007512 (10ppelberg)
[04:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[04:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[04:55:07] ^--- looking
[05:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[05:17:21] problem reported in alert:
[05:17:21] ```
[05:17:21] prometheus "ops" at http://127.0.0.1:9900/ops has "kafka_burrow_partition_lag" metric but doesn't currently have series matching {group="cpjobqueue-ORESFetchScoreJob"}, such series was last present 1d1h ago
[05:17:22] ```
[05:19:43] the summary shows: `Linting problems found for ORESFetchScoreJobKafkaLag`
[05:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[06:13:49] managed to dig up the alert that's firing:
https://github.com/wikimedia/operations-alerts/blob/1304562d50c3b85c8babf92233dbb497294c4ce7/team-ml/ores_extension_test.yaml#L12
[06:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[06:16:02] (03CR) 10AikoChou: "Kevin, thank you for working on this! Great job on separating file loading and parameter checking. That's definitely more efficient." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1169195 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[06:28:26] 06Machine-Learning-Team: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683 (10kevinbazira) 03NEW
[06:35:05] I have created a phab task to fix this `AlertLintProblem` for `ORESFetchScoreJobKafkaLag`:
[06:35:05] https://phabricator.wikimedia.org/T399683
[06:35:05] elukey: isaranto: o/ I have added you to this --^ task since you edited this alert about 2 years ago and might know a fix.
[06:35:38] source: https://github.com/wikimedia/operations-alerts/blame/1304562d50c3b85c8babf92233dbb497294c4ce7/team-ml/ores_extension_test.yaml#L12
[06:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[06:46:38] hello! I'm here now
[06:47:40] good morning
[06:48:28] what a great night of alerts :D
[06:50:10] I'm taking a look and will ping you in the next 30'
[07:04:25] okok
[07:05:02] in case you missed it in the noise, here is the task: https://phabricator.wikimedia.org/T399683
[07:08:14] good morning folks
[07:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[07:19:41] 06Machine-Learning-Team: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683#11007922 (10kevinbazira)
[07:32:14] the expression causing the ores extension alert: https://github.com/wikimedia/operations-alerts/blob/1304562d50c3b85c8babf92233dbb497294c4ce7/team-ml/ores_extension.yaml#L8
[07:32:47] 06Machine-Learning-Team: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683#11007989 (10elukey) If you check [[ https://prometheus-codfw.wikimedia.org/ops/graph?g0.expr=kafka_burrow_partition_lag%7Bgroup%3D%22cpjobqueue-ORESFetchScoreJob%22%7D&g0.tab=0&g0.stacked=0&g0....
[07:32:54] hey folks, I left a comment in the task
[07:36:01] thank you for the context elukey. would the trick you mentioned `# deploy-tag: global`:
[07:36:01] https://github.com/wikimedia/operations-alerts/blob/1304562d50c3b85c8babf92233dbb497294c4ce7/team-sre/nel.yaml#L1
[07:36:01] replace both `# deploy-tag: ops` and `# deploy-site: eqiad, codfw`:
[07:36:01] https://github.com/wikimedia/operations-alerts/blob/1304562d50c3b85c8babf92233dbb497294c4ce7/team-ml/ores_extension.yaml#L1C1-L2C28
[07:37:59] it would replace deploy-site, the deploy-tag may not be correct (do we need ml instead of ops?)
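For context, the lint warning quoted at 05:17 means the per-site "ops" Prometheus that evaluates the rule still knows the `kafka_burrow_partition_lag` metric name but currently has no series matching `{group="cpjobqueue-ORESFetchScoreJob"}`; per the discussion this affects codfw. A minimal sketch of checking this per site via the Prometheus HTTP `/api/v1/query` endpoint is below; the codfw base URL mirrors the graph link in the task comment above, the eqiad URL is assumed by analogy, and network access to those hosts from wherever the script runs is also an assumption.

```python
import requests

# Per-site Prometheus "ops" instances. The codfw URL matches the graph link in the
# Phabricator comment above; the eqiad URL is assumed by analogy and may need adjusting.
PROMETHEUS_SITES = {
    "eqiad": "https://prometheus-eqiad.wikimedia.org/ops",
    "codfw": "https://prometheus-codfw.wikimedia.org/ops",
}

# The series selector the ORESFetchScoreJobKafkaLag alert expression uses.
QUERY = 'count(kafka_burrow_partition_lag{group="cpjobqueue-ORESFetchScoreJob"})'

for site, base_url in PROMETHEUS_SITES.items():
    resp = requests.get(f"{base_url}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty result means the selector matches no series right now, which is
    # exactly what the alert linter is flagging.
    if result:
        print(f"{site}: {result[0]['value'][1]} matching series")
    else:
        print(f"{site}: no series match the selector")
```

This also illustrates why moving the rule to a global (Thanos) view, as suggested next, would stop the per-site lint complaint: the series only has to exist somewhere, not in each site's local Prometheus.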
[07:38:21] basically the idea is to use thanos for the metric, that has a global view, not eqiad/codfw
[07:38:31] but we should verify what happened to changeprop at around that time
[07:42:19] I don't find anything relevant in the SAL or wikimedia-operations, weird
[07:44:42] RESOLVED: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[07:44:48] I've been checking for changeprop differences as well
[07:47:20] 06Machine-Learning-Team: Fix AlertLintProblem for ORESFetchScoreJobKafkaLag - https://phabricator.wikimedia.org/T399683#11008002 (10isarantopoulos) Just adding the info here from the alert for reference since the alert comes and goes : ` prometheus "ops" at http://127.0.0.1:9900/ops has "kafka_burrow_partition_...
[07:52:24] so the issue is with codfw as Luca raised and there is no metric for the `kafka_burrow_partition_lag` metric (not just for the oresJob label but for all)
[08:05:29] kevinbazira: we could ask in the wikimedia-operations channel. we can also put a silence in the alert when it occurs again with a comment that we are looking into it + phab task mention to avoid all these recurring alerts
[08:06:15] ack! where do we silence these alerts from?
[08:07:46] from https://alerts.wikimedia.org/?q=
[08:08:07] if you are available we can jump on a gmeet and I'll show you
[08:14:42] FIRING: AlertLintProblem: Linting problems found for ORESFetchScoreJobKafkaLag - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[08:21:21] we have silenced this alert for 1 day as we investigate the issue
[08:33:27] I have also followed up in the wikimedia-operations channel in case they have any pointers.
[08:35:51] thanks Kevin!
[08:37:03] np!
[09:01:53] isaranto: patch for revertrisk simplewiki/trwiki ready -> https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1170092
[09:27:24] kevinbazira: one advice - when you want some info from SRE, it is better to use #wikimedia-sre, less crowded from alarms etc..
[09:27:51] for this case I think it is fine to change the deploy tag
[09:27:53] and that's it
[09:28:18] but a ping to SRE about the missing metrics may be good if anything is ongoing
[10:02:39] elukey: thanks for the pointers. I have also followed up in #wikimedia-sre.
[10:02:40] I have pushed a patch for this issue here: https://gerrit.wikimedia.org/r/1170107
[10:02:40] please review whenever you get a minute.
[10:43:44] FIRING: LiftWingServiceErrorRate: ...
[10:43:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-need-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[11:07:40] (03PS2) 10Kevin Bazira: RRLA: Validate lang parameter against canonical wikis [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1169195 (https://phabricator.wikimedia.org/T399437)
[11:08:50] looking at the reference-need-predictor alert
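Regarding the one-day silence mentioned at 08:21: it was added through the https://alerts.wikimedia.org/?q= UI, but the same thing can be done against the Alertmanager v2 API. The sketch below is illustrative only; `ALERTMANAGER_API` is a placeholder because the actual API host is not stated in the log, and it assumes the endpoint is reachable and accepts unauthenticated silence creation, which may not hold in this environment.

```python
from datetime import datetime, timedelta, timezone

import requests

# Placeholder: the real Alertmanager API host for this environment is not given in the log.
ALERTMANAGER_API = "https://alertmanager.example.org/api/v2"

now = datetime.now(timezone.utc)
silence = {
    # Match only the alert being investigated.
    "matchers": [
        {"name": "alertname", "value": "AlertLintProblem", "isRegex": False},
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(days=1)).isoformat(),
    "createdBy": "ml-team",
    "comment": "Investigating ORESFetchScoreJobKafkaLag lint problem - T399683",
}

# POST /api/v2/silences returns the ID of the newly created silence.
resp = requests.post(f"{ALERTMANAGER_API}/silences", json=silence, timeout=10)
resp.raise_for_status()
print("created silence", resp.json()["silenceID"])
```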
[11:13:54] (03CR) 10Kevin Bazira: "Thank you Aiko. I have fixed the 3 issues you pointed out." [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/1169195 (https://phabricator.wikimedia.org/T399437) (owner: 10Kevin Bazira)
[11:50:25] the reference-need-predictor alert was caused by a spike shown in logstash: https://logstash.wikimedia.org/goto/c558820a21decb143949f8319d3e6fbb in the minute of 10:34
[12:08:08] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11008945 (10OKarakaya-WMF)
[12:10:52] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11008962 (10OKarakaya-WMF)
[12:26:54] 06Machine-Learning-Team, 07Essential-Work, 13Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11009014 (10isarantopoulos)
[12:27:08] 06Machine-Learning-Team, 07Essential-Work, 13Patch-For-Review: Update kserve to 0.15.2 - https://phabricator.wikimedia.org/T367048#11009021 (10isarantopoulos)
[12:27:11] 10Lift-Wing, 06Machine-Learning-Team, 07Essential-Work, 13Patch-For-Review: Update revertrisk to kserve 0.15.2 - https://phabricator.wikimedia.org/T383119#11009022 (10isarantopoulos)
[12:33:53] kevinbazira: I see two interesting things: 1) in the logs that you pointed out there are reports like predict_ms: 484540.20357132, that seems a lot 2) From the grafana link in the alert it seems that a lot of connections are ending up in DC state, that should be Downstream Closes (IIRC the client giving up at some point)
[12:34:34] is it a new model? Maybe under testing?
[12:34:59] otherwise a client may be calling it with some features that cause the model to take ages to get inference done
[12:36:23] I don't see throttling in https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&from=now-3h&to=now&timezone=utc&var-datasource=000000026&var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=revision-models&var-pod=reference-need-predictor-00012-deployment-7d54649454-2kdnd&var-container=$__all though, that is a good sign
[12:38:22] elukey: the reference-need model was added in Oct 2024: https://phabricator.wikimedia.org/T371902
[12:41:49] kevinbazira: https://logstash.wikimedia.org/goto/7eea6b1f2a67a842377d235383eaae7f
[12:42:35] so I went in the istio dashboard, filtered for ml-serve-eqiad and also for return code "0", that is the DC/client-giving-up code for istio
[12:43:21] it seems matching with what we are seeing, and it is WME related
[12:43:39] (I went in the Istio dashboard since all the traffic goes through Istio before reaching the isvcs)
[12:43:46] (and we have access logs)
[12:44:11] so I'd reach out to them and ask what they are doing
[12:45:25] thank you for looking elukey
[12:45:26] I am not sure the logstash link you shared is showing what you expected
[12:46:40] I tried it and it does, what do you see?
[12:50:10] ok got it. the user agent is: `WME/2.0 (https://enterprise.wikimedia.com/; wme_mgmt@wikimedia.org)`
[12:51:44] drilled down to the hour of 10: https://logstash.wikimedia.org/goto/efdaccbb1f9112eeb474fdb85ef502f8
[12:57:14] 06Machine-Learning-Team, 05Goal: Q1 FY2025-26 Goal: Scaling Add-a-link to more wikis via production (airflow) pipelines - https://phabricator.wikimedia.org/T398950#11009126 (10OKarakaya-WMF)
[12:59:56] kevinbazira: you also have the pie chart with the user agents, showing which ones are affected etc..
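The `predict_ms: 484540.20357132` value reported at 12:33 is roughly eight minutes for a single prediction, which would be consistent with clients closing the connection before a response arrives (the istio "DC"/return-code-0 entries). One way to check whether specific revisions are unusually slow, as suggested further down, is to time a predict call directly. The sketch below is a rough illustration only: the `reference-need` model name, the `{"rev_id", "lang"}` payload, and the public Lift Wing endpoint path are assumptions that should be verified against the actual service contract, and the revision IDs shown are hypothetical placeholders for the rev_ids seen in the logstash entries.

```python
import time

import requests

# Assumed endpoint, model name, and payload shape; check the Lift Wing documentation
# before relying on these. Anonymous access to api.wikimedia.org is rate limited.
ENDPOINT = "https://api.wikimedia.org/service/lw/inference/v1/models/reference-need:predict"
HEADERS = {"User-Agent": "ml-team-latency-check/0.1"}

# Hypothetical revision IDs; substitute the rev_ids observed in the slow requests.
REVISIONS = [("en", 1234567890), ("en", 1234567891)]

for lang, rev_id in REVISIONS:
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"rev_id": rev_id, "lang": lang},
        headers=HEADERS,
        timeout=600,  # generous, since predict times near 8 minutes were observed
    )
    elapsed = time.perf_counter() - start
    print(f"{lang} rev {rev_id}: HTTP {resp.status_code} in {elapsed:.1f}s")
```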
[13:00:01] WME seems to be the 99%
[13:00:48] yep it's super high 99.93%
[13:02:35] * kevinbazira joining ml tech meeting brb
[13:02:48] kevinbazira: one thing to check is if the rev-ids mentioned in the logs may lead to very long inference time
[13:03:07] just to see if something is really heavy for $reason and we have to improve the model
[13:13:44] RESOLVED: LiftWingServiceErrorRate: ...
[13:13:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revision-models&var-backend=reference-need-predictor.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:01:50] * kevinbazira back
[14:03:28] the reference-need-predictor issue seems to have resolved itself
[14:03:28] I will continue investigating the root cause
[15:39:36] 06Machine-Learning-Team: Investigate reference-need-predictor alert triggered by BrokenProcessPool error - https://phabricator.wikimedia.org/T399733 (10kevinbazira) 03NEW
[15:42:37] ^--- I have created a phab task that details the investigation of the reference-need-predictor alert: https://phabricator.wikimedia.org/T399733
[15:47:22] * kevinbazira afk
[15:51:08] 06Machine-Learning-Team: Investigate reference-need-predictor alert triggered by BrokenProcessPool error - https://phabricator.wikimedia.org/T399733#11009928 (10isarantopoulos) This seems to be the same issue described in {T387019}. In the last comment there is a mention about the broken process pool. I wonder...
[16:09:32] 06Machine-Learning-Team: Inputs for tone check model prediction - https://phabricator.wikimedia.org/T397013#11009981 (10achou) 05Open→03Resolved Resolved the ticket. Changes have been deployed!
[23:13:07] 06Machine-Learning-Team, 10Wikilabels: Make a wikilabels view to see a single task. - https://phabricator.wikimedia.org/T208239#11011454 (10Izno)
[23:13:08] 06Machine-Learning-Team, 10Wikilabels: Improve Wikilabels UI - https://phabricator.wikimedia.org/T252280#11011453 (10Izno)
[23:13:10] 06Machine-Learning-Team, 10Wikilabels: Make Wiki Labels mobile compatible - https://phabricator.wikimedia.org/T105518#11011456 (10Izno)