[06:51:48] morning folks!
[06:59:03] Good morning
[07:09:06] good morning everyone!
[07:23:10] Athens is melting today and tomorrow :(
[07:35:15] the forecast really looks scary, please take care Ilias!
[07:35:59] will try sending some berlin cool your way
[07:38:32] thanks! I think it is the same heatwave that was @central europe last week
[07:53:48] ouch
[07:54:16] klausman: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1164235 should be good to go now?
[07:58:05] I think so, but nobody has added an LGTM :D
[07:58:56] :)
[08:40:53] klausman: o/ I think that we are good to go
[08:41:12] maybe just give a heads up on the data-persistence channel before starting
[08:41:40] I've also made a private repo patch to align the usernames: https://gerrit.wikimedia.org/r/c/labs/private/+/1166754
[08:55:43] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 4 others: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824#10977916 (isarantopoulos)
[08:55:45] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, and 4 others: [batch #3] Enable revertrisk filters in recent changes in multiple wikis - https://phabricator.wikimedia.org/T395824#10977917 (isarantopoulos) Open→Resolved
[08:56:45] Nice
[08:57:52] klausman: and then https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1163291 can be re-reviewed as well..
[08:58:17] I'll take a break and be back.
[09:00:23] elukey: the two patches are merged (and actually-private repo is also in sync), so we should be able to test from statboxes once Puppet has distributed everything (I suspect we still need to do the swift post -r bit once it is?)
[09:05:34] klausman: I think the swift proxies need to be restarted for the change to be picked up
[09:05:50] oh, right
[09:05:55] then we can probably do the post -r
[09:06:58] ack!
I will wait a bit for Puppet to do its thing (and find breakfast :)) and then do the proxy restart, will let you know once that's done
[09:07:28] From https://docs.redhat.com/en/documentation/red_hat_openstack_platform/10/html/command-line_interface_reference_guide/swiftclient_subcommand_post it seems that the swift post command accepts a list of accounts separated by comma
[09:08:44] It's a bit of a gotcha that using it will wipe out previous settings
[09:27:02] yep definitely
[09:41:28] elukey: klausman: o/ I am trying to build and test an openvino model-server on ml-lab1002 using `docker-pkg`.
[09:41:28] previously we solved a webproxy issue by adding `http_proxy` to the config.yaml: https://phabricator.wikimedia.org/P76252#306630
[09:41:28] I've tried a similar fix but I'm still running into the proxy issue as shown here: https://phabricator.wikimedia.org/P78766
[09:41:28] for more context, here is the openvino image source code as set up by the language team: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1153988
[09:50:46] kevinbazira: o/ I think that we should properly puppetize docker on ml-lab1002 before using it again
[09:51:08] okok
[09:59:31] is there anything I can do to help with properly puppetizing docker on ml-lab1002?
[10:00:41] let's sync with klausman, not sure what plans he has for it
[10:01:55] I think the right machine to puppetize docker on would be lab1001, since we kind of have it earmarked for that use (LLM image building), but otherwise agreed with Luca
[10:14:37] elukey: swifty-thanos proxies have been restarted
[10:15:39] Read ACL: mlserve:ro,machinetranslation:ro
[10:17:30] so you can test from stat1011
[10:18:35] SignatureDoesNotMatch error. I must have missed a username move somewhere
[10:23:40] elukey: so in hieradata/common/profile/thanos/swift.yaml (private repo) we have `mlserve_ro: [password]`, but `machinetranslation: [other password]`.
And in hieradata/role/common/deployment_server/kubernetes.yaml we have both with :ro (not _ro). This is very confusing
[10:27:09] yes, there is an _ro missing (compare to hieradata/common/profile/thanos/swift.yaml in the normal puppet repo). Fixing.
[10:53:37] * aiko has an appointment, back in 1.5h
[11:11:17] elukey: ok, user's fixed, can you re-run the swift post bit? (or I can, but I dunno where it needs to run)
[11:19:02] * klausman out for a bit
[11:27:35] klausman: I've run it on thanos-fe1004, sudo -i + sourced the /etc/swift creds for the mlserve account
[11:27:39] but Read ACL: mlserve:ro,machinetranslation:ro still holds
[11:27:48] so if the user is good, I don't think we need more
[12:06:57] $ s3cmd -c a.cfg ls s3://wmf-ml-models/
[12:06:59] DIR s3://wmf-ml-models/article-country/
[12:07:01] DIR s3://wmf-ml-models/article-descriptions/
[12:07:03] DIR s3://wmf-ml-models/articlequality/
[12:07:05] (rtc)
[12:07:08] so yes, it works from statboxes \o/
[12:08:02] gooood
[12:08:21] so now kart_ can proceed, using the new creds though
[12:08:42] Yes, they should be all wired up correctly now
[12:09:12] kart_: I'll do another quick pass on https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/1163291 but I doubt I'll find anything big
[12:29:19] Cool
[12:31:13] LGTM
[12:37:11] * aiko back!
[13:43:39] Machine-Learning-Team, Add-Link, Growth-Team, Goal: FY2024-25 Q4 Goal: Investigate Add-a-link model training and deployment - https://phabricator.wikimedia.org/T393474#10979026 (OKarakaya-WMF) I've converted hdfs tables into pkl and used the rest of the pipeline as is. I've deployed one of the...
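Note on the `swift post -r` gotcha raised at 09:07-09:08: because posting a read ACL replaces the previous value rather than appending to it, one way to avoid wiping existing entries is to merge them with the new ones before posting. The following is only a minimal sketch of that idea in Python. The helper names (`merge_read_acl`, `build_post_command`) and the container name `example-container` are hypothetical and not taken from the chat; the ACL values are the ones quoted above. The command is only built as a list, not executed.

```python
def merge_read_acl(current_acl, new_entries):
    """Return a comma-separated ACL holding the existing entries plus the new ones.

    Order is preserved and duplicates are dropped, so the result can be posted
    as a full replacement without losing what was already set.
    """
    merged = []
    for entry in current_acl.split(",") + list(new_entries):
        entry = entry.strip()
        if entry and entry not in merged:
            merged.append(entry)
    return ",".join(merged)


def build_post_command(container, acl):
    """Build (but do not run) a `swift post -r` invocation for the given container."""
    return ["swift", "post", "-r", acl, container]


# Existing ACL plus the new reader, as in the "Read ACL" output quoted in the chat.
acl = merge_read_acl("mlserve:ro", ["machinetranslation:ro"])
print(acl)
# "example-container" is a placeholder, not a real container name.
print(build_post_command("example-container", acl))
```

Merging first means re-running the command is harmless: posting an ACL that already contains an entry yields the same string again.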
[14:21:18] Machine-Learning-Team, EditCheck, Editing-team (Tracking): Use SHAP values to highlight peacock words - https://phabricator.wikimedia.org/T387984#10979218 (achou) Open→Resolved
[14:21:59] Machine-Learning-Team, EditCheck: Evaluate the existing peacock detection model - https://phabricator.wikimedia.org/T386645#10979224 (achou) Open→Resolved
[14:23:34] Machine-Learning-Team: Inputs for tone check model prediction - https://phabricator.wikimedia.org/T397013#10979244 (achou) a:achou
[14:54:56] artificial-intelligence, Machine-Learning-Team, Edit-Review-Improvements-RC-Page, editquality-modeling, and 2 others: Add new recent changes filters to az.wiki - https://phabricator.wikimedia.org/T310691#10979517 (Nemoralis) Open→Resolved a:Nemoralis {T395824}
[14:55:02] artificial-intelligence, Machine-Learning-Team, editquality-modeling, Wikilabels: Complete azwiki edit quality campaign - https://phabricator.wikimedia.org/T129699#10979524 (Nemoralis) Open→Resolved a:Nemoralis {T395824}
[14:55:20] artificial-intelligence, Machine-Learning-Team, editquality-modeling: Deploy edit quality models for azwiki - https://phabricator.wikimedia.org/T130278#10979531 (Nemoralis) Open→Resolved a:Nemoralis {T395824}
[15:05:44] FIRING: LiftWingServiceErrorRate: ...
[15:05:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=hewiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:11:38] ----^ checking
[15:30:44] RESOLVED: LiftWingServiceErrorRate: ...
[15:30:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=hewiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[15:35:28] when the incident started at 14:33, the hewiki-damaging-predictor was getting tons of requests with the same rev_id, which caused timeouts fetching from MW APIs and increased the preprocessing time.. I'm not sure if these requests are from retries or something else.. need to investigate
[15:35:54] part of logs: https://phabricator.wikimedia.org/P78778
[15:43:40] Thanks Aiko!
[16:08:30] I think these are not retries, since they all have different request IDs
[16:08:57] and since we don't have caching, each request would do the same preprocessing step
[16:09:38] 2025-07-07 14:33:13.568 kserve.trace requestId: 3e8dc46b-74f7-4595-8de7-9158b50377e0, preprocess_ms: 10928.990602493
[16:09:43] 2025-07-07 14:33:18.950 kserve.trace requestId: 7296c72b-40f7-4b0b-9fe8-afed61e47b6d, preprocess_ms: 16312.951087952
[16:09:54] 2025-07-07 14:33:24.139 kserve.trace requestId: 57975a9e-f884-465b-a6e2-6bb8fec640c7, preprocess_ms: 21495.087862015
[16:10:05] like these three requests all processed the same rev_id
[16:12:25] I'll add more information to MLOps week log
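Note on the retry question above: the "different request IDs" check can be reproduced mechanically from the trace lines. The sketch below, in Python, parses the three `kserve.trace` lines quoted in the chat and counts distinct request IDs alongside the slowest preprocessing time. The regular expression and the helper names (`parse_trace`, `summarize`) are assumptions made for illustration based only on those three lines; the real log format may carry more fields, and the rev_id itself is not part of these lines.

```python
import re

# The three trace lines quoted in the chat (incident window, 2025-07-07 14:33).
LOG_LINES = [
    "2025-07-07 14:33:13.568 kserve.trace requestId: 3e8dc46b-74f7-4595-8de7-9158b50377e0, preprocess_ms: 10928.990602493",
    "2025-07-07 14:33:18.950 kserve.trace requestId: 7296c72b-40f7-4b0b-9fe8-afed61e47b6d, preprocess_ms: 16312.951087952",
    "2025-07-07 14:33:24.139 kserve.trace requestId: 57975a9e-f884-465b-a6e2-6bb8fec640c7, preprocess_ms: 21495.087862015",
]

# Assumed shape of a trace line, inferred from the samples above.
TRACE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) kserve\.trace requestId: (?P<request_id>[0-9a-f-]+), "
    r"preprocess_ms: (?P<preprocess_ms>[0-9.]+)$"
)


def parse_trace(line):
    """Parse one trace line into a dict, or return None if it does not match."""
    match = TRACE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "ts": match.group("ts"),
        "request_id": match.group("request_id"),
        "preprocess_ms": float(match.group("preprocess_ms")),
    }


def summarize(lines):
    """Count parsed entries and distinct request IDs, and find the slowest preprocess."""
    entries = [e for e in (parse_trace(line) for line in lines) if e is not None]
    request_ids = {e["request_id"] for e in entries}
    return {
        "count": len(entries),
        "unique_request_ids": len(request_ids),
        "max_preprocess_ms": max((e["preprocess_ms"] for e in entries), default=0.0),
    }


summary = summarize(LOG_LINES)
print(summary)
```

If `count` equals `unique_request_ids`, every entry carried its own request ID, which is the observation used above to argue against retries. That only holds if retries would reuse the request ID, which is itself an assumption.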