[05:30:39] good morning o/
[06:09:40] good morning ☀️
[06:53:07] Good morning
[08:37:10] I think I need to create a new Wikipedia page named "Browser Tab Clutter". I see it is missing :P
[09:46:06] 10Lift-Wing, 06Machine-Learning-Team, 10EditCheck, 10Editing-team (Tracking): Create SLO dashboard for tone (peacock) check model - https://phabricator.wikimedia.org/T390706#10949656 (10isarantopoulos) Thanks a lot for the helpful comments Luca and sorry for the delayed response here. regarding operations...
[09:47:16] klausman: (and rest of team) what do you think about the above --^? Shall we just use 2xx response codes in our latency SLO calculation? I think it makes more sense, I just want to hear if anyone has different thoughts on this
[09:56:20] I agree. 500s are obviously out, and neither 400 nor 300 is "productive traffic" in the strictest sense, so they would water down the SLO
[10:19:46] sgtm, thank you!
[11:33:54] To me it also sounds more informative to calculate latency only for 2xx responses. If we included 3xx and 4xx, throwing a lot of them could greatly improve our latency numbers :D
[12:15:18] agree!
[13:21:16] elukey, klausman: what's the next step for the issue with s3 + machinetranslation?
[13:21:53] I think that Tobias was in the process of fixing the swift account
[13:21:54] I am waiting for the ok from DP to restart the FEs so the username change is active, and then we can do a quick test from a statbox
[13:22:15] Noted. Thanks!
[13:37:46] Hey hey! What is the Lift Wing equivalent of https://ores.wikimedia.org/v3/scores? I noticed Growth has code like https://github.com/wikimedia/mediawiki-extensions-GrowthExperiments/blob/master/maintenance/importOresTopics.php#L283, which tries to determine whether articletopic can be generated.
[13:38:02] or does the new model work on _any_ wiki?
[13:40:25] Probably a stupid question: I see that in inference-services we are using `use-system-site-packages: false` in blubber.
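The SLO discussion earlier in the log can be illustrated with a minimal sketch (this is not the team's actual SLO code; sample values and the threshold are made up) of a latency SLI computed over 2xx responses only, so that fast 3xx/4xx/5xx responses cannot flatter the numbers:

```python
# Latency SLI over "productive traffic" (2xx) only, per the discussion above.
# Non-2xx samples are excluded from both numerator and denominator.
def latency_sli(samples, threshold_s=0.5):
    """samples: iterable of (http_status, latency_seconds) pairs."""
    ok = [lat for status, lat in samples if 200 <= status < 300]
    if not ok:
        return None  # no productive traffic in the window
    return sum(1 for lat in ok if lat <= threshold_s) / len(ok)

# The fast 404 and the slow 500 are ignored; only the two 2xx samples count,
# and one of them is within the 0.5s threshold.
sli = latency_sli([(200, 0.1), (200, 0.9), (404, 0.01), (500, 5.0)])
```

Including the 0.01s 404 would have pushed the SLI up, which is exactly the distortion being avoided.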
I am trying to build an image similar to that one from inference-services (but without kserve, of course). So I am using the same base image, which has torch in it. My question is: do I need to set `use-system-site-packages: true` in order to use the pre-installed libraries from the base image? I am using almost the same blubber config as inference-services, but when the kokkuri pipeline runs I see that torch is not being used.
[13:52:13] georgekyz: Are you creating a new Python venv in the new image? AFAIK `use-system-site-packages` translates to adding the `--system-site-packages` flag to the venv creation, which allows it to use packages from the global Python installation. However, I'd wonder whether `torch` is actually in the global Python installation of the base image, or in another venv inside your base image 🤔
[13:53:14] If torch is installed in a different venv, setting this flag to true probably won't solve it :(
[13:54:45] bartosz: thanks for the quick response, mate! I got confused a little bit with the paths. I am using `docker-registry.wikimedia.org/amd-pytorch23:2.3.0rocm6.0-3-20250511` as a base image and I was receiving an error that module torch was not found.
[13:55:01] But I think that I had messed up the paths
[13:56:44] More precisely, I was getting:
[13:56:47] https://www.irccloud.com/pastebin/GLB5duxB/
[13:57:34] georgekyz: Ah, I also got a little confused and thought that you might be using some specific image built in inference-services as your base
[13:58:09] probably because I was using the `Trainer` object from transformers, which needed `accelerate>=0.26.0`, but it was huge and I couldn't push that image to the registry. So I am trying to make a very lightweight image and finally manage to push it :P
[13:59:05] Now I built a new one without the `Trainer` object from transformers and wrote my custom training loop, so we go oldschool :P
[14:15:19] thanos-fe's have been restarted. No change yet on statbox s3cmds.
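The venv discussion above can be made concrete with a small stdlib sketch: blubber's `use-system-site-packages: true` amounts to creating the venv with `--system-site-packages`, which exposes only the *global* site-packages of the image. As noted above, if torch lives in a separate venv inside the base image, this flag will not surface it. The temp-dir path here is illustrative:

```python
# Create a venv with system-site-packages enabled and confirm the flag is
# recorded in pyvenv.cfg -- the same knob blubber's option toggles.
import pathlib
import tempfile
import venv

with tempfile.TemporaryDirectory() as tmp:
    venv.EnvBuilder(system_site_packages=True, with_pip=False).create(tmp)
    cfg = (pathlib.Path(tmp) / "pyvenv.cfg").read_text()
    # CPython writes this key as lowercase "true"/"false".
    uses_system_site = "include-system-site-packages = true" in cfg
```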
I think the extra swift command needs doing, but I don't know where it needs to be run
[14:25:45] done it, does it work now?
[14:30:16] because I see ERROR: Bucket 'wmf-ml-models' does not exist
[14:30:17] mmm
[14:32:08] Same here
[14:32:24] s3:/// is just empty
[14:37:56] now it lists an empty bucket, I think it depends on which account the command to allow the bucket is run from
[14:38:07] on thanos-fe1004.eqiad.wmnet
[14:38:20] I'm running on stat1011.
[14:38:31] root@thanos-fe1004:/etc/swift# source /etc/swift/account_AUTH_machinetranslation.env
[14:38:34] root@thanos-fe1004:/etc/swift# swift list
[14:38:36] wmf-ml-models
[14:38:39] wmf-ml-models+segments
[14:39:19] if I use account_AUTH_mlserve.env the output is the same
[14:40:20] if I set the same env vars as that env file on stat1011 and then run swift list, I get:
[14:40:31] Auth GET failed: https://thanos-swift.discovery.wmnet/auth/v1.0 401 Unauthorized [first 60 chars of response] b'Unauthorized This server could not verify t'
[14:45:59] yeah, but that env file is for swift; how are you using it on stat1011?
[14:46:19] There is a swift cli util there as well
[14:47:00] though I dunno if that is supposed to work, or if it being broken is "as desired"
[14:51:14] 06Machine-Learning-Team, 06Moderator-Tools-Team: AI/ML Infrastructure Request: Persist historical revert risk multilingual model scores for threshold analysis - https://phabricator.wikimedia.org/T397187#10950918 (10SSalgaonkar-WMF) @DMburugu thank you so much for this helpful response!! It totally makes sense...
[14:51:34] I think that the account may be configured in the wrong way
[14:51:37] see in swift.yaml
[14:51:38] machinetranslation:
[14:51:38]   access: '.admin'
[14:51:38]   account_name: 'AUTH_machinetranslation'
[14:51:38]   auth: 'https://thanos-swift.discovery.wmnet'
[14:51:40]   user: 'machinetranslation:prod'
[14:51:43] vs
[14:51:51] mlserve_ro:
[14:51:51]   access: ''
[14:51:51]   account_name: 'AUTH_mlserve'
[14:51:51]   auth: 'https://thanos-swift.discovery.wmnet'
[14:51:51]   user: 'mlserve:ro'
[14:51:54]   stats_enabled: 'no'
[14:52:02] I think that machinetranslation should look like --^
[14:52:11] and then get the +r permissions
[14:53:22] you have a point. I have a meeting until 17:00, will make a patch afterwards
[15:03:12] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1164235
[15:07:38] klausman: let's wait for Matthew, because in this case I am not sure how it is best to proceed
[15:07:40] urbanecm: o/ the Lift Wing equivalent of the old articletopic model can be accessed either via an internal endpoint (suitable for MediaWiki) https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Usage#Example_usage_of_internal_endpoint or via the API gateway for any other service https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_revscoring_articletopic_prediction
[15:08:02] because I suspect that every account has its own view of the buckets
[15:08:05] elukey: ack.
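Based on the swift.yaml snippets pasted above, a sketch of the suggested change (reshape the `machinetranslation` stanza to follow the `mlserve_ro` pattern, dropping `.admin` in favour of read permissions granted separately). The exact keys and the ACL mechanics are assumptions to verify against the puppet swift.yaml schema, not a finished patch:

```yaml
machinetranslation:
  access: ''            # was '.admin'; grant +r via container ACLs instead
  account_name: 'AUTH_machinetranslation'
  auth: 'https://thanos-swift.discovery.wmnet'
  user: 'machinetranslation:prod'
  stats_enabled: 'no'
```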
will add/edit the stuff you commented on
[15:08:21] I mean at this point we could use mlserve:ro directly
[15:08:23] if that is the case
[15:09:11] urbanecm: there is also a newer model that can be used, named articletopic outlink https://api.wikimedia.org/wiki/Lift_Wing_API/Reference/Get_articletopic_outlink_prediction https://meta.wikimedia.org/wiki/Machine_learning_models/Production/Language_agnostic_link-based_article_topic
[15:10:29] elukey: a separate account is nice in theory, but the distinction is getting pretty thin now
[15:47:52] isaranto: thanks for the clarification! that makes sense.
[16:05:33] * isaranto afk!
[21:31:53] 06Machine-Learning-Team, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10952217 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[21:32:31] 06Machine-Learning-Team, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10952221 (10Jclark-ctr) 05Resolved→03Open accidentally resolved ticket instead of assigning to my self
[21:33:09] 06Machine-Learning-Team, 06DC-Ops, 10ops-eqiad, 06SRE: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10952225 (10Jclark-ctr)
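For reference, the Lift Wing API-gateway usage discussed in the log can be sketched as follows. The `…/service/lw/inference/v1/models/{model}:predict` endpoint pattern follows the Lift Wing usage docs linked above; the model name (`enwiki-articletopic`) and `rev_id` value are placeholder assumptions — check the linked API references for the exact schema of each model. The request is only built here, not sent:

```python
# Build (but do not send) a Lift Wing prediction request via the API gateway.
import json
import urllib.request

url = ("https://api.wikimedia.org/service/lw/inference/v1/models/"
       "enwiki-articletopic:predict")  # placeholder model name
payload = {"rev_id": 12345}  # placeholder revision id
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it; anonymous access is rate-limited,
# so production callers typically attach an API token.
```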