[01:02:40] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), Patch-For-Review: [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10869925 (Scardenasmolinar) I have been testing this locally, and I can see the highlighting,...
[01:03:28] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), Patch-For-Review: [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10869927 (Scardenasmolinar) Here is a screenshot of Vector 2010 {F60829742}
[01:38:42] artificial-intelligence, cloud-services-team: Supporting AI, LLM, and data models on WMCS - https://phabricator.wikimedia.org/T336905#10869930 (Huji) I keep occasionally getting pinged about this general topic on fawiki. Various users there are envisioning a lot of value from having LLMs helped with tran...
[06:29:44] good morning!
[06:54:55] elukey: I see that all staging deployments got updated yesterday, thank you so much for this ❤️
[06:55:19] bartosz: <3
[06:55:34] it was part of a big migration, sorry if they were messy before
[06:56:40] I was updating the docker images in the charts on wednesday and was about to re-deploy them all today, but there’s no need now :D
[07:01:40] I’ve noticed there’s now one more failing deployment for `readability-old` in the `readability` namespace, but in this case it was probably wrong for me to update docker image, created a patch bumping down to the old image: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1152189
[07:12:44] ^ Deployment is happy now after merging and syncing, thanks again Luca <3
[07:13:37] anytime!
[07:14:36] Good morning folks
[07:15:40] as FYI, ml-serve-eqiad is currently depooled, I am moving everything to PSS (safe thing, it was already done on the other two clusters, I depooled just to be extra sure)
[07:19:11] good morning!
[07:21:04] o/ bartosz you don't have to run load tests for all models in staging. You can just run em for 1-2 so that you can see how we run them
[07:21:59] it would be nice if you could add a load test for a model server that is missing so that you could play around with locust
[07:23:25] Morning!
[07:23:28] isaranto: this sounds good, will do! and thanks a lot for fixing the storage_uri for articlequality model <3
[07:23:35] elukey: oh wow, you got busy yesterday!
[07:25:15] klausman: yeah, I took the opportunity of the silent day :D
[07:27:00] Machine-Learning-Team, Patch-For-Review: Deploy tone check model to production - https://phabricator.wikimedia.org/T394779#10870079 (gkyziridis) >>! In T394779#10864504, @isarantopoulos wrote: >> [] Update the WIP patch in VisualEditor to adapt to the above 2 changes > Looking at https://gerrit.wikimedia...
[07:29:43] elukey: a wise choice
[07:29:52] elukey: so what's still left for PSS?
[07:31:42] klausman: I am doing the final checks now, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152194 is safe to be merged (basically it just disables PSP in the control plane)
[07:32:18] I see some intermittent 503s from httpbb, so I am trying to recycle all the istio gateway pods just in case it is something related to that
[07:32:28] traffic is depooled
[07:35:01] Roger
[07:39:50] ok so I still see a couple of 503s here and there, just as a precaution I am going to recycle all the pods
[07:40:07] I have the feeling that moving to PSS and everything else may cause some issues to envoy
[07:40:17] but playing whack-a-mole doesn't make sense :D
[07:41:59] (it will take a bit, I'll report when done)
[07:59:58] I am also monitoring https://logstash.wikimedia.org/app/discover#/view/7f276c90-f8a0-11ee-be54-8fd74c07934f?_g=h@c823129&_a=h@b403c8c for PSS violations
[08:00:03] and none registered so far
[08:00:31] all the isvc namespaces are now running in the restricted namespace
[08:00:41] restricted PSS sorry
[08:08:52] That URL may have the "Logstash can't share" problem
[08:09:30] (or it's just LS being silly because there are no results)
[08:10:08] it is a generic dashboard, nothing pinned
[08:10:13] ack
[08:10:24] basically if any event is registered, there is an issue
[08:10:35] https://logstash.wikimedia.org/app/discover#/view/7f276c90-f8a0-11ee-be54-8fd74c07934f is the link from wikitech
[08:11:36] Yeah, it's the usual "unable to restore state from URL" thing, but the dashboard and filters look ok to my untrained eyes
[08:13:20] Always a bit tricky when there's no violations to verify the dash works :)
[08:15:18] :)
[08:15:34] all httpbb tests are passing now, eqiad and codfw
[08:16:03] but there is something related to envoy calling mw-api that looks strange. Sometimes the tests fail, I saw this during the past days
[08:17:50] and the error is a 503 returned when trying to retrieve features from the MW API
[08:17:53] example here https://phabricator.wikimedia.org/P76701
[08:20:06] Hmm. Is that 503 generated by envoy or by the model server?
[08:20:15] that's envoy
[08:20:21] trying to fetch from MW api
[08:20:22] Ah, envoy, because upstream reset the conn
[08:20:59] So MWAPI dropped the connection before answering the query (fully)
[08:21:19] ah, even before the first response byte.
[08:21:23] https://github.com/istio/istio/pull/52055 may help, but of course not in our current version
[08:21:50] Though I wonder why MWAPI dropped the connection
[08:24:35] if it really did, it may be something that envoy thinks happened
[08:25:04] good point
[08:26:03] we do set maxRequestsPerConnection: 1000, idleTimeout: 5s in all the destination rules
[08:26:26] and the rationale was https://phabricator.wikimedia.org/T320374#8338627
[08:26:34] but that was a long time ago
[08:27:22] You think something reaps a reused connection just as it's about to be used again?
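(Annotation: the idle-connection race raised at 08:27:22 can be sketched roughly as below. This is only a toy model, not how envoy or the MW API actually behave. The 5 s client idle timeout is the `idleTimeout` value quoted at 08:26:03; the 4 s server-side timeout, the 0.5 s margin, and the function name `classify_reuse` are assumptions made up for illustration.)

```python
def classify_reuse(idle_s, client_idle_timeout_s, server_idle_timeout_s, margin_s=0.5):
    """Classify what happens when a pooled connection is picked up after idle_s seconds.

    Toy model only: the client drops connections idle for client_idle_timeout_s,
    and the upstream closes connections idle for server_idle_timeout_s.
    """
    if idle_s >= client_idle_timeout_s:
        # The client pool already discarded the connection, so a fresh one is opened.
        return "new-connection"
    if idle_s >= server_idle_timeout_s - margin_s:
        # The upstream may be closing the connection right as the request is sent,
        # which the proxy would see as a reset before the first response byte.
        return "racy"
    return "safe"


# Client idle timeout from the chat (5 s); the 4 s upstream timeout is hypothetical.
for idle in (1.0, 3.8, 6.0):
    print(idle, classify_reuse(idle, client_idle_timeout_s=5.0, server_idle_timeout_s=4.0))
```

(In this model the race only exists when the upstream's idle timeout is shorter than, or close to, the client's; with a hypothetical 60 s upstream timeout every reuse under 5 s would be classified as safe.)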
[08:27:52] that could explain it, or at the time it was something that serviceops thought as well
[08:27:58] not sure if they use the same setting
[08:28:02] or if anything changed
[08:28:24] but it would be worth measuring how many 50X we return in percentiles
[08:28:41] it may resolve when we have SLOs
[08:28:59] but I suspect that we are seeing this issue often at high percentiles
[08:29:54] Yeah, agreed.
[09:03:19] Machine-Learning-Team, Patch-For-Review: Deploy tone check model to production - https://phabricator.wikimedia.org/T394779#10870257 (isarantopoulos) Open→Resolved Ok, thanks for clearing that up!
[09:27:52] all pods recycled for ml-serve-eqiad
[09:28:09] no violation registered, I think we are good
[09:28:32] httpbb now seems to be failing only for articledescription, but it was succeeding before
[09:28:40] seems again feature fetching related
[09:28:48] anyway, the PSS migration is complete!
[09:29:40] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10870321 (elukey) Open→Resolved Recycled all the pods in ml-serve-eqiad to be sure, no PSS violation registered. Migration com...
[09:39:49] hurray!!
[09:56:10] * isaranto afk lunch
[10:01:12] Machine-Learning-Team, Web-Team: Non-English articles show autogenerated English summaries - https://phabricator.wikimedia.org/T395596#10870480 (ovasileva) p:Triage→High
[10:01:36] (PS1) Bartosz Wójtowicz: langid: Add locust load test for language identification model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1152251 (https://phabricator.wikimedia.org/T393865)
[10:03:51] Machine-Learning-Team, Web-Team: Non-English articles show autogenerated English summaries - https://phabricator.wikimedia.org/T395596#10870489 (ovasileva) @putnik, thank you for flagging this! Looks like a bug with the browser extension.
[11:11:35] hey team, I've got a small load test question - where do we usually run them from? should I use stat/deployment machines for those?
[11:30:22] hola! we usually run them on statboxes using the makefile that exists under test/locust
[11:31:33] `MODEL_LOCUST_DIR="langid" make run-locust-test`
[11:44:51] I see, thanks!
[11:47:32] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Moderator-Tools-Team, Epic: Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668 (isarantopoulos) NEW
[11:52:22] lemme know if you need any help
[11:57:07] georgekyz: I reviewed the patch for the ores extension https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1151693 there are a couple of things we need to consider since the extension is already enabled for these wikis
[11:58:54] (PS2) Bartosz Wójtowicz: langid: Add locust load test for language identification model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1152251 (https://phabricator.wikimedia.org/T393865)
[12:01:32] isaranto: I've run the load tests for langid and pushed the results to the gerrit patch. It seems I got 36ms median response time with 0 failures
[12:02:13] ack! nice!
[12:04:59] One thing I'm wondering about is what are your favourite workflows when working on multiple machines without pushing code to origin? Do you create a new gerrit ssh key for each machine and later fetch specific patches from gerrit?
[12:06:17] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikipedia-Android-App-Backlog, FY2024-25 WE4.2, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10870761 (Nemoralis...
[12:07:52] bartosz: we shouldn't use ssh keys on shared machines. For these types of use cases where you need to test a patch you can just pull the patch using anonymous http
[12:08:17] for example `git fetch https://gerrit.wikimedia.org/r/machinelearning/liftwing/inference-services refs/changes/51/1152251/2 && git checkout -b change-1152251 FETCH_HEAD`
[12:08:22] you can get that from the gerrit UI
[12:09:09] Yess that's what I've done, but I also had to scp the results back to my local machine. I was wondering if I can update the patch directly from statbox
[12:12:21] ok got it! I'm not aware if anyone does that. I just thought that it would be a really bad idea to add a private ssh key to a statbox :(
[12:12:43] for this specific use case since we just want the xxx_stats.csv I think I just copy pasted from the terminal
[12:13:00] thanks for letting me know about ssh keys, I was about to try this 😇
[12:13:06] isaranto: Thnx for reviewing it. I am on it
[12:13:34] georgekyz: I'm available to chat about it if you want
[12:13:56] I'm not sure what would be the best deployment strategy
[12:16:47] is there any chance that we can deploy without disabling the UI? Just by setting up the thresholds and then running the backfill script for the tables?
[12:18:34] (PS3) Bartosz Wójtowicz: langid: Add locust load test for language identification model. [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1152251 (https://phabricator.wikimedia.org/T393865)
[12:30:28] If we add revertrisk filters they will just show up in the UI. I don't see an issue with that though
[12:35:02] ok so we can still deploy with UI enabled, goodfaith/damaging models still enabled, and we just add the revertrisklanguageagnostic enable and the corresponding thresholds and we're good to go
[12:35:05] I updated the patch
[12:39:47] Exactly!
[12:40:14] Unless there is a way to disable a specific filter from the UI...
[12:40:38] But if there is not we'll just go with what you suggested
[12:49:56] I'm taking a look to see if this is possible
[12:51:52] ty
[13:02:33] (PS3) Gkyziridis: ores-extension: Add extra logging [extensions/ORES] - https://gerrit.wikimedia.org/r/1151706 (https://phabricator.wikimedia.org/T395253)
[13:03:31] (PS4) Gkyziridis: ores-extension: Add extra logging [extensions/ORES] - https://gerrit.wikimedia.org/r/1151706 (https://phabricator.wikimedia.org/T395253)
[13:06:56] (CR) Gkyziridis: "Basically I am using the rule that 'RevisionNotFound' => `4xx error ++1` and '\RuntimeException' => `runtimeExceptionErrors ++1`." [extensions/ORES] - https://gerrit.wikimedia.org/r/1151706 (https://phabricator.wikimedia.org/T395253) (owner: Gkyziridis)
[13:08:54] (PS5) Gkyziridis: ores-extension: Add extra logging [extensions/ORES] - https://gerrit.wikimedia.org/r/1151706 (https://phabricator.wikimedia.org/T395253)
[13:21:27] (PS6) Gkyziridis: improve logging logic for PopulateDatabase backfill script [extensions/ORES] - https://gerrit.wikimedia.org/r/1151706 (https://phabricator.wikimedia.org/T395253)
[13:45:15] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), Patch-For-Review: [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10870955 (Kgraessle) @Scardenasmolinar I did notice that greying out effect, but I saw it both...
[13:47:49] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikipedia-Android-App-Backlog, FY2024-25 WE4.2, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10870959 (isarantop...
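(Annotation: the counting rule quoted in the 13:06:56 review comment could look roughly like the sketch below. It is written in Python rather than the extension's PHP, and it is not the actual PopulateDatabase code: the `RevisionNotFound` class, the `tally_scoring_errors` helper, the `"ok"` counter and the use of Python's built-in `RuntimeError` in place of PHP's `\RuntimeException` are all assumptions for illustration.)

```python
from collections import Counter


class RevisionNotFound(Exception):
    """Hypothetical stand-in for the 'RevisionNotFound' error named in the review comment."""


def tally_scoring_errors(rev_ids, fetch_score):
    """Call fetch_score(rev_id) for each revision and count outcomes by kind."""
    counts = Counter()
    for rev_id in rev_ids:
        try:
            fetch_score(rev_id)
            counts["ok"] += 1
        except RevisionNotFound:
            # Rule from the comment: a missing revision counts as a 4xx error.
            counts["4xx"] += 1
        except RuntimeError:
            # Rule from the comment: runtime exceptions get their own counter.
            counts["runtimeExceptionErrors"] += 1
    return counts


def fake_fetch(rev_id):
    """Toy scorer: revision 2 is missing, revision 3 fails at runtime."""
    if rev_id == 2:
        raise RevisionNotFound(rev_id)
    if rev_id == 3:
        raise RuntimeError("upstream failure")
    return 0.5


print(dict(tally_scoring_errors([1, 2, 3, 4], fake_fetch)))
```

(Keeping the two failure kinds in separate counters means the backfill log can tell missing revisions apart from genuine runtime failures.)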
[13:48:02] (PS1) Máté Szabó: LiftWingService: Add tests [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705)
[13:48:04] (PS1) Máté Szabó: LiftWingService: Unify request creation [extensions/ORES] - https://gerrit.wikimedia.org/r/1152268 (https://phabricator.wikimedia.org/T364705)
[13:48:05] (PS1) Máté Szabó: LiftWingService: Add method to evaluate pre-save revert risk [extensions/ORES] - https://gerrit.wikimedia.org/r/1152269 (https://phabricator.wikimedia.org/T364705)
[13:48:07] (PS1) Máté Szabó: Add revertrisk_score AbuseFilter variable [extensions/ORES] - https://gerrit.wikimedia.org/r/1152270 (https://phabricator.wikimedia.org/T364705)
[14:02:14] (CR) CI reject: [V:-1] LiftWingService: Add tests [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[14:02:31] (CR) CI reject: [V:-1] Add revertrisk_score AbuseFilter variable [extensions/ORES] - https://gerrit.wikimedia.org/r/1152270 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[14:02:32] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikipedia-Android-App-Backlog, FY2024-25 WE4.2, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10871022 (Nemoralis...
[14:02:48] (CR) CI reject: [V:-1] LiftWingService: Add method to evaluate pre-save revert risk [extensions/ORES] - https://gerrit.wikimedia.org/r/1152269 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[14:03:12] (CR) CI reject: [V:-1] LiftWingService: Unify request creation [extensions/ORES] - https://gerrit.wikimedia.org/r/1152268 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[14:04:05] (PS2) Máté Szabó: LiftWingService: Add tests [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705)
[14:04:05] (PS2) Máté Szabó: LiftWingService: Unify request creation [extensions/ORES] - https://gerrit.wikimedia.org/r/1152268 (https://phabricator.wikimedia.org/T364705)
[14:04:06] (PS2) Máté Szabó: LiftWingService: Add method to evaluate pre-save revert risk [extensions/ORES] - https://gerrit.wikimedia.org/r/1152269 (https://phabricator.wikimedia.org/T364705)
[14:04:12] (PS2) Máté Szabó: Add revertrisk_score AbuseFilter variable [extensions/ORES] - https://gerrit.wikimedia.org/r/1152270 (https://phabricator.wikimedia.org/T364705)
[14:17:31] (CR) CI reject: [V:-1] LiftWingService: Add method to evaluate pre-save revert risk [extensions/ORES] - https://gerrit.wikimedia.org/r/1152269 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[14:17:39] (CR) CI reject: [V:-1] LiftWingService: Add tests [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[14:17:48] (CR) CI reject: [V:-1] Add revertrisk_score AbuseFilter variable [extensions/ORES] - https://gerrit.wikimedia.org/r/1152270 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[14:18:23] (CR) CI reject: [V:-1] LiftWingService: Unify request creation [extensions/ORES] - https://gerrit.wikimedia.org/r/1152268 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[14:28:14] Machine-Learning-Team, MediaWiki-extensions-ORES, Wikipedia-Android-App-Backlog, FY2024-25 WE4.2, Moderator-Tools-Team (Kanban): Ensure all ORES i18n messages are available for wikis to add revert risk language agnostic filters to - https://phabricator.wikimedia.org/T395481#10871083 (isarantop...
[14:42:55] (CR) Máté Szabó: "recheck" [extensions/ORES] - https://gerrit.wikimedia.org/r/1152267 (https://phabricator.wikimedia.org/T364705) (owner: Máté Szabó)
[14:58:25] Machine-Learning-Team, MediaWiki-extensions-ORES, MediaWiki-Recent-changes, Epic, and 2 others: Enable revertrisk filters in simplewiki & trwiki - https://phabricator.wikimedia.org/T395668#10871262 (Kgraessle)
[15:04:05] (CR) Kgraessle: [C:+1] improve logging logic for PopulateDatabase backfill script [extensions/ORES] - https://gerrit.wikimedia.org/r/1151706 (https://phabricator.wikimedia.org/T395253) (owner: Gkyziridis)
[15:05:02] * isaranto afk bbl
[15:50:26] Machine-Learning-Team: Article Summary Generation and Evaluation Pipeline using vLLM image - https://phabricator.wikimedia.org/T395246#10871442 (kevinbazira) >>! In T395246#10866609, @isarantopoulos wrote: > The initial request can just be tackled within a notebook but we want to use the vllm image so that w...
[16:55:13] awesome work Kevin --^
[17:25:59] Machine-Learning-Team: Article Summary Generation and Evaluation Pipeline using vLLM image - https://phabricator.wikimedia.org/T395246#10871778 (isarantopoulos) Awesome work Kevin! @kevinbazira Could you rerun these next week using also prompt [[ https://gitlab.wikimedia.org/repos/research/simple-summaries/...
[17:31:39] (CR) Ilias Sarantopoulos: "Shall we just keep the stats.csv since we do that for other model servers as well?" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1152251 (https://phabricator.wikimedia.org/T393865) (owner: Bartosz Wójtowicz)
[17:32:56] going afk now. have a nice weekend everyone!
[18:53:55] Machine-Learning-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban), Patch-For-Review: [Spike] Investigate why filtering wasn't working on testwiki - https://phabricator.wikimedia.org/T395256#10872146 (Scardenasmolinar) Sure, let's do that!
[19:15:58] (CR) Scardenasmolinar: [C:+1] "This looks good! Thanks for all of the time you put into solving this bug! 🎉" [extensions/ORES] - https://gerrit.wikimedia.org/r/1151700 (https://phabricator.wikimedia.org/T395256) (owner: Kgraessle)
[21:22:02] Machine-Learning-Team, Web-Team: Non-English articles show autogenerated English summaries - https://phabricator.wikimedia.org/T395596#10872612 (Jdlrobson) a:Jdrewniak I understand Jan is looking into this.
[21:26:16] Machine-Learning-Team, Web-Team: Non-English articles show autogenerated English summaries - https://phabricator.wikimedia.org/T395596#10872647 (Jdlrobson) Open→In progress