[09:23:19] hello folks
[09:28:46] isaranto: whenever you want we can sync about deployment-charts!
[09:31:40] hey! I am free. Google meet in 5’?
[09:34:51] morning!
[09:35:24] isaranto: sure!
[09:35:38] isaranto: o/ I wanna join :D
[09:36:24] aiko: sureee! :D
[09:36:45] meet.google.com/umy-zwne-wuk
[09:37:38] aiko: did you see https://phabricator.wikimedia.org/T320374#8409739 ?
[09:37:43] very interesting results with MP
[09:47:07] That's decent scaling indeed.
[09:47:11] also: morning :)
[10:56:18] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (kevinbazira)
[11:02:59] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (kevinbazira) 14/15 models were trained successfully in the 7th round of wikis. The Dzongkha Wikipedia (dzwiki) returned the error in the screenshot be...
[11:18:43] klausman: o/ for the other model servers it seems to work the same, the MP with some extra processes halves the latency timings
[11:19:14] we just wondered/brainstormed with Aiko and Ilias how/where to apply MP, since we have limited cpu/memory resources
[11:19:35] one idea was to apply it selectively to the busiest (i.e. in traffic) model servers only, leaving the rest with standard configs
[11:22:01] but Ilias is going to deploy the new images first, then do more tests on staging to see the perfs
[11:22:11] and after that we can decide all together a plan
[11:26:49] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (kevinbazira) I contacted @MGerlach on whether this error means that there is not enough data to train the model and he said: > Interesting. indeed, it...
[11:29:17] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 7th round of wikis - https://phabricator.wikimedia.org/T304551 (kevinbazira)
[11:33:44] elukey: yeah, I think using it only for models where it actually helps is a good idea. It's a bit of work when we get new models/code, but I think it's worth it
[11:36:16] klausman: the only thing that I don't like in this approach is that it kinda goes against https://wikimediafoundation.org/our-work/education/promoting-knowledge-equity/ in my mind, but it is also true that we have to strive for the best compromise
[11:37:51] Is this only affecting ORES models or new stuff as well?
[11:39:38] in theory only ORES models for the moment, since they are heavy on cpu bound sometimes
[11:42:45] In the mid-long term, I feel this aspect might be better addressed by changing the systematic design of how these services work (modularization, simplification) rather than making bespoke configs.
[11:43:32] But for now, I think the MP approach may be better in a "working now is better than perfect in the far future" sense.
[11:44:13] Us being aware of the issue you mentioned is already a good first step
[11:44:42] definitely, and also we cannot blame revscoring too much for not being up to speed with asyncio, that is a relatively "modern" approach in python
[11:45:09] so new model servers will be created with asyncio guidelines from the start, now that we more-or-less know the basics
[11:45:35] but heavy cpu bound models may come in the future, even with best intentions
[11:45:45] at that point we could also explore ray workers
[11:46:03] Ack.
[11:46:33] Plus future "side services" like Feature Stores and caches might alleviate some of the issues as well.
[11:47:12] or worsen our working life for good :D
[11:47:30] Always the optimist :)
[11:49:29] :)
[11:49:30] * elukey lunch
[12:30:17] Same here
[12:35:04] Me three!
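Editorial note on the MP/asyncio exchange above: a minimal sketch of the general idea, assuming a CPU-bound scoring step inside an asyncio server can be handed to a small pool of worker processes so the event loop stays free for other requests. The names `cpu_bound_score` and `handle_request` and the pool size of 2 are hypothetical; none of this is taken from the revscoring model servers or from the Phabricator task.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def cpu_bound_score(n: int) -> int:
    """Hypothetical stand-in for a CPU-heavy feature extraction or scoring step."""
    return sum(i * i for i in range(n))


async def handle_request(pool: ProcessPoolExecutor, n: int) -> int:
    """Run the CPU-bound step in a worker process instead of on the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, cpu_bound_score, n)


async def main() -> list[int]:
    # Pool size is an assumption; each extra process costs CPU and memory,
    # which is the trade-off raised in the conversation.
    with ProcessPoolExecutor(max_workers=2) as pool:
        requests = (handle_request(pool, n) for n in (10, 100, 1000))
        # gather() returns results in the same order as the awaitables passed in.
        return await asyncio.gather(*requests)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Whether the extra processes pay off depends on how CPU-bound a given model server is, which matches the idea of enabling MP only for the busiest servers.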
[14:26:26] Machine-Learning-Team, ORES, Advanced Mobile Contributions, Growth-Team, and 3 others: 'Highlight likely problem edits' preference doesn't select any filters in mobile web - https://phabricator.wikimedia.org/T318683 (Samwalton9)
[14:26:52] Machine-Learning-Team: Reduce number of published docker images for revscoring models - https://phabricator.wikimedia.org/T323586 (isarantopoulos)
[14:33:24] Machine-Learning-Team, ORES, MediaWiki-Core-Preferences, Moderator-Tools-Team (Kanban): When ORES quality filters are selected in mobile web, entries should be highlighted - https://phabricator.wikimedia.org/T314026 (Samwalton9) I'm realising that I misunderstood this feature when writing the tic...
[14:53:36] Lift-Wing, Machine-Learning-Team: Test MultilingualRevertRiskModel inference service locally with docker - https://phabricator.wikimedia.org/T323613 (achou)
[15:03:07] aiko: o/ team meeting?
[15:13:35] Machine-Learning-Team: Reduce number of published docker images for revscoring models - https://phabricator.wikimedia.org/T323586 (calbon) a:isarantopoulos
[15:14:07] Machine-Learning-Team: Upgrade the link recommendation algorithm from Spark 2 to Spark 3. - https://phabricator.wikimedia.org/T323493 (calbon) a:kevinbazira
[15:18:08] Machine-Learning-Team, artificial-intelligence, SRE, Service-deployment-requests: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (calbon) a:calbon
[15:19:37] Lift-Wing, Machine-Learning-Team, ML-Governance: Outlinks model card - https://phabricator.wikimedia.org/T287527 (calbon) Open→Resolved
[15:20:05] Machine-Learning-Team, Add-Link, Growth-Team, CommRel-Specialists-Support (Oct-Dec-2022): Inspect "add a link" models to improve their performance - https://phabricator.wikimedia.org/T309263 (calbon) a:kevinbazira
[15:23:03] Machine-Learning-Team, Add-Link, Growth-Team, CommRel-Specialists-Support (Oct-Dec-2022): Inspect "add a link" models to improve their performance - https://phabricator.wikimedia.org/T309263 (calbon) @Miriam We don't think we have the resources to do this. Let's chat
[15:23:23] Machine-Learning-Team, Add-Link, Growth-Team, CommRel-Specialists-Support (Oct-Dec-2022): Inspect "add a link" models to improve their performance - https://phabricator.wikimedia.org/T309263 (calbon) a:kevinbazira→calbon
[15:28:53] Machine-Learning-Team: Help Language team to make progress on open MT models to be used by Content Translation tool - https://phabricator.wikimedia.org/T302516 (klausman) Open→Resolved I'll close this ticket for now, since the main effort is focused on NLLB200 on AWS (https://phabricator.wikimedia.or...
[15:28:56] Lift-Wing, Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (calbon) a:elukey
[15:45:46] Lift-Wing, Machine-Learning-Team: Decide external URL scheme (on API GW) for models on Lift Wing - https://phabricator.wikimedia.org/T319178 (klausman)
[15:45:48] Lift-Wing, Machine-Learning-Team: API Gateway Integration - https://phabricator.wikimedia.org/T288789 (klausman)
[15:47:11] Lift-Wing, Machine-Learning-Team, Epic: API Gateway Integration - https://phabricator.wikimedia.org/T288789 (calbon)
[15:59:40] deploy to staging: which one is the staging cluster named as ml-staging-codfw? I don’t have it in my ssh config. I found this host `ml-staging2002.codfw.wmnet` with a random search on wikitech. is that it?
[16:00:25] yes exactly!
[16:00:39] ml-staging200[12] are the worker nodes
[16:03:43] gotta reboot this host, brb
[16:07:16] and I'm back
[16:09:09] elukey: how can I find out the available kubeconfigs to use with kube_env?
[16:11:12] isaranto: in theory tab should provide hints about those
[16:15:25] I get a `-bash: kube_env: command not found`. Am I supposed to ssh into somewhere else? (I mean not the worker nodes)
[16:15:56] ah no no only on the deploy1002 node
[16:16:09] it is the main point of entry
[16:16:29] you have access to the worker nodes but in theory (atm) it is not really needed
[16:19:51] Machine-Learning-Team: Test revscoring model servers on Lift Wing - https://phabricator.wikimedia.org/T323624 (elukey)
[16:20:44] isaranto: created --^
[16:20:48] and assigned to you :
[16:20:49] :)
[16:25:37] elukey: Cool! just to verify: so deploy1002.eqiad.wmnet is the staging cluster we use and deployment.eqiad.wmnet is the prod one?
[16:26:19] btw: is anybody else using the IRCCloud app on mac? it seems that it is quite difficult to use compared to the web version
[16:27:11] Machine-Learning-Team, Patch-For-Review: Test ML model-servers with Benthos - https://phabricator.wikimedia.org/T320374 (elukey) Open→Resolved In this task we worked on several things, that now have separate subtasks: * Refactored all the revscoring model servers to reuse code as much as possibl...
[16:27:43] Lift-Wing, Machine-Learning-Team: Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (elukey) Open→Resolved We paused this task for a long time due to T320374, and we have opened new subtasks to track the work. Closing this one in favor of more specific ones (like T323624)
[16:27:45] Lift-Wing, Machine-Learning-Team, Epic: Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (elukey)
[16:28:51] Machine-Learning-Team, Observability-Logging, Patch-For-Review, SRE Observability (FY2022/2023-Q2): Logging spam from revscoring deploys - https://phabricator.wikimedia.org/T320468 (elukey) @colewhite after a lot of research I think that this will go away with Istio 1.15.3, and we'll upgrade to i...
[16:29:15] isaranto: nono deployment.eqiad.wmnet is a CNAME to deploy1002, so they are the same
[16:29:34] we also have deploy200x that we use in emergency situations, this is why we have the CNAME
[16:29:46] so you can always use deployment.eqiad.wmnet (for both staging and prod)
[16:33:31] ok, that solved my confusion. thanks
[17:16:28] Machine-Learning-Team, Observability-Logging, Patch-For-Review, SRE Observability (FY2022/2023-Q2): Logging spam from revscoring deploys - https://phabricator.wikimedia.org/T320468 (colewhite) >>! In T320468#8413816, @elukey wrote: > @colewhite after a lot of research I think that this will go aw...
[17:26:14] going afk folks, have a nice rest of the day!
[17:30:34] \o
[18:01:22] night all
[18:33:43] o/