[03:24:37] artificial-intelligence, Wikilabels, editquality-modeling, Machine-Learning-Team (Active Tasks): Edit quality campaign for Vietnamese Wikipedia - https://phabricator.wikimedia.org/T114509 (Phjtieudoc) Is this campaign still available?
[08:07:27] Hi folks!
[08:18:49] hello! How are you feeling?
[08:40:22] kevinbazira: o/
[08:40:36] elukey: o/
[08:40:47] if you want to run deployment-chart's CI locally (to spot earlier on errors) you can run `rake run_locally['default']`
[08:41:00] it will spin up a container and run the same logic
[08:41:07] (so you don't have to send a new patch etc..)
[08:42:18] great, thanks!
[08:50:29] having some weird disconnections with irccloud
[08:50:43] other than that much better!!
[08:57:15] * elukey bbiab!
[09:17:40] hello!
[09:24:19] o/
[09:33:41] folks I posted the new SLO Dashboards on slack
[09:33:46] lemme know if you like them
[09:35:12] taking a look now 👀
[09:35:22] I also added outlink
[09:35:25] I created this https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/954886
[09:35:47] all in? :D
[09:35:53] I think we can deploy all wikis except enwiki and wikidata - the same thing we did last time. lemme know what u think. I
[09:35:59] +1
[09:36:45] Now that the thing with the thresholds is out of the way (we deployed the numeric values for all wikis last week) only the liftwing requests change
[09:37:18] isaranto: the service ops team is rebooting mw nodes so it may not be a great time to deploy (if you were planning to do it)
[09:42:13] ack. I will wait for Amir's feedback, It doesn't really matter if it happens today or tomorrow
[09:59:05] Machine-Learning-Team, Research, Section-Level-Image-Suggestions, Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (mfossati) Open→In progress a:mfossati
[10:06:02] elukey: o/ about Outlink dashboard, maybe we need the predictor's metrics.
I'm thinking of the event stream use case. For the event stream, events are sent directly from the predictor, not going through the transformer. What do you think?
[10:12:08] aiko: o/ yeah but IIUC the post-processing is handled by the transformer, so the "final" http response should be returned by it
[10:12:45] client --> transformer (preprocess) -- calls --> predictor (process) -- returns to --> transformer (postprocess) --> client
[10:13:03] so the client (in this case the istio gateway) never sees the predictor
[10:13:19] if the predictor returns a 500 it will be the same for the transformer
[10:13:27] I am almost sure but I have to verify
[10:13:30] does it make sense?
[10:17:59] but we have different responses for event stream and normal requests. For normal requests, it does go to post-processing in transformer. But for event stream, iiuc it doesn't go to post-processing
[10:20:17] I'm not sure if the transformer will get the same errors when we have errors in sending events?
[10:21:02] I think so, gimme a sec to find a repro
[10:21:17] postprocess should always be called
[10:23:26] aiko: so I checked for HTTP 500s in the ml-serve-eqiad's outlink transformer
[10:23:45] I found one from today
[10:23:46] File "/opt/lib/python/site-packages/kserve/model.py", line 286, in _http_predict
[10:23:49] raise HTTPStatusError(message, request=response.request, response=response)
[10:23:52] httpx.HTTPStatusError: RuntimeError : The event posted to EventGate has been rejected, please contact the ML team if the issue persists., '500 Internal Server Error' for url 'http://outlink-topic-model-predictor-default.articletopic-outlink/v1/models/outlink-topic-model:predict'
[10:25:19] oh nice!
so transformer will get the error
[10:25:39] and to double check, at the same time in the predictor I see
[10:25:40] aiohttp.client_exceptions.ClientResponseError: 503, message='Service Unavailable', url=URL('http://eventgate-main.discovery.wmnet:4480/v1/events')
[10:26:05] RuntimeError: The event posted to EventGate has been rejected, please contact the ML team if the issue persists.
[10:26:43] postprocess will always be called by the kserve's internals (the code that we extend)
[10:26:58] if we don't define it then it just passes the predictor's response IIUC
[10:27:03] the transformer is basically a proxy
[10:27:12] does it make sense?
[10:27:17] this is my understanding
[10:27:30] (hence the need to just check the transformer's metrics)
[10:27:32] (for the SLO)
[10:27:36] makes sense! thanks for the confirmation :)
[10:29:39] now I think that the only SLO-less service that we run is ORES Legacy
[10:29:51] but do we need to do anything for that error? Was it on the eventgate's side?
[10:30:31] aiko: I was thinking the same, in theory ChangeProp does retry only for 502 and 503s
[10:30:45] so in this case, we returned 500, and it didn't retry (almost sure about it)
[10:30:49] so the score got lost
[10:31:41] we should probably add the 500s among the changeprop's retry scenarios
[10:31:52] IIRC it will try few times before giving up
[10:32:19] anyway, going afk for lunch, ttl!
[10:34:27] I see
[10:34:44] +1 add 500s to changeprop's retry scenarios
[10:46:04] * aiko lunch
[11:37:14] Machine-Learning-Team, serviceops: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (akosiaris) Adding a data point that just crossed my mind, just to rule it out. A mysqldump of the recommendation API database right now sits at 810MB. A bz...
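Editor's sketch of the 10:12-10:34 thread, not code from KServe or ChangeProp: the transformer surfaces the predictor's 500 unchanged, and a caller that retries only on 502/503 gives up after one attempt, while a caller that also retries on 500 gets a second chance. All function names, the three-attempt limit, and the "first call fails, second succeeds" scenario are assumptions made for illustration; the chat itself only says ChangeProp retries on 502/503 "in theory" and "will try few times".

```python
from typing import Callable, Iterable


def predictor(event_accepted: bool) -> int:
    """Hypothetical predictor: 500 when the simulated EventGate post is rejected."""
    return 200 if event_accepted else 500


def transformer(event_accepted: bool) -> int:
    """Hypothetical transformer acting as a proxy in front of the predictor."""
    status = predictor(event_accepted)
    # No custom postprocess in this sketch, so the predictor's status passes through.
    return status


def call_with_retries(
    send: Callable[[], int],
    retry_on: Iterable[int],
    max_attempts: int = 3,  # assumed limit, not taken from any real config
) -> tuple[int, int]:
    """Call send() until the status is not retryable or attempts run out.

    Returns (final_status, attempts_made).
    """
    retryable = set(retry_on)
    attempts = 0
    status = 0
    while attempts < max_attempts:
        attempts += 1
        status = send()
        if status not in retryable:
            break
    return status, attempts


def flaky_backend(outcomes: list[bool]) -> Callable[[], int]:
    """Build a send() whose successive calls follow the given accept/reject outcomes."""
    remaining = iter(outcomes)

    def send() -> int:
        return transformer(next(remaining))

    return send


# First call is rejected by the event backend, the second would succeed.
current = call_with_retries(flaky_backend([False, True]), retry_on={502, 503})
proposed = call_with_retries(flaky_backend([False, True]), retry_on={500, 502, 503})

print(current)   # (500, 1): no retry on 500, the score is lost
print(proposed)  # (200, 2): the retry picks up the successful second call
```

Whether retrying on every 500 is desirable depends on how often a 500 is transient; the sketch only shows the mechanics discussed in the chat.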
[12:26:20] * isaranto lunch
[12:32:12] (CR) AikoChou: [C: +2] "check" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/954613 (owner: AikoChou)
[12:35:26] Machine-Learning-Team, serviceops: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (elukey) @SCherukuwada Hi! Shall we restart the conversation about recommendation-api?
[12:40:15] Machine-Learning-Team, Goal: Support WME migration to Lift Wing - https://phabricator.wikimedia.org/T341698 (elukey) Update: * The WME team is going to switch to Lift Wing (dropping also the `revision-score` stream) on Sept 6th. * We are going to coordinate with them to understand if the new traffic is g...
[12:45:22] Machine-Learning-Team, Goal: Defined and measured SLO for every production service - https://phabricator.wikimedia.org/T341693 (elukey) Update: * Found basic SLIs (latency and HTTP 2xx requests) to use, and their related thresholds. We will start from a baseline of 95% for new services, to refine them to...
[12:45:55] kevinbazira: o/ could you update https://phabricator.wikimedia.org/T341704 with a basic summary before the meeting?
[12:48:28] Machine-Learning-Team, Goal: Lift Wing announced at MVP to the public - https://phabricator.wikimedia.org/T341703 (elukey) Lift Wing has been announced on various mediums as MVP: * Various slack channels. * Product-all meeting. We also have onboarded bots (from ORES) and WME is moving all their traffic...
[12:48:38] (I am updating the other ones, just adding a summary of what we are working on, next steps, status, etc..)
[12:51:03] Machine-Learning-Team, Patch-For-Review: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 (elukey) ` elukey@mwmaint1002:~$ mwscript extensions/OAuthRateLimiter/maintenance/setClientTierName.php --wiki metawiki --client e7bca63fbc20c1bc77a5d1d347...
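Editor's sketch of the availability SLI mentioned in the T341693 update above (share of HTTP 2xx responses, checked against the 95% baseline for new services). The function names and the sample status codes are made up for illustration; the latency SLI is left out because its threshold is not stated in the log.

```python
def availability_sli(status_codes: list[int]) -> float:
    """Fraction of requests answered with an HTTP 2xx status."""
    if not status_codes:
        raise ValueError("no requests observed")
    good = sum(1 for code in status_codes if 200 <= code < 300)
    return good / len(status_codes)


def meets_slo(sli: float, target: float = 0.95) -> bool:
    """Compare an SLI against the target (0.95 is the baseline named in the update)."""
    return sli >= target


# Invented sample: 97 successes, 2 internal errors, 1 unavailable.
sample = [200] * 97 + [500] * 2 + [503]
sli = availability_sli(sample)
print(f"{sli:.2%}", meets_slo(sli))
```

In practice the counts would come from the transformer's request metrics over the SLO window rather than from a list of codes.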
[12:54:06] Machine-Learning-Team, Goal: Content Recommendation API migration completed - https://phabricator.wikimedia.org/T341704 (kevinbazira) In T338805, we containerized the [[ https://gerrit.wikimedia.org/g/research/recommendation-api | Flask web application ]] that runs the Content Translation Recommendation...
[12:54:30] Machine-Learning-Team, Goal: Content Recommendation API migration completed - https://phabricator.wikimedia.org/T341704 (kevinbazira) In T339890, we are working on hosting the recommendation-api container on LiftWing.
[13:02:04] Machine-Learning-Team, Goal: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696 (elukey) Update: * [[ https://ores-legacy.wikimedia.org/ | ores-legacy ]] is fully in production, scaled up and ready to take traffic. * The WME team is going to move their traffic to Lift W...
[13:03:54] Machine-Learning-Team, Goal: Order 2-4 GPU for Lift Wing and Statbox - https://phabricator.wikimedia.org/T341699 (elukey) We have identified the [[ https://www.amd.com/en/products/professional-graphics/instinct-mi50-32gb | Radeon Instinct MI50 ]] as potential candidate for the purchase. Next steps: * Fi...
[13:06:44] Machine-Learning-Team, Goal: Order 2-4 GPU for Lift Wing and Statbox - https://phabricator.wikimedia.org/T341699 (calbon) 16 or 32GB?
[13:06:49] Machine-Learning-Team, MediaWiki-extensions-ORES: Fix ORES Special page - https://phabricator.wikimedia.org/T345407 (elukey) a:isarantopoulos
[13:07:53] Machine-Learning-Team, Goal: Order 2-4 GPU for Lift Wing and Statbox - https://phabricator.wikimedia.org/T341699 (elukey) Definitely 32G!
[13:08:01] elukey: the SLO dashboards look great!
[13:08:05] \o/
[13:08:20] I'll do the one for ores-legacy today/tomorrow, then we'll be covered
[13:08:21] Machine-Learning-Team, Goal: Order 2-4 GPU for Lift Wing and Statbox - https://phabricator.wikimedia.org/T341699 (calbon) Sounds good.
[13:09:09] do u know where I can find out if rebooting the mw nodes is done? I found this https://phabricator.wikimedia.org/T342534
[13:09:28] or just ask in irc wikimedia-serviceops (?)
[13:09:45] isaranto: it is done, I've already deployed, but IIRC Kamila is going to run a DC Switchover test at around 14 UTC
[13:10:08] should be fine to go, but maybe drop a line in #serviceops
[13:10:40] Amir1: shall we deploy this first thing tomorrow ?https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/954886
[13:11:10] so that we have all day to check/test (whatever time "first thing" means, doesnt matter )
[13:11:10] we can do it now
[13:11:16] ah, okay
[13:11:19] sounds good to me
[13:11:55] I'm in to do it now as well. I'll be online/around for 3-4 hours
[13:14:01] elukey: o/ load test result: https://phabricator.wikimedia.org/P52253 (agnostic) https://phabricator.wikimedia.org/P52256 (multilingual)
[13:14:25] revertrisk agnostic's result is good. I think we could also target 20 rps like outlink. multilingual performs poorly, only handles 2.9 rps, so maybe we could set target 3?
[13:14:56] lol poor multi-lingual
[13:15:09] yeah 3 seems ok for it
[13:15:26] for LA, maybe let's do 15? To be more conservative
[13:15:29] what do you think?
[13:15:49] yep 15 sounds good to me
[13:17:07] I'll file a patch for it
[13:23:21] Machine-Learning-Team: enwiki-articlequality version inconsistency between Lift Wing and ORES - https://phabricator.wikimedia.org/T344895 (elukey) Open→Resolved a:isarantopoulos
[13:23:24] Machine-Learning-Team, MW-1.41-notes (1.41.0-wmf.25; 2023-09-05): fiwiki RC filters classify all edits as 'very likely bad faith' - https://phabricator.wikimedia.org/T343308 (elukey) Open→Resolved
[13:23:27] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.20; 2023-08-01): Add User-agent in header of Ores extension - https://phabricator.wikimedia.org/T342605 (elukey) Open→Resolved
[13:23:35] Machine-Learning-Team: Append PYTHONPATH in blubber - https://phabricator.wikimedia.org/T342273 (elukey) Open→Resolved
[13:23:37] Machine-Learning-Team: ores-legacy wikidata errors - https://phabricator.wikimedia.org/T345063 (elukey) Open→Resolved
[13:23:39] Machine-Learning-Team: Add deprecation messages for features not supported in ores-legacy - https://phabricator.wikimedia.org/T342663 (elukey) Open→Resolved
[13:23:44] Machine-Learning-Team: [ores-legacy] Inconsistency when returning features - https://phabricator.wikimedia.org/T342791 (elukey) Open→Resolved
[13:23:50] Machine-Learning-Team, API Platform: Investigate ad-hoc traffic class for API GW rate limits applied to Inference services as used by WME - https://phabricator.wikimedia.org/T338121 (elukey) Open→Resolved
[13:23:58] Machine-Learning-Team: [ores-legacy] add message that v1 support for ORES has been dropped - https://phabricator.wikimedia.org/T341486 (elukey) Open→Resolved
[13:24:40] Machine-Learning-Team, Patch-For-Review: use wikiID in inference name on LW for revscoring models - https://phabricator.wikimedia.org/T342266 (elukey) In progress→Resolved
[13:24:43] Machine-Learning-Team: Add deprecation message for too many
revision ids - https://phabricator.wikimedia.org/T342789 (elukey) Open→Resolved
[13:24:49] Machine-Learning-Team, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (elukey) Open→Resolved
[13:25:01] Machine-Learning-Team, CirrusSearch, Discovery-Search (Current work), Patch-For-Review: Add outlink topic model predictions to CirrusSearch indices - https://phabricator.wikimedia.org/T328276 (elukey)
[13:29:06] chrisalbon_: o/ I have cleaned up a bit the board
[13:29:22] thank you Luca <3
[13:30:32] I have moved some tasks and renamed some columns
[13:30:36] I'll explain in a bit
[13:30:41] if you don't like it we can change them :)
[13:31:16] kevinbazira: o/ you have 5 tasks in "in progress", are you working on all of them?
[13:32:05] elukey I'll update those tasks in a bit.
[13:32:49] super
[13:35:45] Hey, Amir is deploying the change now. LW going live again..!
[13:36:13] I mean it is already live, just going to more wikis :)
[13:36:51] I am now a little bit worried about capacity
[13:37:04] tomorrow WME will migrate as well, I need to review some numbers
[13:37:34] ok we can review, but most wikis dont have that much traffic
[13:37:47] and we have enwiki and wikidata deactivated (using ORES)
[13:38:15] I am always pessimistic in these cases, we should have ordered and racked the new nodes first
[13:38:22] I'd be more comfortable
[13:45:06] We have budget if we wanted to move it around for more racks
[13:45:19] Or use the DSE nodes maybe
[13:45:27] (that feels like a last resort)
[13:46:59] nono I think that we scheduled the order of new nodes for next Q, because we wanted to deal with ORES first
[13:47:21] but the main issue is that even if we have 16 nodes (codfw and eqiad), we basically use only the eqiad ones
[13:47:56] and WME will call us from (I think) AWS us-east-1 (virginia), that will be routed to eqiad for sure
[13:48:11] but the switchover may change things, now that I think about it
[13:50:09] oh did we? If so I forgot.
[13:53:38] yeah we are going to get 16 mores nodes (8 in each DC)
[13:53:51] plus expanding ml-staging (2->4 nodes)
[13:53:56] *more nodes
[13:55:01] isaranto elukey deployed
[13:55:48] thaaaanks Amir1: I'll be monitoring 🤞
[14:05:43] elukey: do we have a grafana graph for reqs to LW?
[14:06:39] Amir1: the best is the istio gw dashboard on logstash
[14:06:54] https://logstash.wikimedia.org/goto/d50def537ef9d88d822d301a338c9327
[14:07:09] okay thanks!
[14:09:22] mmm I don't see the MediaWiki UA though
[14:16:40] yeah, I think there might be not that many?
[14:17:00] yeah but I see none
[14:17:14] or maybe it is low, mmm
[14:17:19] I'll check after this meeting
[14:17:24] I know what's going on
[14:17:38] we basically pass the user/editor's UA as our UA
[14:18:00] https://logstash.wikimedia.org/goto/678c40ff293105f044c161248844fb6c
[14:18:15] that's roughly ours
[14:19:52] oh perhaps this is the issue -> https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ORES/+/941838
[14:20:29] I did this. I attach the user agent of the original request to the LW request
[14:23:04] yeah but that was intentional
[14:23:23] don't remember what was exactly the intention
[14:23:48] I wanted to add "MediaWiki XYZ" as a header, but didn't think it through though
[14:26:22] Machine-Learning-Team, Goal: Stretch: Hosting a production ready version of an LLM - https://phabricator.wikimedia.org/T341695 (calbon) p:Triage→High
[14:26:43] Machine-Learning-Team, Goal: Stretch: Hosting a production ready version of an LLM - https://phabricator.wikimedia.org/T341695 (calbon) p:High→Triage
[14:29:34] Machine-Learning-Team: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 (calbon) a:elukey
[14:32:19] (PS1) Ilias Sarantopoulos: fix: do not use the user agent from original request [extensions/ORES] - https://gerrit.wikimedia.org/r/954951
[14:33:10] Amir1: I removed it and it is back to where it was
[14:37:52] Machine-Learning-Team, Goal: Lift Wing announced as MVP to the public - https://phabricator.wikimedia.org/T341703 (calbon)
[14:40:05] Machine-Learning-Team, Goal: Lift Wing announced as MVP to the public - https://phabricator.wikimedia.org/T341703 (calbon) I'll write the wikitech-l post
[14:40:26] Machine-Learning-Team, Goal: Lift Wing announced as MVP to the public - https://phabricator.wikimedia.org/T341703 (elukey) a:calbon
[14:47:15] Machine-Learning-Team, Goal: Zero traffic on bare metal ORES servers -
https://phabricator.wikimedia.org/T341696 (isarantopoulos) All wikis except enwiki and wikidata are now using Lift Wing for the recent changes filters.
[15:03:01] isaranto: ok if I assign the ores goal to you?
[15:03:33] sure!
[15:04:43] Machine-Learning-Team, Goal: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696 (elukey) a:isarantopoulos
[15:20:44] isaranto: let's remember to send an email to wikitech-l before leaving for the evening
[15:21:04] ack!
[15:21:32] I'll reply to the previous one
[15:21:37] <3
[15:35:51] Amir1: shall I merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ORES/+/954951 so it is deployed with next train?
[15:54:23] new dashboards created! https://grafana.wikimedia.org/dashboards/f/SLOs/slos
[15:56:55] Machine-Learning-Team, Patch-For-Review: Define SLI/SLO for Lift Wing - https://phabricator.wikimedia.org/T327620 (elukey) https://grafana.wikimedia.org/d/slo-Lift_Wing_Article_Topic_Outlink/lift-wing-article-topic-outlink-slo-s https://grafana.wikimedia.org/d/slo-Lift_Wing_Revert_Risk_LA/lift-wing-rever...
[15:56:57] added all the links in https://phabricator.wikimedia.org/T327620#9143265
[15:59:48] sent also a meeting invite for tomorrow, when WME will deploy
[16:00:28] going afk for today, have a nice rest of the day folks!
[16:01:42] good afternoon!
[16:20:25] Machine-Learning-Team, ORES, All-and-every-Wikisource, ArticlePlaceholder, and 55 others: Remove unnecessary targets definitions - https://phabricator.wikimedia.org/T328497 (Jdlrobson)
[16:33:20] new dashboards look niceee
[16:47:12] (CR) Ladsgroup: [C: +2] fix: do not use the user agent from original request [extensions/ORES] - https://gerrit.wikimedia.org/r/954951 (owner: Ilias Sarantopoulos)
[16:49:26] (Merged) jenkins-bot: fix: do not use the user agent from original request [extensions/ORES] - https://gerrit.wikimedia.org/r/954951 (owner: Ilias Sarantopoulos)
[21:56:12] Machine-Learning-Team: Increase Lift Wing rate limit for ImpactVisualizer OAuth2 client - https://phabricator.wikimedia.org/T345394 (Ragesoss) Open→Resolved Thank you!
[23:47:02] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review, User-notice: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (JJMC89) I mostly reused the previous draft by @Quiddity.
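Editor's sketch of the User-Agent issue discussed at 14:09-14:33 and fixed by the merged patch 954951: when the caller forwards the editor's User-Agent, requests from the service cannot be picked out in the gateway logs, whereas a fixed service User-Agent makes them easy to filter. This is an illustration in Python, not the ORES extension's code (the extension is PHP); the header value, function name, and flag are all hypothetical.

```python
# Hypothetical identifier; the real extension's User-Agent string is not shown in the log.
SERVICE_UA = "ExampleMediaWikiClient/0.0 (hypothetical)"


def build_outbound_headers(
    incoming: dict[str, str],
    forward_user_agent: bool,
    service_user_agent: str = SERVICE_UA,
) -> dict[str, str]:
    """Build headers for the outbound model-server request.

    forward_user_agent=True mimics the behaviour described in the chat
    (the editor's UA is attached); False mimics the behaviour after the fix.
    """
    headers = {"Content-Type": "application/json"}
    if forward_user_agent:
        headers["User-Agent"] = incoming.get("User-Agent", service_user_agent)
    else:
        headers["User-Agent"] = service_user_agent
    return headers


editor_request = {"User-Agent": "Mozilla/5.0 (editor's browser)"}
before_fix = build_outbound_headers(editor_request, forward_user_agent=True)
after_fix = build_outbound_headers(editor_request, forward_user_agent=False)

print(before_fix["User-Agent"])  # the editor's browser string
print(after_fix["User-Agent"])   # the fixed service string
```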