[04:35:27] Machine-Learning-Team, Patch-For-Review: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (isarantopoulos) eswiki damaging and goodfaith in eqiad have been bumped to use 4Gi as a limit to delay hitting OOM issues until we fix the issue. Als...
[05:35:46] (CR) DannyS712: [C: +2] tests: Migrate to use SelectQueryBuilder [extensions/ORES] - https://gerrit.wikimedia.org/r/958539 (https://phabricator.wikimedia.org/T312454) (owner: Ladsgroup)
[05:51:21] Machine-Learning-Team, Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (kevinbazira) Thank you for sharing this information, @Isaac. When we were containerizing the Flask app that runs this recommendation-api, we ran into errors where the a...
[05:51:32] (PS4) Elukey: Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/957948 (https://phabricator.wikimedia.org/T346446)
[05:52:24] isaranto: o/ +1 on the memory increase, we have probably multiple leaks :( I fixed the kserve 0.11 change, if you want to start testing it in staging please go ahead
[05:52:34] I need to run some errand, will be back in ~2h
[05:53:29] (Merged) jenkins-bot: tests: Migrate to use SelectQueryBuilder [extensions/ORES] - https://gerrit.wikimedia.org/r/958539 (https://phabricator.wikimedia.org/T312454) (owner: Ladsgroup)
[06:01:51] Ok I'll do! elukey shall I update the kserve chart as well?
[06:20:21] (CR) Ilias Sarantopoulos: [C: +2] Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/957948 (https://phabricator.wikimedia.org/T346446) (owner: Elukey)
[06:21:07] (Merged) jenkins-bot: Upgrade revscoring images to KServe 0.11 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/957948 (https://phabricator.wikimedia.org/T346446) (owner: Elukey)
[06:36:50] Machine-Learning-Team, Observability-Alerting, Patch-For-Review: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (isarantopoulos) @elukey We could do that. but I'll need to find the appropriate query for the alert. On top of that I was thinking that we could add an alert when a pod re...
[07:18:03] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) Just deployed damaging models on staging with kserve 0.11 and successfully ran httpbb tests
[08:00:58] Machine-Learning-Team, Observability-Alerting, Patch-For-Review: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (elukey) >>! In T346151#9177226, @isarantopoulos wrote: > @elukey We could do that. but I'll need to find the appropriate query for the alert. The metric is `kafka_burrow_...
[08:06:28] o/
[08:06:47] isaranto: I tried to query eswiki damaging in staging but it breaks :(
[08:07:24] but the good thing is:
[08:07:25] 2023-09-19 08:05:31.363 uvicorn.access INFO: 127.0.0.6:0 1 - "POST /v1/models/eswiki-damaging%3Apredict HTTP/1.1" 500 Internal Server Error
[08:07:34] so the access log works :)
[08:09:15] o/ I got the same 500. The weird thing is that httpbb tests run successfully so I'm checking to see what is wrong on that front
[08:09:16] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (elukey) Tried to hit eswiki-damaging in staging: ` curl -s https://inference-staging.svc.codfw.wmnet:30443/v1/models/eswiki-damaging:predict -X POST -d '{"rev_id": 153825...
[08:11:35] 🤦 sure, there is no test for eswiki in staging since I just added the deployment this morning
[08:12:22] although I get the same issue with eswikiquote.
[08:12:38] nevermind I'm looking into it and will report once I have some findings
[08:15:05] Machine-Learning-Team, Item Quality Evaluator, Wikidata, wmde-wikidata-tech, and 2 others: Update API calls from ORES to Lift Wing - https://phabricator.wikimedia.org/T343731 (karapayneWMDE)
[08:20:59] Machine-Learning-Team, Goal: Zero traffic on bare metal ORES servers - https://phabricator.wikimedia.org/T341696 (isarantopoulos) The changeprop stream has been stopped and ORES traffic has significantly decreased (4 times less traffic than when the stream was activated.) Current requests are approx 600k...
[08:32:44] isaranto: ack! Btw Twisted PageGetter is a health check from Pybal
[08:32:53] so all traffic that we can discard
[08:33:02] ack!
[08:33:10] shall we block it or not?
[08:34:50] isaranto: Pybal is our load balancing tool (internal), not sure if you are familiar (I can give you more info in case)
[08:35:55] Basically we use Linux's LVS for L4 load balancing (up to TCP basically)
[08:35:56] TIL! I found this on wikitech https://wikitech.wikimedia.org/wiki/PyBal . Will read it and come back with any questions
[08:36:22] and Pybal is a python daemon that does various things (configs LVS backends based on health checks, talks BGP with routers, etc..)
[08:36:28] exactly :)
[08:36:42] Some of the Go-related UAs should also be internal, all monitoring
[08:36:54] my impression is that most of the traffic is now health checks/metrics :D
[08:36:56] with some bot
[08:42:20] Machine-Learning-Team, Goal: Support WME migration to Lift Wing - https://phabricator.wikimedia.org/T341698 (elukey) WME is going to perform some extra tests on Lift Wing this week, and they will enable the full request flow after that. They are not using ORES anymore (and they are planning to deprecate...
[08:51:49] it is weird, I cannot repro the HTTP 500 locally
[08:54:03] I am using enwiki-goodfaith though
[08:54:04] my local docker images take a while to build...
[08:54:12] I just replicated it using articlequality
[09:01:26] nope same result for me
[09:08:38] isaranto: anything specific that you are using?
[09:08:48] as input etc..
[09:08:57] I built the revscoring docker image locally
[09:09:19] ran it as always, and checked revscoring's version etc..
[09:09:39] kserve is definitely 0.11
[09:18:09] No nothing special
[09:18:24] Will be afk for approx 1h
[09:49:15] afk as well for a bit
[11:12:28] Good morning
[11:18:56] o/
[11:30:49] \o
[11:46:23] (PS1) Ilias Sarantopoulos: fix: revscoring model server inputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/958916 (https://phabricator.wikimedia.org/T346446)
[11:54:32] isaranto: is there a change to the upstream code that we can relate to --^ ?
[11:54:39] it seems really weird
[11:55:11] this is what I am looking into. the above patch is just WIP, def not a way to proceed (would have to validate input etc)
[11:55:34] I am building an older image to check the input data types.
[11:57:10] I want to also check the changes in other packages like fastapi
[11:59:21] very strange
[11:59:24] and I can't repro locally yet
[12:01:27] https://www.irccloud.com/pastebin/Tpznvc3r/
[12:01:47] this is how i reproduce it from the current HEAD of main branch
[12:06:39] thanks! Still can't, really strange
[12:07:03] I also used the docker registry's image, just to be sure
[12:08:39] I don't have docker on this machine yet so I can't help sorry
[12:08:43] isaranto: I made it work in staging...
[12:09:04] you are going to cry - I added this to curl '-H "Content-type: application/json"'
[12:09:32] this is why I wasn't able to repro
[12:10:03] I think it is FastAPI, it doesn't recognize the input as json anymore (without a hint from HTTP headers)
[12:10:12] * elukey cries in a corner
[12:11:19] or maybe https://github.com/kserve/kserve/blob/release-0.11/python/kserve/kserve/rest.py#L173
[12:11:54] yeah I think we end up here: https://github.com/kserve/kserve/blob/release-0.11/python/kserve/kserve/rest.py#L206
[12:12:46] but they haven't changed rest.py in ages
[12:12:51] so not sure anymore
[12:13:08] aha!
[12:19:46] shall we hardcode the header somewhere?
[12:20:13] I'm thinking that since we only support rest at the moment it would be unlikely to cause any issue
[12:20:22] especially in revscoring models
[12:23:10] I'd add a check to return a 400 if the input is not a dict, with a hint of checking Content-type etc..
[12:23:44] or, maybe we could check the Content-type header directly and return a 400 with the hint
[12:24:00] seems more correct, even if slightly more annoying for the user
[12:24:03] wdyt?
[12:24:14] (we'll also need to make sure that the API Gateway sets the header)
[12:27:28] Arguably, the API GW should just not touch it.
[12:27:37] (if it's already there, I mean)
[12:30:55] elukey: it seems like the correct thing to do but on the other hand I wouldn't want to add more friction to the users. Especially in this case where we don't support other content types it could be hardcoded. Introducing this to the users along with additional functionality would be justified in the future
[12:32:52] BUT I'm open to what you're saying, since it is only going to be for internal users
[12:33:23] so we also figured out why httpbb worked!
[12:33:42] good catch Luca!
[12:34:55] btw shall we also send sth to wikitech-l today or tomorrow about ores switch to ores-legacy?
[12:35:45] klausman: good point yes, API-GW probably should let it through
[12:36:06] we don't really mangle the payload so it is more correct what you suggest
[12:38:58] snap we don't have the header set in our api-gw examples
[12:39:10] so I guess that multiple folks already call us without (possibly) setting the header
[12:39:23] what about setting it if we don't see it among the headers?
[12:40:16] isaranto: re-thinking - where would you want to set it? Say internal users, if they don't set it we'll get inputs as a byte blob
[12:41:55] good question..I'm not sure if it can be done in kserve somehow, or in the docker image
[12:43:03] I think that the only thing that we can do is attempt a json.loads() if inputs is a byte array
[12:43:24] if it fails, we return a 400
[12:43:33] otherwise we keep going
[12:43:35] wdyt?
[12:44:18] isaranto: Yes we should post something to Wikitech-l, so if someone has a problem (weird things happen) they can see that we made a change. I'm unfortunately out today so maybe tomorrow?
[12:45:09] chrisalbon: we can take care of it!
[12:45:14] yep!
[12:46:09] elukey: I agree let's just transform it, and in the event of failure return 400 with an appropriate msg
[12:54:56] Re: Content-Type: how about we allow any Content-Type (fix it up somewhere on our side), but document that people _should_ be setting it?
[13:04:13] I am more keen on setting it the way elukey described above.
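The fallback agreed on above (try json.loads() when the inputs arrive as a byte array, return a 400 on failure, otherwise keep going) could look roughly like the sketch below. It is only an illustration using the standard library: `BadRequest`, `coerce_inputs`, and the sample `rev_id` value are hypothetical names and data, not KServe's API and not the code in the Gerrit patch.

```python
import json
from typing import Any, Dict, Union


class BadRequest(Exception):
    """Hypothetical error type that a model server would translate to HTTP 400."""

    status_code = 400


def coerce_inputs(inputs: Union[bytes, bytearray, str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return the payload as a dict, decoding a raw body as JSON when needed."""
    if isinstance(inputs, (bytes, bytearray, str)):
        try:
            # json.loads accepts str, bytes and bytearray.
            inputs = json.loads(inputs)
        except ValueError as exc:
            # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
            raise BadRequest(
                "Request body is not valid JSON; "
                "try sending 'Content-Type: application/json'."
            ) from exc
    if not isinstance(inputs, dict):
        raise BadRequest("Expected a JSON object as the request body.")
    return inputs


if __name__ == "__main__":
    # A body sent without the Content-Type header reaches the model as bytes.
    print(coerce_inputs(b'{"rev_id": 12345}'))
```

With this shape, clients that already send the header are unaffected (a dict passes straight through), while header-less requests still work as long as the body parses as a JSON object.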
this patch is such an example https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/958916
[13:04:51] stating it in documentation ofc as a best practice
[13:07:26] I have another idea. Attach a wrapper in the model servers that modifies the request header using the middleware (starlette I guess).
[13:08:13] don't know if it's doable 100%. the only issue I have with the above patch is that requests may have any content type
[13:10:03] agreed
[13:11:35] we could gate calling validate_input on the existing C-T being empty or text/plain and rejecting anything else
[13:18:53] we definitely need to update our docs
[13:18:58] api-gw, wikitech, etc..
[13:19:09] ^^^^
[13:35:53] isaranto chrisalbon: Random note I forgot: You can't change the name of a deployed extension, but you can change the name that is being shown in https://en.wikipedia.org/wiki/Special:Version and many extensions do that. You just need to edit the name field in extension.json. Any ideas for a new name?
[13:36:31] ml-extension?
[13:36:43] so it's pretty vague
[13:37:50] MachineLearningPlatform
[13:37:56] Machine Learning platform
[13:38:03] I mean LW is ML platform
[13:38:10] I LOVE BIKESHEDDING
[13:38:25] lol sure, Machine Learning Platform
[13:41:12] Amir1: I love the idea but doesn't it become more confusing for users? Deploying the "ORES" extension, but seeing another name in the Special pages?
[13:41:50] it still links to Extension:ORES
[13:41:57] Bike shed bike shed bike shed
[13:42:11] it is confusing but the ideal solution is to build the ability to rename an extension
[13:43:08] see flow is "Structured Discussions" in Special:Version :D
[13:45:17] (PS1) Ladsgroup: Change the user-facing name of the extension [extensions/ORES] - https://gerrit.wikimedia.org/r/958935
[13:45:22] ^
[13:46:21] https://usercontent.irccloud-cdn.com/file/tRIAKcYQ/grafik.png
[13:47:44] (CR) Ladsgroup: "Hi, Are you happy with the change and your name being added?" [extensions/ORES] - https://gerrit.wikimedia.org/r/958935 (owner: Ladsgroup)
[13:48:31] ML platform is perhaps too generic since it only scores revisions BUT I'm ok :)
[13:49:32] (CR) Ilias Sarantopoulos: [C: +1] "LGTM" [extensions/ORES] - https://gerrit.wikimedia.org/r/958935 (owner: Ladsgroup)
[13:50:56] Machine-Learning-Team: Update docs for ORES Extension - https://phabricator.wikimedia.org/T346761 (calbon)
[14:11:17] Machine-Learning-Team, MediaWiki-extensions-ORES, Documentation: Update docs for ORES Extension - https://phabricator.wikimedia.org/T346761 (Aklapper) +#Mediawiki-extensions-ORES (please add code base project tags for the world outside of some internal WMF teams - thanks!)
[14:23:45] Machine-Learning-Team, Goal: Defined and measured SLO for every production service - https://phabricator.wikimedia.org/T341693 (calbon) Once we merge a code review, we should be good. New services should have an SLO with at least 95% availability and 95% of requests below a latency. We can refine over ti...
[14:38:57] Machine-Learning-Team, serviceops, Patch-For-Review: Rename the envoy's uses_ingress option to sets_sni - https://phabricator.wikimedia.org/T346638 (calbon) a:elukey
[14:39:05] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (calbon) a:isarantopoulos
[14:39:09] Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (elukey) a:elukey
[14:42:17] isaranto: my evil plan is to extend this further :D
[14:42:34] I know....
[14:42:35] e.g. for online training we can reuse the whole extension for this
[14:42:39] 😛
[14:48:17] Machine-Learning-Team, Commons: Utilize ChatGPT for categorizing and extracting metadata from files on Commons - https://phabricator.wikimedia.org/T345898 (calbon) a:calbon
[15:12:08] elukey: I'm trying to do some memory profiling locally to try to address the issue. Will report my findings tomorrow morning when we can also coordinate
[15:12:36] going afk folks o/
[15:24:47] isaranto: I am doing the same, let's see tomorrow if we have good data!
[15:25:56] I am using eventstreams and eswiki-damaging
[15:25:58] so far no leak
[16:16:08] Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (elukey) Using this [[ https://thanos.wikimedia.org/graph?g0.expr=sum%20by%20(pod)%20(container_memory_usage_bytes%7Bnamespace%3D~%22revscoring-editquality.*%22%2C%20contai...
[16:17:54] folks I am killing some pods before leaving
[16:19:53] Machine-Learning-Team: Isvc pods sometimes fail to serve HTTP requests and blackhole traffic - https://phabricator.wikimedia.org/T346445 (elukey) Killed {es,ko,ru}wiki-{goodfaith,damaging}.
[16:19:56] reported in the pod
[16:20:01] err in the task :)
[16:41:41] Machine-Learning-Team, Section-Level-Image-Suggestions, Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (leila)
[16:41:56] Machine-Learning-Team, Section-Level-Image-Suggestions, Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (leila) @mfossati given that there is no assignment for the Research team in the task description, I'm going...
[17:23:19] Machine-Learning-Team, Patch-For-Review: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (Isaac) > When we were containerizing the Flask app that runs this recommendation-api, we ran into errors where the application expected to load embeddings as shown in T3...
[17:27:52] (CR) Jforrester: [C: +2] Change the user-facing name of the extension [extensions/ORES] - https://gerrit.wikimedia.org/r/958935 (owner: Ladsgroup)
[17:43:10] (Merged) jenkins-bot: Change the user-facing name of the extension [extensions/ORES] - https://gerrit.wikimedia.org/r/958935 (owner: Ladsgroup)
[17:48:26] Machine-Learning-Team, ORES: Creating lists of "weasel words" for ORES (Machine Learning) - https://phabricator.wikimedia.org/T288761 (Ciell) Open→Resolved ORES model for nlwp is done.