[05:33:53] o/
[05:34:15] Attempting the revert for the CORS configuration in ores-legacy
[05:38:35] (PS1) Ilias Sarantopoulos: ores-legacy: revert CORS configuration [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961271
[05:39:06] oh well I issued a new commit as silly me had also the redirects in the same commit
[05:43:59] (CR) Ilias Sarantopoulos: [C: +2] ores-legacy: revert CORS configuration [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961271 (owner: Ilias Sarantopoulos)
[05:44:51] (Merged) jenkins-bot: ores-legacy: revert CORS configuration [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961271 (owner: Ilias Sarantopoulos)
[06:16:26] It seems to work fine!
[06:19:38] (PS6) Ilias Sarantopoulos: ores-legacy: support empty boolean parameters [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/960587 (https://phabricator.wikimedia.org/T347193)
[07:00:10] Afk for 1h
[07:24:44] Machine-Learning-Team, ORES: ORES possibly blocking PetScan from loading? - https://phabricator.wikimedia.org/T347367 (Magnus) PetScan restored for now. Waiting for the CORS issue to be resolved.
[07:34:23] isaranto: thanks!
[07:34:24] o/
[07:34:53] Do u want me to do sth about the permanent fix?
[07:36:31] nono I'll file a patch for deployment-charts' modules
[07:45:50] Ok! Thanks
[07:58:21] Machine-Learning-Team: petscan expects javascript function callback from ORES - https://phabricator.wikimedia.org/T347317 (Magnus) Open→Resolved Resolved, see T347367
[07:58:35] Machine-Learning-Team, ORES: ORES possibly blocking PetScan from loading? - https://phabricator.wikimedia.org/T347367 (Magnus) Open→Resolved Changed the query so it's now working as before the ORES change.
[08:23:43] isaranto: \o/
[08:23:46] nice :)
[08:25:45] Great!
[08:57:12] elukey: I was thinking that we I should focus on alerts before continuing with kserve upgrade, logging etc. wdyt?
[09:01:22] isaranto: I think that we should add the alert for the job queues, it is a good sign of something going wrong
[09:01:47] the cgroup memory too, but it may be more difficult to figure out what is the right threshold (ideally percentage based)
[09:02:01] cool, I agree!
[09:02:03] cgroup?
[09:04:02] yeah the memory used by each container, it is managed by a cgroup
[09:04:06] (linux cgroup)
[09:09:58] aa ok, thanks for clarifying
[09:36:15] running some errands + early lunch!
[09:42:36] (PS1) AikoChou: events.py: set the same error code when sending events to eventgate [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961347 (https://phabricator.wikimedia.org/T346136)
[09:56:02] (CR) AikoChou: "I tried using wmflib's retry decorator (https://doc.wikimedia.org/wmflib/master/api/wmflib.decorators.html#wmflib.decorators.retry) on sen" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961347 (https://phabricator.wikimedia.org/T346136) (owner: AikoChou)
[09:59:51] Morning! (brely)
[09:59:55] barely*
[10:08:05] elukey: I think I'll wait until next week before I remove any functional bits (LVS etc) for ORES, just in case we have an emergency need. Not just for the unlikely case of some client needing it, but if we need to check what old behavior was.
[10:08:10] wdyt?
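Note on the percentage-based cgroup memory threshold raised at [09:01:47]: a minimal sketch of how such a check could be expressed, assuming cgroup v2, where a container's usage and limit are exposed as the files memory.current and memory.max. The 90% default and the function names are illustrative assumptions, not values or code from the discussion.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_cgroup_v2_memory(cgroup_dir: str = "/sys/fs/cgroup") -> Tuple[int, Optional[int]]:
    """Read (current_bytes, limit_bytes) from a cgroup v2 directory.

    memory.max contains the literal string "max" when no limit is set,
    in which case the limit is returned as None.
    """
    base = Path(cgroup_dir)
    current = int((base / "memory.current").read_text().strip())
    raw_limit = (base / "memory.max").read_text().strip()
    limit = None if raw_limit == "max" else int(raw_limit)
    return current, limit


def memory_usage_percent(current_bytes: int, limit_bytes: Optional[int]) -> Optional[float]:
    """Return usage as a percentage of the limit, or None without a usable limit."""
    if limit_bytes is None or limit_bytes <= 0:
        return None
    return 100.0 * current_bytes / limit_bytes


def should_alert(current_bytes: int, limit_bytes: Optional[int],
                 threshold_percent: float = 90.0) -> bool:
    """True when usage reaches the threshold (90% is an assumed placeholder)."""
    percent = memory_usage_percent(current_bytes, limit_bytes)
    return percent is not None and percent >= threshold_percent
```

Expressing the threshold as a percentage of the limit means the same rule can apply to containers with different memory limits, which is the appeal of the percentage-based approach mentioned above.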
[10:18:09] (CR) Ilias Sarantopoulos: [C: +2] ores-legacy: support empty boolean parameters [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/960587 (https://phabricator.wikimedia.org/T347193) (owner: Ilias Sarantopoulos)
[10:19:04] (Merged) jenkins-bot: ores-legacy: support empty boolean parameters [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/960587 (https://phabricator.wikimedia.org/T347193) (owner: Ilias Sarantopoulos)
[10:40:42] just redeployed ores-legacy with empty boolean params support
[10:43:31] Machine-Learning-Team, ORES, Patch-For-Review: Support for basic boolean flags in ores-legacy - https://phabricator.wikimedia.org/T347193 (isarantopoulos) Added the support for empty boolean features in order not to break existing functionality. The following request now `https://ores.wikimedia.org/...
[10:43:41] Machine-Learning-Team, ORES, Patch-For-Review: Support for basic boolean flags in ores-legacy - https://phabricator.wikimedia.org/T347193 (isarantopoulos) Open→Resolved
[10:46:09] * isaranto lunch!
[11:00:18] ditto
[11:05:58] Hello
[11:16:58] hey Chris
[11:18:56] o/
[11:30:27] Machine-Learning-Team: Investigate recommendation-api-ng internal endpoint failure - https://phabricator.wikimedia.org/T347475 (kevinbazira)
[11:48:07] So... how is it going?
[12:03:20] Everything seems good on ORES front till now
[12:04:46] what happened now?
[12:07:40] aa sry .
I mean until now cause I can't predict the future 😛
[12:07:47] wrong phrasing :)
[12:08:38] lol, I was like "OH SHIT" and was looking through my email for gerrit notifications
[12:09:13] it is just me being cautious
[12:15:02] Machine-Learning-Team, ORES: [ores-legacy] Model not available should return 404 - https://phabricator.wikimedia.org/T347480 (isarantopoulos)
[12:15:37] just found a small bug which would affect our SLOs
[12:18:20] Machine-Learning-Team, ORES: [ores-legacy] Model not available should return 404 - https://phabricator.wikimedia.org/T347480 (isarantopoulos) Same thing can be reported when accessing ` https://ores.wikimedia.org/v3/scores/commonswiki/1234/damaging ` While ` https://ores.wikimedia.org/v3/scores/commonsw...
[12:19:15] I have verified that these are the only types of 5xx responses that we have.
[12:20:11] Machine-Learning-Team, ORES: [ores-legacy] Model not available should return 404 - https://phabricator.wikimedia.org/T347480 (isarantopoulos) a:isarantopoulos
[13:22:40] isaranto: are you deploying things in codfw?
[13:22:51] nope
[13:23:09] I have to nlwiki aq pods that are in terminating state
[13:23:14] two*
[13:23:36] folks a heads up - in ~30 mins service ops will repool eqiad, that includes ml-serve-eqiad
[13:23:46] I was mostly poking around since we have latency warnings about the API firing, and often, they are caused by stuck pods
[13:24:35] Might be autoscaling
[13:26:56] you can check from the metrics if it was
[13:27:08] currently digging into that
[13:32:16] elukey: I can't figure out how to find the reason for a pod (re)start or termination from the Grafana dashboards
[13:36:54] klausman: one thing that you can use: `kubectl get events -n revscoring-articlequality | grep euwiki --color`
[13:38:57] Ah, I keep forgetting about `get events`
[13:42:05] elukey: https://phabricator.wikimedia.org/F37817201 also wondering what this was.
Those are calico/typha pods
[13:42:31] They are all TX (transmit), so it's not image/model downloading.
[13:43:18] no idea, but it is easy to get lost in metrics, try to focus on the ones that you care about :)
[14:00:07] Machine-Learning-Team, ORES: Help migrate SDZeroBot to Lift Wing - https://phabricator.wikimedia.org/T342960 (isarantopoulos) SDZeroBot is now hitting ores-legacy with urls that perform at least 100 LW requests so it fails with 400
[15:06:48] Machine-Learning-Team, ORES: Help migrate SDZeroBot to Lift Wing - https://phabricator.wikimedia.org/T342960 (isarantopoulos) Created a PR to help tackle the issue. https://github.com/siddharthvp/SDZeroBot/pull/35
[15:16:34] isaranto, kevinbazira - one question about https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/960681 - the health check should be something really fast, if we need 10 seconds to process it we should use something else
[15:17:41] and checking in https://phabricator.wikimedia.org/T347475 we may be short on workers
[15:17:49] how many did we configure kevinbazira ?
[15:18:11] elukey: I am investigating whether the 10s will work. If they do then we can reduce them further.
[15:18:51] kevinbazira: sure but I think that the issue is the amount of workers
[15:19:11] it cannot be that querying spec with some values causes a generic call to spec to fail
[15:19:19] it seems as if we can process only one call at a time
[15:19:24] that is not great
[15:19:54] how many workers would be optimal?
[15:19:54] this is why I asked how many workers we are running, if we have only one we probably can serve only one req at a time
[15:20:39] kevinbazira: we need to figure it out, usually it is a balance with the number of cpus, it needs to be tested
[15:21:41] ok, will continue testing. had merged the 10s patch and I am going to test that then test the workers too!
[15:22:27] yeah but I am -2 on keeping the 10s setting :)
[15:22:42] you're right.
I thought as well just to try it
[15:22:43] so even if it works, we need to find a good alternative
[15:22:49] ack
[15:35:07] elukey, isaranto: I've tested the 10s timeout and the readiness probe still fails. going to test with cpus > 1
[15:36:09] (PS1) Ilias Sarantopoulos: ores-legacy: fix 500 issues with wrong wikiId [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961408 (https://phabricator.wikimedia.org/T347480)
[15:36:55] kevinbazira: let's try to think this through before applying patches
[15:37:14] the main problem seems to be that without any other requests, the /api/spec works fast
[15:37:23] (PS2) Ilias Sarantopoulos: ores-legacy: fix 500 issues with wrong wikiId [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961408 (https://phabricator.wikimedia.org/T347480)
[15:37:32] Machine-Learning-Team, ORES, Patch-For-Review: [ores-legacy] Model not available should return 404 instead of 500 - https://phabricator.wikimedia.org/T347480 (isarantopoulos)
[15:37:34] IIUC, when you query /api/spec with some parameter, it takes ages to complete
[15:37:51] this by itself is something to investigate, why a request takes so long?
[15:37:52] Machine-Learning-Team, ORES, Patch-For-Review: [ores-legacy] Model not available should return 400 instead of 500 response code - https://phabricator.wikimedia.org/T347480 (isarantopoulos)
[15:38:17] moreover, from the failure outlined in the task, when a long request is running the probe fails
[15:38:30] so it could be that we are short on workers
[15:38:39] even if you increase cpus it will still be the same
[15:38:43] does it make sense?
[15:39:28] elukey: yes, it does.
in a previous comment you had mentioned the cpus that's why I was thinking along those lines
[15:41:40] kevinbazira: yep but always ask if it is going to solve before sending the patch (I usually picture in my head a reason why it does, it helps me a lot)
[15:42:10] Machine-Learning-Team, ORES, Patch-For-Review: [ores-legacy] Model not available should return 400 instead of 500 response code - https://phabricator.wikimedia.org/T347480 (isarantopoulos) The above patch seems to fix the issue in the same way the other endpoints do - by checking that the models are...
[15:42:23] Machine-Learning-Team, ORES, Patch-For-Review: [ores-legacy] Model not available should return 404 instead of 500 response code - https://phabricator.wikimedia.org/T347480 (isarantopoulos)
[15:43:04] (PS3) Ilias Sarantopoulos: ores-legacy: fix 500 issues with wrong wikiId [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961408 (https://phabricator.wikimedia.org/T347480)
[15:43:13] I have not sent the patch yet, I mentioned it before sending it ...
[15:43:36] ack sure :)
[15:43:58] but let's also discuss high level in here what is the problem etc.. it always helps to brainbounce! :)
[15:44:46] ok, so you've mentioned increasing the workers. is that something you're ok with me testing?
[15:45:49] kevinbazira: you can test anything :) My point is to have a clear idea about what we are testing
[15:46:00] and it is fine to ask otherwise
[15:46:18] we should make sure if we use one worker or not
[15:46:23] as starting point
[16:05:31] going afk for today folks!
[16:05:33] have a nice one!
[16:05:41] kevinbazira: we can work on it tomorrow if you want
[16:08:15] elukey: no problem. we'll pick it up tomorrow.
[16:08:22] enjoy your evening o/
[16:33:31] Going afk, cu folks!
[16:41:05] Starfield beckons, I'm off :)
[16:43:01] me too! have a nice evening folks :)
[16:54:36] night all!
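Note on the single-worker hypothesis discussed between [15:17:41] and [15:46:23]: a toy simulation of why a fast health check can time out behind one slow request when only one worker exists. It assumes a fixed pool of workers that pick up requests in arrival order; the 30s slow request and 0.5s probe durations are made-up numbers, and only the 10s probe timeout comes from the discussion. It is not a model of the actual recommendation-api-ng server.

```python
import heapq
from typing import List, Tuple


def completion_times(requests: List[Tuple[float, float]], num_workers: int) -> List[float]:
    """Simulate FIFO dispatch of requests onto a fixed pool of workers.

    requests: (arrival_time, service_time) pairs, sorted by arrival time.
    Returns the completion time of each request, in the same order.
    """
    # Min-heap holding the time at which each worker becomes free.
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    finished = []
    for arrival, service in requests:
        worker_free = heapq.heappop(free_at)  # earliest available worker
        start = max(arrival, worker_free)     # wait if no worker is free yet
        finish = start + service
        heapq.heappush(free_at, finish)
        finished.append(finish)
    return finished


def probe_latency(num_workers: int) -> float:
    """Latency seen by a probe arriving 1s after a slow request (assumed numbers)."""
    slow_request = (0.0, 30.0)  # hypothetical slow /api/spec call with parameters
    probe = (1.0, 0.5)          # hypothetical fast health check
    done = completion_times([slow_request, probe], num_workers)
    return done[1] - probe[0]


PROBE_TIMEOUT_S = 10.0  # the timeout value tested in the discussion
for workers in (1, 2):
    latency = probe_latency(workers)
    status = "ok" if latency <= PROBE_TIMEOUT_S else "probe fails"
    print(f"workers={workers} probe latency={latency}s -> {status}")
```

In this model the probe waits for the slow request to finish when there is one worker, regardless of how many CPUs that worker has, which matches the point made at [15:38:39].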
[17:28:39] Machine-Learning-Team, ORES: Help migrate SDZeroBot to Lift Wing - https://phabricator.wikimedia.org/T342960 (SD0001) Have merged the above PR so the 400 errors should be resolved now. Regarding migration to Lift Wing, how do I request scores for multiple revision ids in a single request? ORES allowed t...
[17:31:01] Machine-Learning-Team, ORES: Help migrate SDZeroBot to Lift Wing - https://phabricator.wikimedia.org/T342960 (SD0001) If multiple rev ids in a single call is not possible, what would be a reasonable concurrency level to make LW requests in parallel?
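Note on the parallel-requests question at [17:31:01]: the log does not record an answer on what concurrency level is reasonable. Whatever level is agreed, a client can cap its in-flight requests with a semaphore. The sketch below assumes one request per revision id; score_revision is a hypothetical stand-in for the real HTTP call, and max_concurrency=4 is an arbitrary placeholder, not a recommended value.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def score_revision(rev_id: int) -> dict:
    """Hypothetical stand-in for one scoring request; replace with a real HTTP call."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"rev_id": rev_id, "score": None}


async def score_many(
    rev_ids: Iterable[int],
    max_concurrency: int = 4,  # arbitrary placeholder, not a recommended level
    fetch: Callable[[int], Awaitable] = score_revision,
) -> List:
    """Fetch a result per revision id with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(rev_id: int):
        # Only max_concurrency coroutines can hold the semaphore at once.
        async with semaphore:
            return await fetch(rev_id)

    # gather returns results in the same order as the input ids.
    return await asyncio.gather(*(bounded(rev_id) for rev_id in rev_ids))


if __name__ == "__main__":
    results = asyncio.run(score_many([101, 102, 103, 104, 105], max_concurrency=2))
    print([r["rev_id"] for r in results])
```

Keeping the limit as a parameter makes it easy to adjust once the service owners state what level is acceptable.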