[07:01:15] hello folks!
[07:09:54] good morning!
[07:17:38] Machine-Learning-Team: Add deprecation message for ORES UI - https://phabricator.wikimedia.org/T342118 (kevinbazira) a:kevinbazira
[07:29:16] Machine-Learning-Team: Add deprecation message for ORES UI - https://phabricator.wikimedia.org/T342118 (kevinbazira) As I was preparing to add a deprecation message to the ORES UI, I found that this UI exists in two repos: 1. [[ https://github.com/wikimedia/ores/blob/c64bc38b4fe320690b7c47b747dbdd0c5307ec3d...
[07:31:21] Machine-Learning-Team, MediaWiki-extensions-ORES: Add User-agent in header of Ores extension - https://phabricator.wikimedia.org/T342605 (isarantopoulos) a:isarantopoulos
[07:32:33] Machine-Learning-Team, MediaWiki-extensions-ORES: Add User-agent in header of Ores extension - https://phabricator.wikimedia.org/T342605 (elukey) Small nit - at the moment we should get the Mediawiki's UA, that includes the version. It would be nice to keep the same versioning for the new UA as well, to...
[07:33:36] elukey: --^ this is exactly my focus to figure out how it is currently done. there must be a config variable or sth
[07:35:00] isaranto: ack!
[07:35:34] I am reviewing the current access logs, and I see that some have weird user agents
[07:35:38] one is userAgent :D
[07:36:20] the x-forwarded-for header indicates a 172.16 address, that is weird
[07:37:09] ahh no wait it is a cloud vps instance for sure
[07:56:11] ack
[07:58:41] isaranto: in the serviceops chan people wanted me to drink retsina
[07:58:54] I told them that we should all meet in Athens
[08:00:23] lol
[08:03:38] u dont want to drink retsina...
[08:03:47] but an Athens meeting would be great!
[08:47:45] (PS1) Ilias Sarantopoulos: add User-agent header in Lift Wing requests [extensions/ORES] - https://gerrit.wikimedia.org/r/941838 (https://phabricator.wikimedia.org/T342605)
[08:49:15] (CR) CI reject: [V: -1] add User-agent header in Lift Wing requests [extensions/ORES] - https://gerrit.wikimedia.org/r/941838 (https://phabricator.wikimedia.org/T342605) (owner: Ilias Sarantopoulos)
[08:58:37] Machine-Learning-Team: Add ores like threshold support in LW for revscoring models - https://phabricator.wikimedia.org/T341483 (isarantopoulos) After running this query on logstash ` select Time, level, host, user_agent, uri, return_code, response_size, method, duration from logstash-default-1-7.0.0-1-2023.0...
[09:02:27] Lift-Wing, Machine-Learning-Team: Enable batch inference for revscoring models - https://phabricator.wikimedia.org/T342555 (achou) I conducted a small experiment using mwapi to obtain features for multiple revision ids. I used a list of 50 different revision ids (same as the request in T341479) and ran t...
[09:50:25] Machine-Learning-Team: Design/Feature discussion: return codes for LW services to signal "the revision doesn't exist" - https://phabricator.wikimedia.org/T342735 (klausman)
[10:01:29] elukey: here's the official discussion on noGIL python https://discuss.python.org/t/pep-703-making-the-global-interpreter-lock-optional/22606/119
[10:02:28] #nogil
[10:02:33] haha
[10:43:12] * elukey lunch!
[11:01:41] same
[11:02:19] klausman: I added some revision ids and info on the requests here https://phabricator.wikimedia.org/P49709 .
lemme know if you need more info
[11:02:27] * isaranto lunch
[11:23:47] isaranto: ack, thanks, will take a poke in a bit
[12:02:53] isaranto: when hitting https://inference.svc.eqiad.wmnet:30443/v1/models/eswikibookswiki-goodfaith:predict (Host: eswikibookswiki-goodfaith.revscoring-editquality-goodfaith.wikimedia.org) with revision 459820 (from your list), I get:
[12:03:03] The MW API does not have any info related to the rev-id provided as input (459820), therefore it is not possible to extract features properly. One possible cause is the deletion of the page related to the revision id. Please contact the ML-Team if you need more info.
[12:04:57] my b, had forgotten lang:
[12:06:18] huh.
[12:06:44] your curl example (with lang:en) works, but my tool's request with lang:es (or :en) does not. Investigating.
[12:18:16] alright, went with the old range (123000+rand(2999)) and that works fine.
[12:18:36] oups lang should not be there at all (it comes from my bad copy-pasting from a revertrisk request). However the revid you mention is for eswikiquote (not for eswikibooks)
[12:19:03] I get about 42qps with 20 threads, which may be the saturation point. CPU and memory usage look normal for both istio-proxy and the kserve pod
[12:19:20] ah my b for books vs quote.
[12:19:42] so far only tested goodfaith, will now also do damaging, and then quote for both models
[12:23:25] damaging has about 35qps @ 20 threads
[12:28:08] quote-damaging hits 65qps
[12:32:17] quote-goodfaith is around 55qps. All of these were done using the internal endpoints. I'll spot-check the route via APIGW as well, but may run out of quota :)
[12:39:07] nice!
[12:39:50] regarding the mw extension we are interested only for the internal one, so we're good to go
[12:40:04] thanks klausman: !
[12:40:10] np :)
[12:40:24] I have a loong list of features-to-do for the tool by now
[12:41:14] e.g. loading configs for internal/external endpoints from config so I don't have to fiddle with the URLs directly.
So the tool could be asked to run against an X wiki, Y model, internal/external and make the URL etc itself.
[12:41:44] but that is for next-week-me
[12:55:44] very nice :)
[12:57:16] elukey: would you have some time to give me a braindump re: Git-LFS and how we currently use it?
[13:02:38] klausman: sure, but I admit that my knowledge is very limited.. We use gerrit as LFS repository, basically storing binaries on it. Releng needs to support the use case (for example, to allow scap to pull down the binaries when deploying etc..)
[13:02:57] And we want to move to s3 entirely?
[13:03:05] yes definitely
[13:03:09] i.e. LFS is only used for classic ORES?
[13:03:13] yep
[13:03:36] So once we shut down that, our use of LFS-Gerrit goes away and releng is one step closer to not having to support it?
[13:03:50] exactly
[13:04:01] ok, then my mental model was about right.
[13:04:18] I will open a task with Releng to declare our intentions and make sure we coordinate as needed.
[13:06:55] (PS2) Ilias Sarantopoulos: add User-agent header in Lift Wing requests [extensions/ORES] - https://gerrit.wikimedia.org/r/941838 (https://phabricator.wikimedia.org/T342605)
[13:08:25] (CR) CI reject: [V: -1] add User-agent header in Lift Wing requests [extensions/ORES] - https://gerrit.wikimedia.org/r/941838 (https://phabricator.wikimedia.org/T342605) (owner: Ilias Sarantopoulos)
[13:11:16] wow some really nice side effect of having istio gateway's logs - we also have the ones for ores-legacy.wikimedia.org!
[13:11:45] we just need to filter by that domain in logstash
[13:11:49] The joys of (somewhat) unified systems architecture :)
[13:11:58] and we can re-use the same dashboard etc..
[13:12:43] and we'll have the same for recommendation-api etc..
[13:12:54] (as long as we use the istio ingress of course)
[13:14:19] (PS3) Ilias Sarantopoulos: add User-agent header in Lift Wing requests [extensions/ORES] - https://gerrit.wikimedia.org/r/941838 (https://phabricator.wikimedia.org/T342605)
[13:22:32] Machine-Learning-Team, Release-Engineering-Team: ML-Team will soon stop using LFS on Gerrit (for ORES deployment) - https://phabricator.wikimedia.org/T342765 (klausman)
[14:24:32] (PS1) Elukey: ores-legacy: avoid any connection pool when calling Lift Wing [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/941944 (https://phabricator.wikimedia.org/T341479)
[14:27:53] I contacted the SWViewer folks (they own a bot) on discord, and they realized this:
[14:28:00] headers: {"User-Agent": "userAgent", "Authorization": "Bearer " + bearerToken}
[14:28:08] so we found "userAgent" !! \o/
[14:46:17] 🤣
[14:47:34] if anybody has time, I'd appreciate a review for https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/941944
[14:47:45] I'll test it in staging, to see if it improves the conn timeouts
[14:52:08] Looking
[14:53:20] (CR) Klausman: [C: +1] ores-legacy: avoid any connection pool when calling Lift Wing [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/941944 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[14:53:32] thanks!
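Note on the "userAgent" header found at 14:28: the bot was sending the literal string "userAgent" instead of a value that identifies the client. A minimal sketch of building a descriptive header set is below. The function name, the tool name, the version and the contact address are all hypothetical placeholders, not taken from the SWViewer code or from any Wikimedia library.

```python
def build_headers(tool_name: str, version: str, contact: str, bearer_token: str) -> dict[str, str]:
    """Return request headers with a descriptive User-Agent.

    Hypothetical helper: the "<tool>/<version> (<contact>)" layout is a
    common convention, assumed here for illustration only.
    """
    for label, value in (("tool_name", tool_name), ("version", version), ("contact", contact)):
        if not value.strip():
            # Refuse to build a User-Agent that would not identify the client.
            raise ValueError(f"{label} must not be empty")
    user_agent = f"{tool_name}/{version} ({contact})"
    return {
        "User-Agent": user_agent,
        "Authorization": f"Bearer {bearer_token}",
    }


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(build_headers("ExampleBot", "1.0", "bot-owner@example.org", "TOKEN"))
```

With a header like this, the access logs show which tool sent a request and who to contact about it, rather than an opaque placeholder.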
[14:54:53] (CR) Elukey: [C: +2] ores-legacy: avoid any connection pool when calling Lift Wing [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/941944 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[14:55:45] (Merged) jenkins-bot: ores-legacy: avoid any connection pool when calling Lift Wing [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/941944 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[15:04:53] ok something strange from ores-legacy staging
[15:05:32] when I try to make a big query from stat1004, I get 4/5 times responses with timeouts
[15:05:45] then, almost no error, and then no error at all
[15:06:16] I monitored the TCP conns created to liftwing's 30443 port, and they keep increasing (in the envoy container)
[15:07:17] it seems as if the first bunch of tcp conns are slower to be established, then when they are up (envoy proxy -> istio gateway) all good
[15:08:39] mmm do we get cpu throttled?
[15:10:09] ahhh the TLS proxy container is throttled!
[15:20:16] ack
[15:21:29] Lift Wing is live in ores extension on es.wikibooks.org and es.wikiquote.org 🎉 thanks Amir1: ! <3
[15:21:42] I'll be monitoring that all goes well
[15:22:16] niceeee
[15:25:12] ok it is definitely cpu throttling
[15:29:22] nice find!
[15:30:40] IIUC if we have a single pod making all those tcp conns it may get throttled, and return errors
[15:30:50] I'll follow up with serviceops
[15:50:41] Machine-Learning-Team: [ores-legacy] Clienterror is returned in some responses - https://phabricator.wikimedia.org/T341479 (elukey) After a lot of tests I found out that the connection timeouts were decreasing as more and more TCP connections were established between the ores-legacy's tls proxy (envoy) and t...
[15:50:49] added my thoughts to --^
[15:50:56] but yeah it should be the root cause
[15:51:31] for those big queries if the pod is not warmed up sufficiently we get client errors
[15:52:16] my idea is to tune a bit "limit" so that the container can spike up in cpu usage if needed, but we may need to limit the number of parallel rev-ids
[15:52:22] what do you think isaranto ?
[15:53:53] ack!
[15:54:29] one thing is not clear: if we use more pods, the issue will persist if many revids are requested in a single request right?
[15:54:52] so the pod tuning would be about being able to facilitate more requests (?)
[15:55:20] isaranto: yeah it will, but if they are warmed up etc.. there will be more chances to have fewer failures (maybe?)
[15:55:35] limiting the rev-ids/models combinations will surely help
[15:55:48] lemme write it in the task, aiko are you working on it by any chance?
[15:57:26] ok, understood most of it apart from the warming up (no need to explain now, I will ask tomorrow !)
[15:58:19] isaranto: ah nono the basic idea is that more traffic to ores pods == more envoy connections established to lift wing == fewer failures
[15:58:40] for staging we may be out of luck and accept failures if we test those big queries
[15:58:47] but in production in theory we should be good
[16:00:27] elukey: no I'm working on adding msg for non supporting features in ores-legacy
[16:01:33] aiko: yes yes what I meant is if I can add this extra code change in your task, basically returning a 400 with deprecation etc.. if the query string used contains too many rev-ids
[16:01:50] to limit the amount of batches people request in one go
[16:05:09] Machine-Learning-Team: Add deprecation messages for features not supported in ores-legacy - https://phabricator.wikimedia.org/T342663 (elukey) In T341479 we figured out that big query strings cause the ores-legacy's tls proxy to be cpu-throttled for a bit, ending up in client connection errors. We should lim...
[16:05:42] I added some thoughts in --^
[16:05:51] anyway, something for tomorrow :)
[16:06:09] (I can also try to make the code change)
[16:07:19] I can do that in the morning so aiko can focus on the other errors!
[16:07:58] because we also figured out another inconsistency with ores (using features)
[16:08:10] ahh okok
[16:09:57] going afk folks, have a nice rest of the day!
[16:10:51] Machine-Learning-Team: Add deprecation messages for features not supported in ores-legacy - https://phabricator.wikimedia.org/T342663 (isarantopoulos) While checking this task @achou found an inconsistency with returning features, so we should investigate that as well. I am referring to the ability to return...
[16:11:44] hmm, I think it is better to tackle these issues in separate tasks.
[16:14:34] Machine-Learning-Team: Add deprecation message for too many revision ids - https://phabricator.wikimedia.org/T342789 (isarantopoulos)
[16:17:05] ahh sorry I didn't get it. I'm happy to work on that.
[16:19:04] Machine-Learning-Team: [ores-legacy] Inconsistency when returning features - https://phabricator.wikimedia.org/T342791 (isarantopoulos)
[16:19:41] I have added these both as separate tasks so we can break down work a bit more and we can work on one task at a time
[16:20:08] thanks Ilias! :)
[16:20:16] anyone feel free to jump on those as long as you're finished or are blocked in another task
[16:20:27] aiko: :D
[16:20:33] going afk as well!
[16:22:17] Machine-Learning-Team: Add deprecation messages for features not supported in ores-legacy - https://phabricator.wikimedia.org/T342663 (isarantopoulos) The last two comments have been moved in their own tasks.
https://phabricator.wikimedia.org/T342789 and https://phabricator.wikimedia.org/T342791 respectively
[16:26:55] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (isarantopoulos)
[16:27:59] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (isarantopoulos)
[16:32:07] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (isarantopoulos)
[16:32:55] Machine-Learning-Team, MediaWiki-extensions-ORES, Patch-For-Review: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (isarantopoulos)
[16:33:25] --^ (was trying to add a check mark icon but couldn't) nevermind. ciao!
[16:35:45] ciao ciao
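Note on the 16:01 idea of returning a 400 when a query string contains too many rev-ids: a rough sketch of such a check follows. It is not the ores-legacy implementation. The pipe-separated `revids` format, the limit of 20, the function name and the error body layout are all assumptions made for illustration.

```python
MAX_REVIDS = 20  # hypothetical limit, not a value agreed in the discussion


def check_revids_limit(revids_param: str, max_revids: int = MAX_REVIDS) -> tuple[int, dict]:
    """Return (status_code, body) for a pipe-separated revids query value.

    Assumes a query value such as "123|456|789"; empty segments are ignored.
    """
    revids = [r for r in revids_param.split("|") if r]
    if len(revids) > max_revids:
        # Too many rev-ids in one request: reject instead of fanning out.
        return 400, {
            "error": {
                "code": "too many requests",
                "message": (
                    f"Received {len(revids)} revision ids, at most "
                    f"{max_revids} are allowed per request."
                ),
            }
        }
    return 200, {"revids": revids}


if __name__ == "__main__":
    print(check_revids_limit("1|2|3", max_revids=2))
    print(check_revids_limit("1|2|3"))
```

Rejecting oversized batches up front keeps a single request from opening a burst of connections to Lift Wing, which is the pattern that triggered the CPU throttling described above.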