[08:20:46] hi folks! [08:29:14] so I am checking https://github.com/kserve/kserve/pull/2444/files [08:29:34] the _http_client instance that we use from kserve.model changed [08:29:49] from tornado's AsyncHTTPClient to httpx.AsyncClient( [08:30:23] in fact the starting error in the stacktrace (when we try to send an event) is [08:30:26] File "/srv/rev/model-server/events.py", line 70, in send_event [08:30:29] await aio_http_client.fetch( [08:30:31] AttributeError: 'AsyncClient' object has no attribute 'fetch' [08:33:46] hey! [08:36:03] o/ [08:36:36] so in theory send_event needs to support both clients at the same time (since we have a mixed kserve version) [08:36:46] does this come from tornado? [08:37:44] indirectly, see [08:37:44] https://github.com/kserve/kserve/blob/release-0.9/python/kserve/kserve/model.py#L96 [08:37:48] vs [08:37:50] usually it is the case that either something has changed , there is no fetch function or it is not imported by default and needs to be explicitly defined [08:38:00] https://github.com/kserve/kserve/blob/release-0.10/python/kserve/kserve/model.py#L137 [08:38:24] isaranto: see above, I mentioned the pull request [08:38:29] they changed the library [08:38:33] tornado first, then httpx [08:38:43] aa sorry [08:39:16] I am going to create a quick patch to support both [08:40:43] so we need to change fetch to post. hmm this wasn't clear from the release notes 🤓 [08:41:08] more than that, the API is completely different [08:41:32] I think there is more, see https://github.com/kserve/kserve/blob/release-0.10/python/kserve/kserve/model.py#L286 [08:41:44] yes I also found this one `AttributeError: module 'tornado' has no attribute 'web'` [08:41:51] we currently use tornado.web.HTTPError [08:42:53] ah ok in https://github.com/kserve/kserve/commit/e126d019ac9598f71ce747c81ad512d0b704f2bf there is an explicit "Remove tornado handlers" [08:43:55] and nothing is mentioned in https://github.com/kserve/kserve/releases [08:44:00] lovely [08:45:47] one thing that I don't recall is if we use http_client_session [08:46:19] it is aiohttp-based, and independent from the http_client offered by kserve [08:46:32] we could probably use it to be more resilient to upgrades like this one [08:46:44] great thorough look Luca! what an eye [08:46:46] `Use httpx.AsyncClient to replace tornado` [08:48:23] going to test "our" http_client_session attribute, maybe it is as simple as using it in send_event [08:48:38] I am looking into an nltk related error and will jump into this afterwards [08:48:48] super so we can do things in parallel :) [08:48:56] good morning :D [08:49:03] good morning! [08:49:08] 🤗 [08:49:40] I think that a set of unit tests would help to catch all these things in CI [08:52:18] yes definitely, we never really added something significant (always in testing phase, but now..) [08:52:43] testing async code like this is probably a little weird, buuut with mocks etc... should probably be fine [08:57:37] TIL also https://www.uvicorn.org/ [09:01:44] yes! it is used by fastapi [09:02:13] yeah but kserve changed everything from 0.9 -> 0.10 without a big warning :D [09:02:26] I mean I get that they are releasing fast / breaking stuff etc... [09:02:29] haven't used it standalone tbh [09:02:31] but it seems a little brutal for users :D [09:03:20] uvicorn was there way before [09:03:28] I mean as a requirement [09:09:14] sure but it wasn't really used afaics [09:09:16] tornado was [09:09:52] my point is that users will assume tornado behind the scenes, and use it accordingly (like returning http codes etc..) [09:41:11] isaranto: the new precommit stuff is great <3 [09:42:00] happy if u enjoy it <3 [09:50:43] yeah it is really great [10:06:03] I am almost done with a first draft of the changes for revscoring [10:06:13] need more tests though [10:15:53] (03PS1) 10Elukey: WIP - Move revscoring model server to fastapi and adapt events [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883541 [10:17:02] isaranto: this is the draft --^ [10:17:12] afaics it seems to work, need to test outlink [10:17:25] it is a transition step, it requires a cleanup/improvement afterwards [10:17:38] but the 0.10 migration is bigger than 0.9's [10:36:45] ack. will take a look [10:37:15] the revscoring error is gone, I tested multiple things [10:37:26] outlink still not tested but I'll do it soon [10:42:15] 10Machine-Learning-Team: Investigate if the mediawiki.revision-score stream can be broken down into multiple ones with ChangeProp - https://phabricator.wikimedia.org/T327302 (10elukey) After a long debugging with Hugh, we figured out the problem and reported in T327805. Hugh sent a patch to change the CA BUNDLE... [11:10:16] I am going to issue another patch to make tornado imports explicit e.g. `from tornado.web import HTTPError` as they don't exist in the __init__.py of the module (until version 6.1) which means they aren't imported by default [11:10:53] won't affect anything else [11:11:08] * isaranto 's famous last words 😄 [11:12:25] isaranto: what do you mean? We run 6.1 though [11:13:03] I get these errors if I issue wrong input [11:13:10] with my patch? [11:13:22] e.g. {"blahblah": 123} instead of rev_id [11:13:37] not with your patch, in general in python 3.9 [11:14:23] ``` [11:14:24] File "/srv/rev/model-server/model_servers.py", line 143, in preprocess [11:14:24] rev_id = preprocess_utils.get_rev_id(inputs, self.REVISION_CREATE_EVENT_KEY) [11:14:24] File "/srv/rev/model-server/preprocess_utils.py", line 31, in get_rev_id [11:14:24] raise tornado.web.HTTPError( [11:14:24] AttributeError: module 'tornado' has no attribute 'web' [11:14:24] ``` [11:14:27] yeah but if you see the patch above I removed all tornado references from the revscoring model server [11:14:42] so it shouldn't be needed an explicit import [11:15:18] this is why I was asking :) [11:16:19] yeah you're right I saw this line `raise tornado.web.HTTPError` in `outlink-topic-model/model-server/model.py` of your patch but it is still a draft [11:16:24] (03PS2) 10Elukey: Move revscoring model server to fastapi and adapt events [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883541 [11:17:05] will take a look when it is ready for review then and will continue with my stuff [11:18:08] isaranto: mmm ok but then I don't understand why we need an explicit import, if tornado is imported [11:18:18] is it something new of py3.9? [11:18:43] I will checkout an older image to see what the issue is [11:18:51] I mean what was the behavior in the past [11:19:37] sure but let's also fix one thing at the time [11:19:47] we have too many things in flight, we can iterate [11:20:04] outlink is not moved to 3.9 now, so we can keep the old behavior if it works [11:20:16] I am testing the image atm [11:21:08] yeah I won't do anything about outlink you re right [11:21:56] * isaranto needs to pay close attention to what Luca is saying in IRC [11:23:38] also outlink needs some love, afaics we don't check the features in the request and we have some "print()" in there [11:25:39] (03PS1) 10Ilias Sarantopoulos: (WIP) - update README with pre-commit info [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883549 (https://phabricator.wikimedia.org/T325198) [11:28:48] (03PS1) 10Ilias Sarantopoulos: update nltk-dependency [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883550 [11:29:30] (03PS2) 10Ilias Sarantopoulos: update nltk-dependency [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883550 [11:29:32] (03CR) 10Elukey: [C: 03+1] update nltk-dependency [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883550 (owner: 10Ilias Sarantopoulos) [11:30:31] elukey: what took u so long 😛 [11:31:07] hehe thanks for the amazing response times [11:31:28] I will deploy this to staging and test it a bit more [11:31:50] I would love to be able to deploy to staging on my own without bugging u... [11:38:06] yeah I feel the pain sorry, I'll try to check in deployment-charts what we can do! [11:38:22] going afk for lunch + gym, when I get back I'll finish testing outlink! [12:19:21] (03CR) 10Ilias Sarantopoulos: [C: 03+2] update nltk-dependency [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883550 (owner: 10Ilias Sarantopoulos) [12:20:27] (03Merged) 10jenkins-bot: update nltk-dependency [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883550 (owner: 10Ilias Sarantopoulos) [12:39:01] ok, I checked an older version and it produces the same error with tornado... [12:45:54] because the server doesn't crash it isn't a big deal but it isn't the correct behaviour. [13:03:28] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (10kevinbazira) [13:04:19] 10Machine-Learning-Team, 10Add-Link, 10Growth-Team, 10User-notice: Deploy "add a link" to 10th round of wikis - https://phabricator.wikimedia.org/T308135 (10kevinbazira) a:03kevinbazira [14:08:52] isaranto: ah snap I thought it worked, weird [14:09:04] okok I'll test it as well, and provide the correct import then [14:11:06] isaranto: unblocked for the nltk change, you can deploy :) [14:28:10] elukey: thanks! regarding tornado, I can submit a commit on your patch if u want [14:28:59] isaranto: nono lemme fix it, it should take a min [14:29:23] I remember it was working though, we really need some unit test [14:30:19] all i did and worked was this: from tornado.web import HTTPError or whatever it is (since u removed HTTError [14:34:00] so the issue affects also preprocess_utils.py, I haven't seen it this morning [14:34:29] that is shared among multiple model servers [14:34:32] like events.py [14:35:02] ok I'll apply the same fix [14:36:29] yes. you'll see it only when we try to raise that error [14:36:59] the downside with interpreted languages...and that's why unit testing is important. anyway.. [14:37:19] in other news, everything I have tested seems to work now in revscoring 3.9! 🎉 [14:37:43] I'm ready to deploy to prod and then probably jump on the testing ticket so I can test all the models [14:39:06] I mean in revscoring with python 3.9 + bullseye [14:41:59] isaranto: we are definitely not ready for prod, the event stuff is completely broken [14:42:17] as well as all the corner case code that raises tornado errors no? [14:46:41] Hello! [14:49:06] (03PS3) 10Elukey: Move revscoring model server to fastapi and adapt events [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883541 [14:49:07] morning! [14:49:15] isaranto: updated the code review [14:49:22] morning! [14:49:57] (03PS4) 10Elukey: Move revscoring model server to fastapi and adapt events [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883541 [14:51:45] you're right, for some reason I thought we had deployed to prod and some things were broken - I checked and I was wrong, just confused [14:52:36] so then we can fix the tornado errors first - which are already in prod though 😛 [14:53:07] (03PS5) 10Elukey: Move revscoring model server to fastapi and adapt events [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883541 [14:53:16] shall I work on the testing ticket from tomorrow then or do u need some help with other things first? [14:53:49] nevermind lets talk in a bit [14:53:55] isaranto: something must have happened recently with the prod images, I am 100% sure that the tornado errors worked fine (me and Aiko tested them) :( [14:54:21] if you want to check the code review above and comment what's wrong etc.. while I test it would help a lot [14:54:41] after that in theory we should have everything fixed [14:58:43] I'll do that! [14:58:53] I'm also building an even older image to check [15:00:19] so far the revscoring model server seems to work fine, with and without events [15:15:29] 10Machine-Learning-Team: get a GPU on Lift Wing - https://phabricator.wikimedia.org/T327923 (10calbon) [15:46:45] 10Machine-Learning-Team, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 9 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10ayounsi) [15:55:51] 10Machine-Learning-Team, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui) I'll check our db-related hosts and I'll get back to you tomorrow [16:03:19] chrisalbon: the avatar is a little creepy but very interesting idea :D [16:03:49] After watching it like 20 times I've decided the fact her shoulders don't move puts it in the uncanny valley [16:54:08] isaranto: mmm the event.py calls in outlink may need more follow up, I am testing them in the transformer/predictor architecture and the error propagation is weird (at least in the docker setup) [16:54:18] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "I have a simple question regarding moving functions around, other than that LGTM!" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883541 (owner: 10Elukey) [16:54:26] everything works but it may need more follow ups [16:54:38] elukey: you're good to go from my side [16:55:12] initially I didn't understand why u moved some functions around but i got it afterwards (kserve 0.8-9 vs 0.10) [16:55:16] right? [16:55:28] (03CR) 10Elukey: Move revscoring model server to fastapi and adapt events (032 comments) [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/883541 (owner: 10Elukey) [16:55:51] isaranto: exactly yes, do you want some comments about it in the commit msg or elsewhere? [17:01:17] no I think it is ok for me. If we left it in the same place it would require an explanation. moving it solves the problem. these are only used by revscoring anyway (for now at least) [17:02:08] logging off folks, cu tomorrow! [17:07:38] night! [17:20:55] created the new PKI intermediate CAs for the k8s 1.23 upgrade, tomorrow I'll keep going :) [17:20:59] have a nice rest of the day folks! [17:31:21] night! [18:42:07] 10Machine-Learning-Team, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10RKemper) [18:45:10] 10Machine-Learning-Team, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10RKemper) [18:50:33] 10Machine-Learning-Team, 10DBA, 10Data-Engineering, 10Data-Persistence, and 10 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10LSobanski)