[09:00:25] morning!
[09:00:31] wow isaranto nice work!
[09:01:06] I am curious if my failures were related to running the same config on bullseye (vs buster/ubuntu/etc..)
[09:01:18] good morning!
[09:01:58] I think yes.. I am going to write on the ticket but some enchant libraries have changed names (both in latest ubuntu and debian)
[09:03:02] isaranto: yeah I changed them as well in the commit that I linked, but for some reason scipy wasn't really building on bullseye
[09:03:24] some weird errors, that got fixed when I bumped scipy too.. then I ran tests and of course numpy had to be upgraded as well
[09:03:28] I also tried this one with buster image https://github.com/wikimedia/revscoring/pull/531 and it needs more work
[09:03:30] and then other errors
[09:04:23] btw in the first PR if you see I changed some tests because the language dictionaries have changed (some words were in the dict that weren't previously)
[09:05:13] elukey: what if we don't update to bullseye and we just bump the python version in buster? - use python 3.9 over there. shall I check?
[09:07:02] isaranto: I'd prefer to run on bullseye, we have a hard deadline to stop using buster by next september, and SRE is already moving all infra to bullseye
[09:07:18] we spend a little bit of time now and we don't have to think about it later
[09:07:58] but it is not something super urgent, we can work on it during the next couple of months!
[09:12:11] ok, bullseye it is then. I'll try it at some time as well with the dependencies I have in the github branch
[09:12:48] super thanks
[09:31:08] isaranto: another constraint to keep in mind is https://github.com/kserve/kserve/blob/release-0.8/python/kserve/requirements.txt
[09:31:21] that is our current version of kserve... numpy needs to be 1.19.x
[09:31:41] 👍
[09:31:43] but IIRC if we bump scipy too much there will be issues between them
[09:32:08] elukey: is it ok if I upload a new patchset on your bullseye patch?
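Note on the numpy pin mentioned above: a minimal sketch of how one could check whether an installed package falls inside a release series such as 1.19.x before running tests. The helper names `matches_series` and `check_pin` are hypothetical, and the prefix comparison is a simplification that assumes plain dotted version strings; it is not how pip resolves requirements.

```python
from importlib import metadata
from typing import Optional


def matches_series(version: str, series: str) -> bool:
    """Return True if `version` belongs to `series`, e.g. '1.19.5' is in '1.19'."""
    wanted = series.split(".")
    found = version.split(".")
    # Compare component by component so that '1.190.0' is not taken for '1.19.x'.
    return found[: len(wanted)] == wanted


def check_pin(package: str, series: str) -> Optional[bool]:
    """Return None if the package is not installed, else whether it is in the series."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return matches_series(installed, series)


if __name__ == "__main__":
    # The series comes from the kserve 0.8 requirements discussed in the log.
    print("numpy in 1.19.x:", check_pin("numpy", "1.19"))
```

`check_pin` returns `None` when the package is missing, so the three outcomes (not installed, in series, out of series) stay distinguishable.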
[09:32:22] yes yes please
[09:37:32] for example, with a recent scipy and numpy ~= 1.19.3 I get the following when running tests
[09:37:35] ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject
[09:41:51] isaranto: you are totally right, the versions of numpy/scipy/etc.. that we use are ancient.. I am wondering if we'd gain more perf simply by upgrading them
[09:42:08] I'm playing a bit locally for now... If we don't manage to update the deps I would suggest to just put the effort into upgrading kserve and then continue with the rest. I think it isn't worth the effort to hack our way around sth that is going to be deprecated
[09:42:28] btw did u manage to overcome the `ModuleNotFoundError: No module named 'tox.reporter'` error?
[09:47:35] didn't see it yet
[09:49:50] I am trying various versions of scipy, and 1.3.x doesn't lead to the numpy incompatibility pointed out above, but then it fails in a fortran compilation :D
[09:49:55] (PS2) Ilias Sarantopoulos: WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[09:50:36] I think it makes sense to follow your suggestion and go for kserve 0.8
[09:50:40] err 0.9
[09:50:57] https://github.com/kserve/kserve/blob/release-0.9/python/kserve/requirements.txt
[09:51:05] it requires numpy 1.21.x
[09:51:51] that is from 2022, so one big step forward :D
[09:51:59] lemme try to build revscoring with that combination
[09:53:10] (CR) CI reject: [V: -1] WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[09:57:09] ok, I fixed the test image failure about tox. I just removed the tox-monorepo package.. (I don't think it is needed in newer versions)
[10:11:15] ahhh I am stupid, it was the gensim dep leading to the numpy incompatibility
[10:11:53] ok now I see your 3 tests failed isaranto, okok
[10:12:29] I am going to retry with a fresh installation
[10:43:05] (PS3) Ilias Sarantopoulos: WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[10:44:44] (CR) CI reject: [V: -1] WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[10:48:06] isaranto: with https://github.com/elukey/revscoring/commit/8d3cfd0907f037052fd621bebaf9cf16b6948b85 I have all tests running but the serbian one on Bullseye
[10:48:18] it is interesting that you got the errors in test_basque.py
[10:50:10] but I wasn't able to fix the serbian test since no words seem to be recognized
[10:50:14] that smells strange
[10:50:28] and hunspell-sr was not changed...
[10:51:09] it probably changed version from buster to bullseye
[10:51:43] yes 1.6.2 -> 1.7.x
[10:51:54] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[11:21:02] * elukey lunch!
[11:22:13] Machine-Learning-Team, Patch-For-Review: Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (elukey) I use this test environment: ` docker run --rm -it -v $(pwd):/revscoring docker-registry.wikimedia.org/bullseye apt-get install -y python3-pip python3-dev python...
[11:37:24] Machine-Learning-Team, Patch-For-Review: Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (isarantopoulos) The test image in the patch https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/ is working + all tests are successful. H...
[11:44:05] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[11:49:57] I used a matrix setup in CI https://github.com/wikimedia/revscoring/pull/527 so that we test both python 3.7 and 3.9. This one is ready for review. The improvement upon it will be to run it with buster or bullseye image
[12:02:20] (PS4) Ilias Sarantopoulos: WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[12:04:00] (CR) jenkins-bot: WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[12:07:51] wow nice work ilias
[12:08:03] going to review it in depth when I am back but it looks awesome
[12:08:04] <3
[12:09:09] I am not 100% sure if the new deps will run fine on buster as well, but if so even better
[12:09:37] I am still a little puzzled about serbian vs basque but I'll stop asking questions :D
[12:15:38] I am puzzled as well... the thing is that the best setup is to use the same image everywhere for CI - so to use bullseye everywhere. I have a separate PR to work on that https://github.com/wikimedia/revscoring/pull/531
[12:15:44] lets sync later
[12:15:48] * isaranto afk lunch
[14:22:59] Machine-Learning-Team, Patch-For-Review: Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (elukey) @isarantopoulos I tested your patch for 2.11.9 with my set up above, and I get the following error: ` ________________________________ ERROR collecting tests/data...
[14:29:52] Machine-Learning-Team, Patch-For-Review: Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (isarantopoulos) @elukey only the test container was built successfully but using the branch provided in here https://github.com/wikimedia/revscoring/pull/527 I changed the...
[14:42:34] elukey: are u available to jump on a quick call?
[14:42:41] perhaps after the meeting?
[14:44:39] yeah already on one sorry :)
[14:49:02] (PS5) Ilias Sarantopoulos: WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[14:51:59] (CR) CI reject: [V: -1] WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:00:15] (CR) AikoChou: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870520 (https://phabricator.wikimedia.org/T325349) (owner: AikoChou)
[15:06:13] (Merged) jenkins-bot: revertrisk: add torch==1.13.1+cpu [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870520 (https://phabricator.wikimedia.org/T325349) (owner: AikoChou)
[15:16:08] (PS6) Ilias Sarantopoulos: WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:17:49] (CR) CI reject: [V: -1] WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:30:52] (CR) Ilias Sarantopoulos: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:31:29] elukey: I built the bullseye image 🎉
[15:31:36] wooowwww
[15:32:25] yay \o/
[15:32:35] waiting for CI jenkins to verify it again but it is with this https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/870517/
[15:32:35] and this https://github.com/wikimedia/revscoring/pull/527
[15:33:39] Jenkins agrees!
[15:35:24] (PS7) Ilias Sarantopoulos: WIP - Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:35:28] great work Ilias :)
[15:35:43] I think that we can merge the new revscoring version when we are back
[15:35:49] what do you think?
[15:35:52] ty. based on your work
[15:36:21] yes of course
[15:37:46] I have one last glitch for github action based on bullseye https://github.com/wikimedia/revscoring/actions/runs/3758812792/jobs/6387609053
[15:37:46] I cannot run the command `python3 -m nltk.downloader omw sentiwordnet stopwords wordnet` if I get past that it will be ready as well
[15:38:07] I get this thing here which doesn't make much sense to me `python3: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by python3)`
[15:38:18] more info here https://github.com/wikimedia/revscoring/actions/runs/3758812792/jobs/6387609053. Any clues?
[15:40:31] mmm maybe it just needs the libc debian package?
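Note on the `GLIBC_2.34' not found` error above: the symbol version named in the message is the minimum glibc the binary was linked against, so it can be compared with what the image provides. A minimal sketch, assuming a glibc-based Linux; the function names are hypothetical, and `platform.libc_ver()` may return empty strings on other platforms, in which case `local_glibc` returns `None`.

```python
import platform
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def required_glibc(error_text: str) -> Optional[Version]:
    """Extract the (major, minor) glibc version named in a loader error message."""
    match = re.search(r"GLIBC_(\d+)\.(\d+)", error_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def local_glibc() -> Optional[Version]:
    """Return the glibc version Python reports for this system, if any."""
    name, version = platform.libc_ver()
    if name != "glibc" or not version:
        return None
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def can_run(required: Version, available: Version) -> bool:
    """A binary needing `required` can only load on that glibc or a newer one."""
    return available >= required


if __name__ == "__main__":
    message = ("python3: /lib/x86_64-linux-gnu/libc.so.6: "
               "version `GLIBC_2.34' not found (required by python3)")
    needed = required_glibc(message)
    here = local_glibc()
    print("needed:", needed, "available:", here)
    if needed and here:
        print("loadable here:", can_run(needed, here))
```

Comparing tuples of integers avoids the string-comparison trap where "2.4" would sort after "2.34".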
[15:41:38] ah wait sorry I said something stupid - maybe the version of GLIBC is not right
[15:42:03] I see 2.31 on bullseye
[15:42:48] yeah libc6 is there, perhaps it needs some kind of hack. I think it has to do with github actions python because the other runs fine
[15:42:53] isaranto, elukey: this means the binary is trying to resolve a symbol which was only added in glibc 2.34, so won't work on buster/bullseye
[15:43:12] typically happens if it was built on testing/sid
[15:43:37] and the build system detected the presence of a current glibc, so it picked the higher feature set
[15:43:40] moritzm: I was thinking that maybe the pkgs were compiled for ubuntu or similar yes
[15:43:43] hm.. thanks moritzm!
[15:44:22] this has been a problem before, e.g. if a package was built against a glibc supporting openat() (and the various fooat() sys calls) not present in older glibc releases
[15:52:24] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/870900
[15:52:38] elukey: --^ new image for revertrisk model
[15:53:48] Machine-Learning-Team, Patch-For-Review: Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (isarantopoulos) Successfully built revscoring with debian bullseye and python 3.9. The below two PR/patches need to be merged (first the revscoring one) and then the infe...
[15:54:44] merged :)
[15:56:46] thank uuu
[15:57:36] I figured it out (at least why it doesn't work :D ) It has to do with the github actions and the fact that it was built on ubuntu as you said. thank u both https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#using-setup-python-with-a-self-hosted-runner
[16:06:14] ah nice!
[16:06:22] elukey: o/ not sure why the revertrisk pod wasn't recreated after deployment in staging, could you manually delete it for me?
[16:07:18] aiko: maybe puppet didn't run and the deployment-charts repo wasn't updated
[16:07:21] did you see a diff?
[16:07:55] elukey: yep I checked it
[16:09:22] and nsfw pod was recreated but revertrisk wasn't
[16:11:08] aiko: done, really weird
[16:12:41] elukey: yeah.. that's weird
[16:12:42] is the docker image big? It is taking ages
[16:13:21] about 2.24GB
[16:15:22] Is it because we are assigning more resources to the pod?
[16:15:57] we requested 4 cpu and 6G memory
[16:16:40] mmm no I think it is the storage initializer
[16:16:40] botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://thanos-swift.discovery.wmnet/wmf-ml-models?prefix=experimental%2Frevertrisk%2F20221026144108%2F&encoding-type=url"
[16:18:04] let me check if the model is there
[16:19:18] it says connection timeout, that is very suspicious
[16:21:36] weird.. the path in the error message is wrong
[16:22:53] the timestamp in the error msg is 20221026144108
[16:22:56] but in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/870900 it should be 20221214175551
[16:24:21] and I checked the model is there s3://wmf-ml-models/experimental/revertrisk/20221214175551/
[16:27:03] in the diff, I saw
[16:27:07] - name: STORAGE_URI
[16:27:08] - value: "s3://wmf-ml-models/experimental/revertrisk/20221026144108/"
[16:27:08] + value: "s3://wmf-ml-models/experimental/revertrisk/20221214175551/"
[16:27:29] I see this in the isvc's config
[16:27:31] s3://wmf-ml-models/experimental/revertrisk/20221214175551/
[16:27:34] that is ok
[16:32:51] aiko: how big is the model binary?
[16:33:06] did it grow in size? I am wondering if we are hitting a timeout because it is too big
[16:33:07] 2.5G
[16:33:52] and do you know what was before?
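Note on the path mismatch above: the prefix the storage initializer actually requested can be read from the `prefix` query parameter of the failing URL and compared with the configured `STORAGE_URI`. A minimal sketch using only the standard library; the function names are hypothetical and the two strings are the ones quoted in the log.

```python
from urllib.parse import parse_qs, urlparse


def prefix_from_listing_url(url: str) -> str:
    """Return the percent-decoded `prefix` query parameter of a bucket listing URL."""
    query = parse_qs(urlparse(url).query)
    return query["prefix"][0]


def prefix_from_storage_uri(uri: str) -> str:
    """Return the object prefix of an s3://bucket/prefix/ style URI."""
    return urlparse(uri).path.lstrip("/")


def pod_is_stale(listing_url: str, storage_uri: str) -> bool:
    """True when the running pod fetches a different prefix than the config names."""
    return prefix_from_listing_url(listing_url) != prefix_from_storage_uri(storage_uri)


if __name__ == "__main__":
    failing_url = ("https://thanos-swift.discovery.wmnet/wmf-ml-models"
                   "?prefix=experimental%2Frevertrisk%2F20221026144108%2F&encoding-type=url")
    configured = "s3://wmf-ml-models/experimental/revertrisk/20221214175551/"
    print("requested :", prefix_from_listing_url(failing_url))
    print("configured:", prefix_from_storage_uri(configured))
    print("stale pod :", pod_is_stale(failing_url, configured))
```

A mismatch like this one suggests the pod spec that is actually running predates the config change, independent of whatever causes the timeout itself.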
[16:37:12] the first ver model was small before, only 487kb
[16:38:45] it was just an xgboost model without any transformers models
[16:40:57] not sure, connect timeout should be only related to establishing the tcp connection
[16:41:29] but maybe we are hitting a timeout in the fetch of the model
[16:44:47] I ran
[16:44:47] kubectl describe pod revert-risk-model-predictor-default-jschs-deployment-747467gbvv -n experimental to inspect the pod
[16:46:00] but I still saw it was trying to connect to the wrong path
[16:46:00] s3://wmf-ml-models/experimental/revertrisk/20221026144108/
[16:46:00] to copy a model
[16:46:56] ah right interesting!
[16:46:59] s3://wmf-ml-models/experimental/revertrisk/20221026144108/ to local
[16:49:37] ahhhh
[16:49:37] [maximum memory usage per Container is 4Gi, but limit is 6Gi, maximum memory usage per Pod is 6Gi, but limit is 7673479168]
[16:49:45] this was in kubectl get events -n experimental
[16:50:30] I didn't sync the options that you have set Aiko
[16:50:32] I am doing it now
[16:51:38] aha!
[16:54:17] [I 221222 16:54:02 storage:90] Successfully copied s3://wmf-ml-models/experimental/revertrisk/20221214175551/ to /mnt/models
[16:54:20] \o/
[16:54:27] niceee!
[16:54:41] podInitializing..
[16:56:14] running now :)
[16:56:41] all right folks I am going to log-off, tomorrow I'll be off so have wonderful holidays :)
[16:56:46] elukey: do you think we need to also build revscoring dependencies for python 3.7?
[16:56:51] (I'll check IRC tomorrow to see if anybody is stuck)
[16:57:00] aa sorry nevermind.
[16:57:15] isaranto: nono let's concentrate only on 3.9 from now on
[16:57:17] elukey: thanks Luca for your help :)
[16:57:20] have a great time, cu next year!
[16:57:21] what we support basically :)
[16:57:27] o/ o/ o/ you too folks!
[16:57:43] elukey: <3
[17:28:02] logging off ! I added this thingy https://github.com/wikimedia/revscoring/pull/532 so that dependabot can automatically open pull requests. combined with the other PR it will open pull requests + run tests so we will just review them. I used it in the past but it was quite different. these things change faaast. hope it works out of the box. 🤞
[17:31:19] isaranto: looks so cool! thanks for working on that :D have happy holidays and see you next year!!
[17:31:42] I'm here tomorrow as well!
[17:32:06] ohhh yeah I forgot hahaha
[17:32:20] :D
[17:32:51] see u :DDD
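Note on the memory event from 16:49 above: the message mixes `Gi` quantities with a raw byte count, which makes the comparison hard to eyeball. A minimal sketch that converts both to bytes; `to_bytes` and `exceeds` are hypothetical helpers that only handle integer values with the binary suffixes (Ki, Mi, Gi, Ti) or plain byte counts, not the full Kubernetes quantity syntax.

```python
# Binary suffixes only; an assumption for this sketch, not the full quantity grammar.
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '6Gi' or '7673479168' to a number of bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def exceeds(limit: str, maximum: str) -> bool:
    """True when the requested limit is above the allowed maximum."""
    return to_bytes(limit) > to_bytes(maximum)


if __name__ == "__main__":
    # Values copied from the event message quoted in the log.
    print("container over max:", exceeds("6Gi", "4Gi"))
    print("pod over max      :", exceeds("7673479168", "6Gi"))
    print("pod limit in bytes:", to_bytes("7673479168"), "vs", to_bytes("6Gi"))
```

In bytes, the pod-level limit of 7673479168 equals 7Gi plus 150Mi, which is above the 6Gi (6442450944 bytes) per-pod maximum reported in the event.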