[04:40:25] (CR) Kevin Bazira: [C: +2] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/875284 (https://phabricator.wikimedia.org/T325198) (owner: Ilias Sarantopoulos)
[04:46:22] (Merged) jenkins-bot: Add pre-commit hooks [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/875284 (https://phabricator.wikimedia.org/T325198) (owner: Ilias Sarantopoulos)
[05:01:59] (CR) Kevin Bazira: [C: +1] "Besides the merge conflict, everything else LGTM." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/879868 (https://phabricator.wikimedia.org/T325295) (owner: AikoChou)
[08:12:00] hello folks :)
[08:12:31] hi elukey o/
[08:12:33] Happy New Year!
[08:14:38] Hello Kevin, same to you!
[08:40:59] hey Luca 🤗
[08:42:01] isaranto: o/
[08:43:04] I didn't check all the updates yet, but if you are still battling with py3.9 deps we could in theory bump kserve to 0.9 in the docker image and see if it alleviates the problems (if we use 0.8 for the control plane it should be fine, we'll upgrade it soon anyway)
[08:59:27] ah snap, I see https://phabricator.wikimedia.org/T325657#8524276
[08:59:36] this is a major thing to do, sigh
[09:00:25] yes...
[09:04:02] it is a big effort but we can't really skip it
[09:11:46] we need a plan for how to approach the python upgrade + make retraining easier. this is/will be a recurring topic
[09:13:20] (CR) Elukey: "nice!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/871185 (https://phabricator.wikimedia.org/T323624) (owner: Ilias Sarantopoulos)
[09:13:34] definitely yes, with so many models etc..
[09:14:12] a related subject is revscoring, since we should keep its deps as up to date as possible, but at some point it will be necessary to change some code etc..
[09:14:30] either we deprecate it and focus on new libs, or we should do some work on it
[09:14:45] not saying to improve it etc.., only basic maintenance
[09:15:27] I think it applies to all the models we deploy, at some point we may need to update code or add something, but ok yes, maybe for revscoring we could take another approach
[09:15:49] also +1 on the kserve upgrade. We could work on it this week, wdyt?
[09:16:04] unless you have something else in mind - more urgent
[09:16:06] yes sure sure, did you see the steps in the task?
[09:16:46] I can explain at a high level what we did in the past (I know that you have experience with kserve but I'm not sure how much at the wonderful k8s level)
[09:16:57] https://phabricator.wikimedia.org/T325528
[09:17:53] there are two layers to the upgrade - the k8s controller and the model server docker images
[09:18:21] IN THEORY (but I haven't really found any docs about it) we could upgrade the model servers beforehand, and then the control plane
[09:18:45] if the upgrade is a minor version (like in this case, 0.8 -> 0.9) it should be ok
[09:21:36] yes I think it is quite clear. from an infra point of view we "just" need to upgrade the kserve controller to the new version (as long as it is compatible with the deployed knative and istio versions)
[09:22:03] I do not understand this step though: `Update docker images in the production-images repository`. any hints?
[09:24:18] (CR) Ilias Sarantopoulos: [C: +1] "LGTM! my bad as well. I didn't know in which order we were going to merge the patches. all good!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/879868 (https://phabricator.wikimedia.org/T325295) (owner: AikoChou)
[09:24:40] ah yes - we cannot use docker images from external docker registries, so we need to import and build the dockerfiles ourselves
[09:24:45] in this case, the kserve controller
[09:25:14] clear!
[09:25:35] the last time the upgrade went fine since we didn't really care about traffic etc..
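(Not from the channel - a generic sketch of the two-layer split described above: the model-server side is just the predictor image referenced by an InferenceService, which can be rebuilt against kserve 0.9 while the controller stays on 0.8. All names and the image tag below are hypothetical.)

```yaml
# Hypothetical InferenceService; only the predictor container image is
# rebuilt against kserve 0.9 - the cluster-side kserve controller
# (still 0.8) reconciles it as before, since the bump is a minor version.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-model            # hypothetical name
spec:
  predictor:
    containers:
      - name: kserve-container
        image: registry.example/example-model:kserve-0.9  # assumed tag
```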
[09:26:01] same thing now, but soon we'll hopefully have some real users and we should start thinking about how to upgrade less brutally :D
[09:26:15] kevinbazira: I saw you merged this one https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/875284
[09:27:43] elukey: if you can take a look whenever you find some time. I'd like us to experiment and figure out something that works. at this moment the pre-commit hooks do not run for all files in the repo, just for the files that the revscoring test image copies inside it
[09:29:01] sure, I'll do it today
[09:29:05] (hopefully)
[09:34:20] it is not urgent, I just pointed it out because it is already merged
[09:35:18] we added this new linter https://github.com/charliermarsh/ruff which is super fast and already adopted by high-profile projects, e.g. pandas, airflow, fastAPI etc
[09:40:29] yep yep, I saw it, at some point talk with volans (Riccardo) about it, he'll be interested for sure
[09:42:16] ack
[10:17:03] https://github.com/python/peps/pull/2955 :O
[10:29:09] interesting!
[11:09:37] Morning!
[11:11:12] Ah yes, PEP 703. I have been looking at it and discussed it with a few local Pythonistas over the last week. It's of course not the first attempt at getting rid of the GIL, but I think its approach to the problem and (in no small part) the increased pressure on modern Python to have an answer to non-process parallelism make it more likely to be implemented than most of its predecessors. The
[11:11:14] "breaking C API" problem is a biggie though.
[11:11:37] It also gets bonus points for referencing Paul McK's book on parallel programming :D
[11:12:11] we'll see, maybe this is the good one :)
[11:12:34] isaranto: my only concern with ruff is that it seems a little bit too new
[11:12:41] Fingers crossed. Though the cynic in me moans about having yet another 2-to-3 crisis on our hands.
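(For context on the PEP 703 discussion above - a minimal sketch of the CPU-bound threading pattern the GIL currently serializes. Under today's GIL these two threads take turns running bytecode; a nogil build would let them run in parallel. Correctness is unaffected either way, only throughput.)

```python
import threading

def sum_range(n, out, idx):
    # Pure-Python CPU-bound work: exactly the case the GIL serializes.
    total = 0
    for i in range(n):
        total += i
    out[idx] = total

results = [0, 0]
threads = [threading.Thread(target=sum_range, args=(1_000_000, results, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both threads compute the same sum regardless of GIL vs nogil.
print(results[0] == results[1])  # True
```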
[11:14:05] in theory it is an opt-in feature, so it shouldn't really break things by default
[11:15:51] it is new, but tbh its adoption by so many high-profile projects gave me more confidence :D . If we have even the slightest problem we can just swap it out for flake8 anytime
[11:17:44] definitely yes.. just to understand - in your change we install pre-commit during tests, which in turn reads its config and pulls down ruff, right?
[11:34:47] * elukey lunch!
[11:48:29] yes, exactly. when you do this in your local setup it happens only once (or when you change your config), but it is common to use it this way. and indeed it is a nice way to use the same setup in all environments (local & CI)
[12:01:14] At least it doesn't build it ab initio (first compile the Rust toolchain, then...)
[12:11:13] * isaranto lunch o'clock!
[13:19:30] * klausman late lunch
[14:33:51] klausman: if you are ok with it, I'd start merging the knative stuff (new images + chart)
[14:34:09] IIUC Janis started to upgrade k8s on serviceops' staging
[14:43:11] wfm!
[14:43:34] If you want any help with merge/deploy, lmk
[14:53:59] do you want to take care of the production-images side? (I don't recall if you have built/published the images before or not)
[14:58:30] Nah, never done it
[14:58:35] yes I can.
I haven't built/published before
[14:58:55] sry, jumped into your conv :D
[15:01:13] isaranto: only SREs can do this bit :(
[15:01:20] :(
[15:01:29] :D
[15:01:42] klausman: when you have time let me know so we can do it
[15:02:09] elukey: currently in the DSE meeting, not sure if it'll be anything of length/substance :D
[15:02:25] yep yep, even tomorrow
[15:02:47] Let's sync up after this one (in 30m at the most)
[15:36:12] so the procedure is the following:
[15:36:42] 1) merge the change for knative, in this case https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/861349
[15:36:52] we need to V:+2 since there is no CI atm
[15:37:36] 2) ssh to build2001, cd to /srv/images/production-images/ and git fetch/diff/pull the change (needs to be done as root)
[15:37:53] 3) run the build-production-images script
[15:38:04] it will take care of building + publishing if the images are ok
[15:38:08] For 2): does one usually use the specific commit or HEAD?
[15:38:59] usually HEAD, but the build script will list what images are being built
[15:39:22] it may happen that we build other images too, so I usually fetch/diff/pull to make sure I am pulling down only my commit
[15:39:29] or better, my changes
[15:39:32] Roger
[15:39:37] if there are more, I ping people
[15:39:54] Does the script pause before doing the whole thing?
[15:40:07] nope
[15:40:26] it uses docker-pkg behind the scenes
[15:41:26] when you say 'run the build-production-images' --- where does this script live?
[15:41:45] you should have it in root's PATH
[15:42:05] ah, right
[15:42:53] trying to find the docs on wikitech but I can't find them
[15:43:06] I'll do it tomorrow, since a friend is visiting and will arrive between 1700 and 1800, and I'd rather have more time in case something goes wrong.
[15:44:24] sure
[15:45:45] Also, in the DSE meeting David Casse asked about the new Benthos-based Kafka flows and how they work.
Since I wasn't sure about details, I asked him to open a Phab task to discuss that [15:48:51] klausman: we haven't really decided about Benthos yet, all still in flux [15:49:21] Yeah, I think David wants to be sure his concerns as a consumer are heard, so to speak [15:50:44] in theory if would be nice if those streams were handled by Data Engineering's stream processor (likely Flink), need to check with them if there are news or not (otherwise we can probably start thinking about benthos on k8s) [15:59:25] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (10elukey) I am +1 in investing time for retrain if needed, it is a sizeable amount of work but we'll need to migrate away from Buster by September :) [16:21:31] 10Machine-Learning-Team, 10Patch-For-Review: Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (10isarantopoulos) I would recommend to create some simple pipelines (mlflow, airflow, argo) or just containerize the training procedure. It seems that we may have to deal wi... [16:22:10] is airflow the only supported pipeline orchestrator @ WMF at the moment? [16:29:35] in theory yes, it is handled by Data Engineering [16:58:36] new kserve config yaml is around 18k lines [16:58:43] * elukey cries in a corner [17:16:22] * elukey afk! [20:38:00] elukey: our timeline for flink stream currently is: this quarter, dogfood our own pipeline (page content change). next quarter: use case(s) with some libraries and tooling to make everything nice. [20:38:19] after that, more general platform support for anyone to build their own with minimal handholding.