[08:13:35] ottomata: o/ thanks for the timeline! So IIUC in a couple of quarters from now (say around July) there will be some initial support for use cases like revision-score-model-xxx right?
[08:14:29] I am also interested in the "build their own part" - will there be the possibility to add a stream to a "central" / DE-managed Flink cluster or we'll have multiple team-owned flink deployments?
[08:15:00] In the latter use case we may prefer Benthos on k8s if tests look good, a central stream processor on the other hand would be really handy
[08:22:57] good morning folks :)
[09:00:13] klausman: o/
[09:00:39] I merged the knative-serving chart now, it is not going to be applied until we switch the pins in admin_ng:
[09:00:42] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/common.yaml#138
[09:00:48] (writing down as FYI)
[09:05:32] Morning!
[09:05:37] And ack :)
[09:29:10] Machine-Learning-Team, Patch-For-Review: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey) https://kserve.github.io/website/0.9/blog/articles/2022-07-21-KServe-0.9-release/
[09:29:15] isaranto: https://kserve.github.io/website/0.9/blog/articles/2022-07-21-KServe-0.9-release/
[09:29:25] some interesting add-ons in 0.9
[09:33:23] nice additions! it seems things from the training pipelines make it to inference: the ability to create a DAG mostly
[09:33:36] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kostajh)
[09:33:54] yep yep
[09:41:47] elukey: will +2 automerge or do I have to do both?
[09:46:24] both :)
[09:46:31] Alright
[09:47:02] And there are no further merging steps before what we talked about yesterday?
[09:47:57] Should the build script run as root as well, or is that only needed for git?
[09:48:15] nvm, it's only in root's PATH, so obs root.
[09:48:23] Running now.
[09:49:04] https://phabricator.wikimedia.org/P43170 So far, so good
[09:49:39] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kostajh) @kevinbazira it looks like Chechen wiki (`cewiki`) didn't get into the list of wikis in `wikis.txt` for some reason, and so t...
[09:55:53] perfect looks good
[09:58:08] on a related note - we have unified the revscoring docker images into one, so we can now in theory clean up our registry
[09:58:33] I am wondering if anybody used/uses the revscoring images from the community but I doubt it
[09:58:50] Ok, build script is now in the publishing stage
[09:58:51] in the future we may need to be conservative but in this case we can probably prune them
[09:59:17] Could we make the old revscoring stuff unavailable in a way that is easily reverted?
[09:59:37] Just to see if anyone notices for a while, and then clean them up for good if nobody complains
[10:00:05] build+publish are complete
[10:00:25] not sure, I think that we have a way to clean up the registry from images but not to simply unpublish/disable them
[10:00:48] anyway, I'll open a task :)
[10:00:56] thanks for the build + publish!
[10:01:16] https://wikitech.wikimedia.org/wiki/Docker-registry#Deleting_images
[10:04:49] Did `docker pull docker-registry.wikimedia.org/knative-build:latest` and got the expected image
[10:05:03] `"Created": "2023-01-17T09:52:51.521815409Z"`
[10:06:08] Same for a few spot-checked non-build images
[10:06:34] super, I closed the related task
[10:07:07] Machine-Learning-Team, Foundational Technology Requests, Prod-Kubernetes, Kubernetes, Patch-For-Review: Import new knative-serving version (k8s 1.23 dependency for ML) - https://phabricator.wikimedia.org/T323793 (elukey) Open→Resolved Docker images published, chart merged!
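(Editor's sketch for the spot check at 10:04-10:05.) Confirming that a pulled image really comes from the fresh build can be reduced to comparing its `Created` timestamp with the time the build script started. The snippet below is a minimal illustration and not something run in the channel: the helper names `parse_docker_created` and `built_after` are hypothetical, and the build start time is taken from the "Running now." message at 09:48:23, assuming the log timestamps are UTC.

```python
from datetime import datetime, timezone


def parse_docker_created(value: str) -> datetime:
    """Parse a Docker "Created" timestamp (nanoseconds, trailing Z) as UTC."""
    stamp = value.rstrip("Z")
    if "." in stamp:
        head, frac = stamp.split(".", 1)
        # datetime only keeps microseconds, so trim or pad to 6 digits.
        stamp = f"{head}.{frac[:6].ljust(6, '0')}"
    return datetime.fromisoformat(stamp).replace(tzinfo=timezone.utc)


def built_after(created_raw: str, build_started: datetime) -> bool:
    """Return True if the image was created at or after the build start."""
    return parse_docker_created(created_raw) >= build_started


# Assumption: the build started at 09:48:23 UTC on 2023-01-17 (from the log).
build_started = datetime(2023, 1, 17, 9, 48, 23, tzinfo=timezone.utc)
print(built_after("2023-01-17T09:52:51.521815409Z", build_started))
```

The same comparison can be repeated for each spot-checked image by feeding in the `Created` value reported for it.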
[10:08:01] klausman: Janis worked on https://phabricator.wikimedia.org/T326340 and IIUC one of the staging clusters is already on 1.23
[10:08:16] he is afk but probably next week we'll be able to upgrade ours
[10:09:16] Ah, that's a neat checklist
[10:09:46] And yeah, Janis is out this week, back next
[11:10:26] elukey: klausman: feel free to add me as a reviewer in all SRE related patches in other repos (as u did in production-images). even if I can't +2 it helps me get the "full picture"
[11:10:41] Ack, will do
[11:11:13] thanks 🤗
[11:21:00] * elukey hates yaml and template
[11:21:03] *templates
[11:23:22] <3
[11:23:43] hang in there. unfortunately yaml makes the world go round and round...
[11:24:07] * isaranto struggling with apple silicon stuff
[11:34:56] elukey: I presume you're aware of yq? https://kislyuk.github.io/yq/
[11:34:58] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kevinbazira) Thank you for letting me know @kostajh. I have started re-running the pipeline for Chechen Wikipedia - cewiki. Will let...
[11:34:59] * elukey lunch!
[11:35:22] never tried yq, will do
[11:35:39] but my frustration is more with yaml templating and hel(l)m
[11:35:41] It sometimes helps with finding whitespace errors, but is of course no panacea
[11:39:42] found this for M1 chip holders https://pyenchant.github.io/pyenchant/install.html#apple-silicon-related-errors
[11:40:32] on yq: I've found it incredibly easy to use compared to other more native stuff
[11:41:25] but you can't rely on it for scripts you want to run anywhere
[11:51:12] yeah, I use jq for scripting sometimes, but never yq
[12:11:17] * klausman lunch and errand
[12:37:17] Machine-Learning-Team, Add-Link, Growth-Scaling, Growth-Team: Establish processes for running the dataset pipeline - https://phabricator.wikimedia.org/T276438 (kostajh) > How often should we re-run the pipeline for an existing dataset? I noticed today that cswiki's dataset was last generated in...
[13:17:24] * isaranto l8 lunch
[14:01:31] Mdk j
[14:01:38] Morning all!
[14:02:13] o/
[14:04:45] \o/
[14:13:11] isaranto: I just saw the msg for kserve's upgrade sorry! If you want to help we can pair on the upgrade of the docker images for model servers
[14:13:24] I have a code review almost ready for the control plane's upgrade
[14:16:40] sure, I want to part the work I'm doing now. shall we start on it tomorrow morning?
[14:16:56] yes yes even later on during the week
[14:17:12] I could try to submit a patch and we could take it from there (?)
[14:17:28] unless you want to do it otherwise e.g. first sync etc
[14:19:50] nono on the contrary, if you have something that works please send it :) I think that there will be some dependency problems with revscoring though
[14:20:08] since kserve's numpy dep got bumped
[14:22:24] isaranto: https://github.com/kserve/kserve/blob/release-0.9/python/kserve/requirements.txt#L12
[14:22:40] https://github.com/elukey/revscoring/blob/master/requirements.txt#L14
[14:22:42] :(
[14:22:58] ofc
[14:23:18] it will allow us to upgrade numpy, scipy etc
[14:23:20] but in theory bumping numpy in revscoring shouldn't be a big issue
[14:23:30] yeah that too
[14:25:49] \o 'ello Chris
[14:26:40] elukey: +1'd your kserve 0.9 change
[14:28:40] I can do the merge + build if you want.
[14:31:00] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kevinbazira) @kostajh, the cewiki pipeline has completed running successfully and I have published the datasets.
[14:34:39] klausman: thanks! there is a codfw outage in progress so let's hold any action for a sec
[14:34:52] Aye, cap'n
[14:35:18] Machine-Learning-Team, Add-Link, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kostajh) >>! In T304550#8531114, @kevinbazira wrote: > @kostajh, the cewiki pipeline has completed running successfully and I have pub...
[15:04:35] Machine-Learning-Team, Add-Link, Growth-Scaling, Growth-Team: Establish processes for running the dataset pipeline - https://phabricator.wikimedia.org/T276438 (MGerlach) @kostajh I agree that we should re-run the pipelines after some time. If possible, updating after 6 months seems reasonable (th...
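(Editor's sketch for the numpy pin mismatch raised at 14:20-14:23.) One quick way to see where two `requirements.txt` files disagree is to parse both and list the packages whose version specifiers differ. This is an illustration under stated assumptions rather than what was run in the channel: the function names are hypothetical, the version numbers in the sample strings are placeholders and NOT the real kserve or revscoring pins, and the parser deliberately ignores extras, environment markers, and option lines.

```python
import re

# Matches "name" optionally followed by a version specifier such as "==1.0".
REQ_RE = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*([<>=!~].*)?$")


def parse_requirements(text: str) -> dict:
    """Map package name -> raw specifier string for simple requirement lines."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = REQ_RE.match(line)
        if match:
            name = match.group(1).lower().replace("_", "-")
            pins[name] = (match.group(2) or "").replace(" ", "")
    return pins


def differing_pins(first: str, second: str) -> dict:
    """Return packages present in both files whose specifiers differ."""
    a, b = parse_requirements(first), parse_requirements(second)
    return {n: (a[n], b[n]) for n in sorted(a.keys() & b.keys()) if a[n] != b[n]}


# Placeholder contents; the real files are the two URLs linked above.
SERVER_REQS = "numpy~=1.22.0\nrequests>=2.0\n"
LIBRARY_REQS = "numpy==1.19.5  # placeholder pin\nscipy==1.5.4\n"
print(differing_pins(SERVER_REQS, LIBRARY_REQS))
```

A differing specifier does not always mean a real conflict (two ranges can still overlap), so the output is a list of candidates to review by hand, not a verdict.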
[15:17:03] (PS14) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:17:15] (CR) CI reject: [V: -1] Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[15:21:09] Machine-Learning-Team, Patch-For-Review: Upgrade python from 3.7 to 3.9 in docker images - https://phabricator.wikimedia.org/T325657 (isarantopoulos) Figured out a way to make the failing models work by monkey patching the `utils.py` of the enchant library https://gerrit.wikimedia.org/r/c/machinelearning...
[15:25:38] Machine-Learning-Team, Patch-For-Review: Upgrade ml clusters to kserve 0.9 - https://phabricator.wikimedia.org/T325528 (elukey) a:elukey
[15:37:47] Machine-Learning-Team: Add documentation about LiftWing to the API Portal - https://phabricator.wikimedia.org/T325759 (calbon) a:klausman
[15:53:21] going to get back my car, will be back online in a bit
[15:55:48] https://githubnext.com/projects/hey-github/
[15:56:27] the example they have is EDA on titanic dataset
[15:57:01] "What could possibly go wrong?" etc
[16:17:53] wow
[16:19:12] (PS15) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[16:19:15] (CR) CI reject: [V: -1] Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[16:26:05] (PS16) Ilias Sarantopoulos: Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[16:27:27] (CR) CI reject: [V: -1] Upgrade the revscoring model server to Python 3.9 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/870517 (https://phabricator.wikimedia.org/T325657) (owner: Elukey)
[17:26:03] * elukey afk!
[20:42:21] Machine-Learning-Team, Add-Link, Growth-Team: Establish process for periodically refreshing link recommendation models - https://phabricator.wikimedia.org/T327212 (kostajh)
[20:42:56] Machine-Learning-Team, Add-Link, Growth-Scaling, Growth-Team: Establish processes for running the dataset pipeline - https://phabricator.wikimedia.org/T276438 (kostajh) >>! In T276438#8531223, @MGerlach wrote: > @kostajh I agree that we should re-run the pipelines after some time. If possible, up...