[04:55:04] (Abandoned) Ilias Sarantopoulos: fix: set default lift wing url to null [extensions/ORES] - https://gerrit.wikimedia.org/r/937142 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[04:59:26] Good morning ☀️ o/
[05:14:23] I've deployed kserve 0.11.1 for revscoring damaging in staging, looks solid
[07:26:29] o/
[07:26:33] nice!
[07:31:00] (CR) Elukey: [C: +1] revertrisk-la: bump knowledge_integrity version to v0.4.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/962066 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[07:50:01] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/962066 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[07:51:29] good morning :)
[07:57:00] (Merged) jenkins-bot: revertrisk-la: bump knowledge_integrity version to v0.4.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/962066 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[08:16:55] * isaranto afk - commuting
[08:40:26] aiko: o/ by any chance did you check the sha512 of the rr model before uploading?
[08:40:56] we should really establish a process to get the model binary from other teams
[08:46:13] for the moment we can have something simple, like
[08:46:34] 1) few ways to pass the model (gsuite, stat boxes, etc..)
[08:47:17] 2) sha512 of the file stated in Phabricator (requester logs it, ml-ops verifies it and confirms before uploading)
[08:47:29] does it make sense?
[08:54:41] elukey: no, we didn't do it this time. Yes that makes sense!
[08:55:33] aiko: do you mind doing it retroactively and logging in the task how the model was passed around and what the sha is?
[08:55:41] just to start doing it, then we can document
[08:57:04] Machine-Learning-Team, Patch-For-Review, Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (klausman) SLO dashboard now available at: https://grafana-rw.wikimedia.org/d/slo-Lift_Wing_Readability/lift-wing-readab...
[08:57:11] https://grafana-rw.wikimedia.org/d/slo-Lift_Wing_Readability/lift-wing-readability-slo-s?orgId=1 <- Readability SLO dashboard now available
[08:57:24] elukey: ok!! no problem
[08:57:49] klausman: nice!
[09:22:50] Machine-Learning-Team, serviceops: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (SCherukuwada) @Seddon Could you please post an update here and link to relevant tickets?
[09:33:26] klausman: since you are now well up-to-date with pybal, do you have time to prioritize the rec-api-ng's VIP?
[09:33:30] :D
[09:33:50] we are still seeing severe perf issues but at least we'll finish the work on that side
[09:33:53] does it make sense?
[09:40:28] elukey: does the requester need to state the model location or url in the Phab task? or only the sha512 of the file?
[09:45:51] aiko: I think that we can state generally the location (without URL) and the sha512 of the file in the Phab task. Then we double check and verify
[09:46:12] basically it is a way to have a sort-of "trace" (for us and for the community) about what binary we handle/serve
[09:46:45] the sha512 is the most important bit since we'll make sure that the file wasn't tampere
[09:46:48] *tampered
[09:47:22] we should also probably add the sha512 files to our public file directory too
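A minimal sketch of the checksum handoff described above, assuming the binary is passed via a stat box; the file name is illustrative, not from the log:

```bash
# Requester side (e.g. on a stat100x host): compute the checksum and paste it
# into the Phabricator task, together with where the file was put.
sha512sum revertrisk_model.pkl | tee revertrisk_model.pkl.sha512

# ML-SRE side: after copying the binary, verify it against the checksum logged
# in the task before uploading it (and its .sha512) to the public model dir.
sha512sum -c revertrisk_model.pkl.sha512
```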
[09:50:10] got it, that makes sense.
[09:51:51] +1 also add the checksum to the public model directory
[09:52:18] https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_stages_for_a_model_server_on_Lift_Wing
[09:52:22] I added a step to the list
[09:55:48] aiko: I don't recall the link of the public dir for the models, do you have it handy to paste?
[09:55:48] thanks luca!
[09:56:17] elukey: https://analytics.wikimedia.org/published/wmf-ml-models/
[09:56:36] ahhh
[09:59:31] Machine-Learning-Team: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (elukey)
[09:59:34] aiko: opened --^
[09:59:52] one interesting thing to work on is also to verify that only us (ML) can publish those files
[10:00:08] I am not 100% sure if this is the case, or if we can do it with basic perms
[10:00:45] we trust users on stat100x but we should nonetheless have some fence in place for file tampering (malicious or accidental)
[10:01:07] it just came to mind
[10:01:08] (PS5) Ilias Sarantopoulos: ores-legacy: return 400 on callback requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961980 (https://phabricator.wikimedia.org/T347663)
[10:06:31] yeah that's a good point, we should do it
[10:12:18] * aiko lunch!
[10:33:39] same
[10:37:19] klausman: when you are back lemme know if what I wrote above is ok (re: rec-api-ng VIP)
[10:38:28] yes, will do the VIP stuff for rec-api-ng this week
[10:38:46] (if my arms get better, got a doc appt for that on Wed)
[10:40:48] sure sure! thanks :)
[10:40:55] I can take care of it if you wnat
[10:40:56] *want
[10:49:19] Let me see if I can do it (or make substantial progress on it) today/tomorrow.
[10:49:45] Going for lunch, will deploy kserve to revscoring prod afterwards
[10:52:23] isaranto: can we do a quick load test in staging before proceeding?
[10:52:29] if you haven't done it
[10:52:37] just to make sure that we are not hitting a major regression
[10:52:57] Yep!
[10:52:59] we have active clients now and I am wondering if we should add basic checks before moving to prod
[10:53:08] I'll do it
[10:53:17] <3
[10:54:38] * elukey lunhc!
[10:54:42] * elukey lunch! :)
[11:08:28] Machine-Learning-Team, Research: Expand language support for Revert Risk Model - https://phabricator.wikimedia.org/T347330 (MunizaA)
[11:15:45] Machine-Learning-Team: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (klausman) Do you think it would be useful to also keep the checksums in a different place, with permissions independent of the backing store behind the published/ directory?
[12:05:59] Good morning all
[12:08:57] I did a load test in the staging for the new RRLA, and it worked well (tested performance and the problematic rev ids)
[12:09:16] going to deploy to prod
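A quick staging smoke test of the kind aiko mentions above might look like the sketch below; the staging host, the Host header, and the payload fields follow the usual Lift Wing conventions but are assumptions here, and the rev ids are placeholders rather than the actual problematic ones:

```bash
# Hypothetical smoke test for revertrisk-language-agnostic (RRLA) in staging.
# Endpoint, namespace in the Host header, and payload shape are assumptions.
STAGING="https://inference-staging.svc.codfw.wmnet:30443"
MODEL="revertrisk-language-agnostic"

for rev_id in 12345 67890; do   # placeholder rev ids
  curl -s -X POST "${STAGING}/v1/models/${MODEL}:predict" \
    -H "Host: ${MODEL}.revertrisk.wikimedia.org" \
    -H "Content-Type: application/json" \
    -d "{\"lang\": \"en\", \"rev_id\": ${rev_id}}"
  echo
done
```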
[12:09:40] chrisalbon: o/ morning :)
[12:21:27] morning Chris!
[12:44:22] aiko: nice!
[12:46:39] Machine-Learning-Team: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (elukey) >>! In T347838#9215035, @klausman wrote: > Do you think it would be useful to also keep the checksums in a different place, with permissions independent of the backin...
[13:11:02] Machine-Learning-Team, Research: Expand language support for Revert Risk Model - https://phabricator.wikimedia.org/T347330 (achou) Thanks @MunizaA for adding the sha512 checksum for the new model binary in the task description. I have verified it and confirmed the integrity of the file that we uploaded t...
[13:23:05] Machine-Learning-Team, Research: Expand language support for Revert Risk Model - https://phabricator.wikimedia.org/T347330 (elukey) @achou @MunizaA thanks a lot! One nit - the paste outlined in the task's description is editable, so in theory anybody can tamper with it (everything is logged but it may be...
[13:28:55] is this cool? https://github.com/alibaba/feathub
[13:35:58] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) run a load-test on the deployed enwiki-goodfaith in staging before and after the upgrade and the results are almost the same **with kserve 0.10** ` isarant...
[13:39:17] ottomata: o/ :) looks nice, first time I hear about it
[13:39:25] ah, it looks quite fresh
[13:41:36] ottomata: o/
[13:59:31] proceeding with deploying revscoring to prod
[13:59:59] elukey: <3 o/
[14:03:19] ottomata: looks interesting :D
[14:06:57] * elukey afk for some errands!
[14:08:58] Machine-Learning-Team, Research: Expand language support for Revert Risk Model - https://phabricator.wikimedia.org/T347330 (achou) Adding the sha512 here: ` 94ff70cbfac87565b5e04480acd7accd7d0c1f424ebfc2cb858338bc62c309b3745220223489498254010fb772266abd5498e2671eb1cddc138717e663bd3922 *revert_risk_langu...
[14:16:17] Machine-Learning-Team, Research: Review Revert Risk reports from WME - https://phabricator.wikimedia.org/T347136 (achou) Hi @prabhat, the language support issue should be resolved in T347330. We have deployed the new Revert Risk language agnostic model to production.
[14:24:27] deployments seem to be so slow.. When I check the revisions it seems they are there (it says "Deploying") but it takes a lot of time until model servers are actually deployed
[14:52:31] isaranto: in all ns or in the biggest ones?
[14:53:45] elukey: Was struggling a bit with articlequality on codfw
[14:54:23] Mostly the biggest ones, I'll do all the deployments and let you know
[14:56:00] isaranto: it happens, we have soooo many isvs :(
[14:56:07] *isvcs
[15:00:34] Ack
[15:06:09] the other thing that I am thinking is related to the big namespaces and their allowed max cpu/memory for all pods
[15:06:39] it forces a certain slow churn in what pods are deployed, since only some of them can be created at the same time
[15:06:53] we could try to increase the limits
[15:07:07] but it is kinda nice to have these deployments scoped
[15:44:31] I wish we had something like disruption budgets. E.g. "When deploying, use as many resources as you like, but you may not slow or preempt more than X other pods"
[15:52:14] there should be something like that, but all isvcs are a separate Deployment IIRC, so it is difficult to coordinate
[15:56:44] yeah, that's what I figured.
[16:20:49] hmm some deployments have failed. There was a failure trying to create new revisions and when I do describe the revision I see it is because of the quotas:
[16:20:49] `kubectl describe revision ukwiki-articlequality-predictor-default-00013`
[16:21:27] getting `FailedCreate` with message `pods "ukwiki-articlequality-predictor-default-00013-deployment-7clzdg" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=4, used: limits.cpu=88, limited: limi`
[16:21:43] namespace: revscoring-articlequality in eqiad
[16:22:44] ah!
[16:23:42] they are all running now though
[16:31:33] they were all running anyway. The services didn't fail, the new deployments did. If you see half of the pods started 11d ago
[16:39:59] silly me yes
[16:40:14] so we run 22 isvcs in there
[16:40:43] because of the scaleup
[16:41:44] will continue with the rest tomorrow morning to check
[16:41:54] going afk folks!
[16:42:47] and the request for each isvc is 1 CPU and 2 GBi of ram, but this is only the kserve container
[16:43:17] we have ~0.5 millicores for the other containers, and ~0.6GBi of ram
[16:44:20] ah no wait more, the queue-proxy requires 1 CPU
[16:44:21] mmmm
[16:44:37] ok so now it makes sense, for each pod it is a little more than 2 CPUs
[16:44:45] we have 90 total, hence the limit
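The arithmetic elukey walks through lines up with the FailedCreate above; a short reconstruction using only the numbers in the log (the per-pod figure of 4 comes from the quota error message):

```bash
# Each isvc pod in revscoring-articlequality counts limits.cpu=4 against the
# namespace quota (per the "requested: limits.cpu=4" in the error above).
echo $((22 * 4))   # 22 running isvcs * 4 = 88, matching "used: limits.cpu=88"
# The namespace quota is 90 CPUs, so the extra pod a rollout briefly needs
# (old and new revision overlap while the new one starts) would push usage to
# 92 and is rejected with FailedCreate.
```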
[17:16:21] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (Sgs) I've checked the enabled wikis and all present a fair amount of results except for: - //xalwiki// returns 5...
[17:16:38] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (Sgs)
[17:19:06] isaranto: should be fixed with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/962651
[17:19:13] it will be applied in a bit
[17:19:29] * elukey afk!
[18:15:49] night elukey!
[19:46:57] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (kevinbazira) >>! In T308139#9216703, @Sgs wrote: > I've checked the enabled wikis and all present a fair amount of...
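Once the deployment-charts change linked above is applied (presumably raising the namespace limits), the new quota can be compared against actual usage; the namespace and quota object names below are taken from the earlier error message:

```bash
# Show hard limits vs. current usage for the articlequality namespace in eqiad.
kubectl describe resourcequota quota-compute-resources -n revscoring-articlequality
# Or the compact view:
kubectl get resourcequota -n revscoring-articlequality
```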