[04:55:04] (Abandoned) Ilias Sarantopoulos: fix: set default lift wing url to null [extensions/ORES] - https://gerrit.wikimedia.org/r/937142 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[04:59:26] Good morning ☀️ o/
[05:14:23] I've deployed kserve 0.11.1 for revscoring damaging in staging, looks solid
[07:26:29] o/
[07:26:33] nice!
[07:31:00] (CR) Elukey: [C: +1] revertrisk-la: bump knowledge_integrity version to v0.4.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/962066 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[07:50:01] (CR) AikoChou: [C: +2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/962066 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[07:51:29] good morning :)
[07:57:00] (Merged) jenkins-bot: revertrisk-la: bump knowledge_integrity version to v0.4.0 [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/962066 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[08:16:55] * isaranto afk - commuting
[08:40:26] aiko: o/ by any chance did you check the sha512 of the rr model before uploading?
[08:40:56] we should really establish a process to get the model binary from other teams
[08:46:13] for the moment we can have something simple, like
[08:46:34] 1) few ways to pass the model (gsuite, stat boxes, etc..)
[08:47:17] 2) sha512 of the file stated in Phabricator (requester logs it, ml-ops verifies it and confirms before uploading)
[08:47:29] does it make sense?
[08:54:41] elukey: no, we didn't do it this time. Yes that makes sense!
[08:55:33] aiko: do you mind doing it retroactively and logging in the task how the model was passed around and what the sha is?
[08:55:41] just to start doing it, then we can document
[08:57:04] Machine-Learning-Team, Patch-For-Review, Research (FY2023-24-Research-July-September): Deploy multilingual readability model to LiftWing - https://phabricator.wikimedia.org/T334182 (klausman) SLO dashboard now available at: https://grafana-rw.wikimedia.org/d/slo-Lift_Wing_Readability/lift-wing-readab...
[08:57:11] https://grafana-rw.wikimedia.org/d/slo-Lift_Wing_Readability/lift-wing-readability-slo-s?orgId=1 <- Readability SLO dashboard now available
[08:57:24] elukey: ok!! no problem
[08:57:49] klausman: nice!
[09:22:50] Machine-Learning-Team, serviceops: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (SCherukuwada) @Seddon Could you please post an update here and link to relevant tickets?
[09:33:26] klausman: since you are now well up-to-date with pybal, do you have time to prioritize the rec-api-ng's VIP?
[09:33:30] :D
[09:33:50] we are still seeing severe perf issues but at least we'll finish the work on that side
[09:33:53] does it make sense?
[09:40:28] elukey: does the requester need to state the model location or url in the Phab task? or only the sha512 of the file?
[09:45:51] aiko: I think that we can state generally the location (without URL) and the sha512 of the file in the Phab task. Then we double check and verify
[09:46:12] basically it is a way to have a sort-of "trace" (for us and for the community) about what binary we handle/serve
[09:46:45] the sha512 is the most important bit since we'll make sure that the file wasn't tampere
[09:46:48] *tampered
[09:47:22] we should also probably add the sha512 files to our public file directory too
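A minimal sketch of the checksum handoff described above, assuming the binary is passed via a stat box; the file name is illustrative, not from the log:

```bash
# Requester side (e.g. on a stat100x host): compute the checksum and paste it
# into the Phabricator task, together with where the file was put.
sha512sum revertrisk_model.pkl | tee revertrisk_model.pkl.sha512

# ML-SRE side: after copying the binary, verify it against the checksum logged
# in the task before uploading it (and its .sha512) to the public model dir.
sha512sum -c revertrisk_model.pkl.sha512
```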
[09:50:10] got it, that makes sense.
[09:51:51] +1 also add the checksum to the public model directory
[09:52:18] https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_stages_for_a_model_server_on_Lift_Wing
[09:52:22] I added a step to the list
[09:55:48] aiko: I don't recall the link of the public dir for the models, do you have it handy to paste?
[09:55:48] thanks luca!
[09:56:17] elukey: https://analytics.wikimedia.org/published/wmf-ml-models/
[09:56:36] ahhh
[09:59:31] Machine-Learning-Team: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (elukey)
[09:59:34] aiko: opened --^
[09:59:52] one interesting thing to work on is also to verify that only us (ML) can publish those files
[10:00:08] I am not 100% sure if this is the case, or if we can do it with basic perms
[10:00:45] we trust users on stat100x but we should nonetheless have some fence in place for file tampering (malicious or accidental)
[10:01:07] it just came to mind
[10:01:08] (PS5) Ilias Sarantopoulos: ores-legacy: return 400 on callback requests [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/961980 (https://phabricator.wikimedia.org/T347663)
[10:06:31] yeah that's a good point, we should do it
[10:12:18] * aiko lunch!
[10:33:39] same
[10:37:19] klausman: when you are back lemme know if what I wrote above is ok (re: rec-api-ng VIP)
[10:38:28] yes, will do the VIP stuff for rec-api-ng this week
[10:38:46] (if my arms get better, got a doc appt for that on Wed)
[10:40:48] sure sure! thanks :)
[10:40:55] I can take care of it if you wnat
[10:40:56] *want
[10:49:19] Let me see if I can do it (or make substantial progress on it) today/tomorrow.
[10:49:45] Going for lunch, will deploy kserve to revscoring prod afterwards
[10:52:23] isaranto: can we do a quick load test in staging before proceeding?
[10:52:29] if you haven't done it
[10:52:37] just to make sure that we are not hitting a major regression
[10:52:57] Yep!
[10:52:59] we have active clients now and I am wondering if we should add basic checks before moving to prod
[10:53:08] I'll do it
[10:53:17] <3
[10:54:38] * elukey lunhc!
[10:54:42] * elukey lunch! :)
[11:08:28] Machine-Learning-Team, Research: Expand language support for Revert Risk Model - https://phabricator.wikimedia.org/T347330 (MunizaA)
[11:15:45] Machine-Learning-Team: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (klausman) Do you think it would be useful to also keep the checksums in a different place, with permissions independent of the backing store behind the published/ directory?
[12:05:59] Good morning all
[12:08:57] I did a load test in the staging for the new RRLA, and it worked well (tested performance and the problematic rev ids)
[12:09:16] going to deploy to prod
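A quick staging smoke test of the kind aiko mentions above might look like the sketch below; the staging host, the Host header, and the payload fields follow the usual Lift Wing conventions but are assumptions here, and the rev ids are placeholders rather than the actual problematic ones:

```bash
# Hypothetical smoke test for revertrisk-language-agnostic (RRLA) in staging.
# Endpoint, namespace in the Host header, and payload shape are assumptions.
STAGING="https://inference-staging.svc.codfw.wmnet:30443"
MODEL="revertrisk-language-agnostic"

for rev_id in 12345 67890; do   # placeholder rev ids
  curl -s -X POST "${STAGING}/v1/models/${MODEL}:predict" \
    -H "Host: ${MODEL}.revertrisk.wikimedia.org" \
    -H "Content-Type: application/json" \
    -d "{\"lang\": \"en\", \"rev_id\": ${rev_id}}"
  echo
done
```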
[12:09:40] chrisalbon: o/ morning :)
[12:21:27] morning Chris!
[12:44:22] aiko: nice!
[12:46:39] Machine-Learning-Team: Add sha512 checksum files to all the ML's models in the public dir - https://phabricator.wikimedia.org/T347838 (elukey) >>! In T347838#9215035, @klausman wrote: > Do you think it would be useful to also keep the checksums in a different place, with permissions independent of the backin...
[13:11:02] Machine-Learning-Team, Research: Expand language support for Revert Risk Model - https://phabricator.wikimedia.org/T347330 (achou) Thanks @MunizaA for adding the sha512 checksum for the new model binary in the task description. I have verified it and confirmed the integrity of the file that we uploaded t...
[13:23:05] Machine-Learning-Team, Research: Expand language support for Revert Risk Model - https://phabricator.wikimedia.org/T347330 (elukey) @achou @MunizaA thanks a lot! One nit - the paste outlined in the task's description is editable, so in theory anybody can tamper with it (everything is logged but it may be...
[13:28:55] is this cool? https://github.com/alibaba/feathub
[13:35:58] Machine-Learning-Team, Patch-For-Review: Upgrade revscoring Docker images to KServe 0.11 - https://phabricator.wikimedia.org/T346446 (isarantopoulos) run a load-test on the deployed enwiki-goodfaith in staging before and after the upgrade and the results are almost the same **with kserve 0.10** ` isarant...
[13:39:17] ottomata: o/ :) looks nice, first time I hear about it
[13:39:25] ah, it looks quite fresh
[13:41:36] ottomata: o/
[13:59:31] proceeding with deploying revscoring to prod
[13:59:59] elukey: <3 o/
[14:03:19] ottomata: looks interesting :D
[14:06:57] * elukey afk for some errands!
[14:08:58] Machine-Learning-Team, Research: Expand language support for Revert Risk Model - https://phabricator.wikimedia.org/T347330 (achou) Adding the sha512 here: ` 94ff70cbfac87565b5e04480acd7accd7d0c1f424ebfc2cb858338bc62c309b3745220223489498254010fb772266abd5498e2671eb1cddc138717e663bd3922 *revert_risk_langu...
[14:16:17] Machine-Learning-Team, Research: Review Revert Risk reports from WME - https://phabricator.wikimedia.org/T347136 (achou) Hi @prabhat, the language support issue should be resolved in T347330. We have deployed the new Revert Risk language agnostic model to production.
[14:24:27] deployments seem to be so slow.. When I check the revisions it seems they are there (it says "Deploying") but it takes a lot of time until model servers are actually deployed
[14:52:31] isaranto: in all ns or in the biggest ones?
[14:53:45] elukey: Was struggling a bit with articlequality on codfw
[14:54:23] Mostly the biggest ones, I'll do all the deployments and let you know
[14:56:00] isaranto: it happens, we have soooo many isvs :(
[14:56:07] *isvcs
[15:00:34] Ack
[15:06:09] the other thing that I am thinking is related to the big namespaces and their allowed max cpu/memory for all pods
[15:06:39] it forces a certain slow churn in what pods are deployed, since only some of them can be created at the same time
[15:06:53] we could try to increase the limits
[15:07:07] but it is kinda nice to have these deployments scoped
[15:44:31] I wish we had something like disruption budgets. E.g. "When deploying, use as many resources as you like, but you may not slow or preempt more than X other pods"
[15:52:14] there should be something like that, but all isvcs are a separate Deployment IIRC, so it is difficult to coordinate
[15:56:44] yeah, that's what I figured.
[16:20:49] hmm some deployments have failed. There was a failure trying to create new revisions and when I do describe the revision I see it is because of the quotas:
[16:20:49] `kubectl describe revision ukwiki-articlequality-predictor-default-00013`
[16:21:27] getting `FailedCreate` with message `pods "ukwiki-articlequality-predictor-default-00013-deployment-7clzdg" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=4, used: limits.cpu=88, limited: limi`
[16:21:43] namespace: revscoring-articlequality in eqiad
[16:22:44] ah!
[16:23:42] they are all running now though
[16:31:33] they were all running anyway. The services didn't fail, the new deployments did. If you see half of the pods started 11d ago
[16:39:59] silly me yes
[16:40:14] so we run 22 isvcs in there
[16:40:43] because of the scaleup
[16:41:44] will continue with the rest tomorrow morning to check
[16:41:54] going afk folks!
[16:42:47] and the request for each isvc is 1 CPU and 2 GBi of ram, but this is only the kserve container
[16:43:17] we have ~0.5 millicores for the other containers, and ~0.6GBi of ram
[16:44:20] ah no wait more, the queue-proxy requires 1 CPU
[16:44:21] mmmm
[16:44:37] ok so now it makes sense, for each pod it is a little more than 2 CPUs
[16:44:45] we have 90 total, hence the limit
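The arithmetic elukey walks through lines up with the FailedCreate above; a short reconstruction using only the numbers in the log (the per-pod figure of 4 comes from the quota error message):

```bash
# Each isvc pod in revscoring-articlequality counts limits.cpu=4 against the
# namespace quota (per the "requested: limits.cpu=4" in the error above).
echo $((22 * 4))   # 22 running isvcs * 4 = 88, matching "used: limits.cpu=88"
# The namespace quota is 90 CPUs, so the extra pod a rollout briefly needs
# (old and new revision overlap while the new one starts) would push usage to
# 92 and is rejected with FailedCreate.
```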
[17:16:21] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (Sgs) I've checked the enabled wikis and all present a fair amount of results except for: - //xalwiki// returns 5...
[17:16:38] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (Sgs)
[17:19:06] isaranto: should be fixed with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/962651
[17:19:13] it will be applied in a bit
[17:19:29] * elukey afk!
[18:15:49] night elukey!
[19:46:57] Machine-Learning-Team, Add-Link, Chinese-Sites, Growth-Team (Current Sprint), User-notice: Deploy "add a link" to 14th round of wikis - https://phabricator.wikimedia.org/T308139 (kevinbazira) >>! In T308139#9216703, @Sgs wrote: > I've checked the enabled wikis and all present a fair amount of...
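Once the deployment-charts change linked above is applied (presumably raising the namespace limits), the new quota can be compared against actual usage; the namespace and quota object names below are taken from the earlier error message:

```bash
# Show hard limits vs. current usage for the articlequality namespace in eqiad.
kubectl describe resourcequota quota-compute-resources -n revscoring-articlequality
# Or the compact view:
kubectl get resourcequota -n revscoring-articlequality
```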