[09:03:30] ottomata: hmpf...I've just deployed eventgate-analytics without an issue
[09:03:40] (to eqiad)
[09:26:03] ottomata: it seems that deploying with "atomic: true" (the default) causes the mount failure for the configmap which apparently prevents the old pods from being deleted (I have not seen this with any other certmanager migration deployment yet)
[09:27:09] with "atomic: false" (e.g. no rollback on failure) that does not seem to be a problem...
[09:27:55] or it has nothing to do with the atomic flag and it's solely based on luck (e.g. the time at which the configmap is deleted from the API)
[09:28:03] that would at least make some sense...
[09:32:33] anyways - eventgate-analytics is deployed now
[09:35:10] I do see a bit of throttling there now as well (<4ms) but that also does not seem to affect latency
[13:39:43] jayme: thanks! you didn't see any quota warnings in k8s events when you deployed when you were lucky? I ask cuz I was able to be lucky sometimes too, but I think i always saw at least a few of those?
[13:40:13] re throttling... i don't like it! we just got rid of it all. if you think the reason for the quota being exceeded was not related to our CPU limits settings, maybe I can revert them?
[13:40:35] ottomata: I saw some as well IIRC
[13:40:49] the throttling is def. due to the decreased limits
[13:41:25] you could try to bump to 1.5 cpu's and see how that goes I'd say
[13:42:18] okay
[13:42:20] will do ty
[13:42:39] so, is eventgate-analytics fully deployed now? I never got lucky enough to push it all through
[14:56:48] ottomata: sorry - yes
[15:01:06] so is eventstreams
[15:17:48] is there a way to `helmfile apply` only a specific release? rdf-streaming-updater has 2 releases, commons and wikidata
[15:24:49] hmm, looks like `helmfile -e staging --selector release=wikidata` should work, but it says no releases match...still working on it
[15:28:26] jayme: ty
[15:28:31] inflatador: from what I see it only has one, "main"
[15:28:43] inflatador: --selector name=
[15:29:12] e.g. https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Targeting_a_specific_release_with_Helmfile_(e.g._canary)
[15:29:13] :)
[15:29:36] thanks all...FWiW I'm using the patchset here, so there are multiple releases https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967229/
[15:30:20] that worked, thanks for the link ottomata
[15:30:38] I had to use 'name=commons' instead of 'release=commons'
[15:30:41] tw!
[15:30:42] yw
[15:33:22] make sure to destroy the main release before merging that one inflatador
[15:34:11] jayme ACK. I'm making progress but not quite there yet ;)
[16:46:16] jayme: still seeing Error creating: pods "eventgate-production-5ff74bcc75-wzkxd" is forbidden: exceeded quota in eventgate-analytics, but deployment did succeed
[16:47:34] ottomata: yeah, it might throw the error temporarily and then continue when the next old pod has been terminated
[16:48:25] but ultimately we should probably bump the limits for the namespace I think. Could you open a task asking for more?
[16:49:44] k...but. if we saw these with limits: 1000m, are we sure this is related to namespace limits?
[16:50:50] jayme: alternatively, would using maxSurge help? https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge
[16:51:18] iiuc default would have it spinning up +7 pods at a time
[16:51:39] so 37 concurrent running replicas during deployments?
[16:52:15] doing eventgate-analytics in codfw now... seems like deployment is going to fail this time (not getting lucky)
[16:56:33] even more than 37 maybe as main and canary run in parallel
[16:57:25] ah ya
[16:57:32] yeah, failed in codfw, was unlucky :
[16:57:33] :)
[16:58:17] but yes, you could decrease maxSurge or increase maxUnavailable
[17:06:33] task: https://phabricator.wikimedia.org/T350707
[17:07:02] in eqiad where deployment was successful, throttling is gone.
[17:08:32] cool. I can take care of the task tomorrow - hope that's soon enough
[22:39:29] MR to add more RAM to staging rdf-streaming-updater quota if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/972483
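
Note on the maxSurge question from 16:50-16:58: a rough sketch of how the rollout bounds work out, in case it helps size the quota request. It assumes 30 replicas (inferred from the "37" figure above, not stated in the log) and the Kubernetes defaults of 25% for both maxSurge and maxUnavailable. Per the Kubernetes Deployment docs, a percentage maxSurge rounds up and a percentage maxUnavailable rounds down, so the exact surge can differ by one from a quick estimate. The function names are made up for illustration.

```python
def resolve(value, replicas, round_up):
    """Turn an int or a percentage string like "25%" into an absolute pod count."""
    if isinstance(value, str) and value.endswith("%"):
        pct = int(value[:-1])
        if round_up:
            # Integer ceiling of replicas * pct / 100 (maxSurge rounds up).
            return (replicas * pct + 99) // 100
        # Integer floor (maxUnavailable rounds down).
        return (replicas * pct) // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the peak pod count and minimum available pods during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {"max_pods": replicas + surge, "min_available": replicas - unavailable}


# Assumed replica count of 30; defaults for both settings.
print(rollout_bounds(30))               # {'max_pods': 38, 'min_available': 23}
# Hypothetical lower surge, as suggested at 16:58:17.
print(rollout_bounds(30, max_surge=2))  # {'max_pods': 32, 'min_available': 23}
```

This only covers a single release; as noted at 16:56:33, main and canary roll in parallel, so the namespace-wide peak would be the sum over both.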