[16:54:14] hm, deploying eventgate-analytics in eqiad, haven't changed limits, but
[16:54:29] kubectl events show:
[16:54:29] Error creating: pods "eventgate-production-6df4d4dc56-m56w5" is forbidden: exceeded quota: quota-compute-resources, requested: limits.cpu=2500m, used: limits.cpu=90, limited: limits.cpu=90
[16:55:40] and now: MountVolume.SetUp failed for volume "tls-certs-volume" : configmap "eventgate-production-tls-proxy-certs" not found, which...might be part of a rollback attempt? iiuc that configmap has now been removed?
[16:55:44] cc jayme claime ?
[16:57:55] i don't think we've merged any changes to tls proxy cpu limits for eventgate yet, this should just be the chart update
[16:58:03] dunno why limits are being exceeded now?
[16:58:42] could be because you basically increase the number of replicas while deploying
[16:58:53] oh, during deployment, hm
[16:59:01] for rolling
[16:59:01] hm
[16:59:02] the tls-certs thing is clearly a bug
[16:59:18] let me check on that
[16:59:39] yes, during a rolling upgrade you'll have more replicas than usual
[17:00:11] okay, never had this happen before, i did do deployments to this last week
[17:00:46] the mountvolume thing should have happened in staging as well...that's odd
[17:01:05] i already did all eventgate-logging-external clusters, in eqiad and codfw as well, no prob there
[17:01:43] oh, wait
[17:01:47] it did happen in events there
[17:01:53] 17m Warning FailedMount pod/eventgate-production-5ff8d68b49-dbn4w MountVolume.SetUp failed for volume "tls-certs-volume" : secret "eventgate-production-tls-proxy-certs" not found
[17:01:59] but it did not fail the deployment.
[17:02:08] so i didn't check k8s events
[17:04:13] ah, then it might just have been temporary
[17:04:47] yeah, maybe for just this deployment where we use the new certs stuff?
[17:04:54] i'll kill a pod in staging and watch events and see if it happens
[17:05:29] so the *secret* "eventgate-production-tls-proxy-certs" not found - that might be temporary and fix itself
[17:05:36] right
[17:05:42] *configmap* "eventgate-production-tls-proxy-certs" not found - that is an issue
[17:05:47] oh
[17:05:49] because there is no configmap anymore
[17:05:52] right
[17:05:54] it's a secret :)
[17:08:12] but that seemed to have worked well in staging for eventgate-analytics
[17:08:13] did not see that in events when a new pod started in staging
[17:08:27] i did it in eventgate-logging-external staging just now
[17:08:37] so...i guess...temporary?
[17:08:52] but my pressing issue is the failed eventgate-analytics eqiad deployment due to limits
[17:09:26] not a problem, but it looks like....the deployment half succeeded? maybe it tried to roll back and that failed too?
[17:09:39] most pods are new, but many are still from 4d ago
[17:09:48] anyway, that's fine. but how to deploy? :)
[17:09:49] I'll take a look
[17:10:05] k, maybe increase namespace limits in admin_ng somewhere...?
[17:10:09] so eventgate-analytics staging was fine?
[17:10:17] and codfw you did not do yet?
[17:10:20] yes.
[17:10:25] eqiad and codfw both have 30 replicas
[17:10:30] so prob would see it in codfw
[17:10:58] has the helmfile process of yours terminated already?
[17:11:01] yes
[17:11:03] ack
[17:11:12] WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/eventgate-analytics-deploy-eqiad.config
[17:11:12] Error: UPGRADE FAILED: release production failed, and has been rolled back due to atomic being set: timed out waiting for the condition
[17:19:31] ottomata: rollback was fine, nothing bad happened. I think this comes up now because you've increased the time it takes for pods to terminate
[17:21:47] with 30 (+1 canary) replicas, each one with a limit of 2.5 you're already "using" 77.5 of the 90 cpus you're allowed. Rolling update will add ~7.5 replicas (while also terminating around 7.5 right away - but those take time)
[17:23:12] hm, did the timeout changes apply for this deployment though? the timeouts should be on new resources that are deployed?
[17:23:21] but, for next deployments that would make sense.
[17:24:17] i don't think we need a limit of 2000 for this though... hmmm
[17:24:20] I don't understand...you've added the terminationGracePeriodSeconds and preStop hooks...see the helmfile diff
[17:24:41] yes, but won't those apply on the newly created pods when they are shutting down?
[17:24:51] ah, good point :D
[17:25:08] yes, my theory is flawed
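[To make the numbers above concrete: a minimal sketch of the kind of ResourceQuota object producing the "exceeded quota" error, with the quota arithmetic from the discussion as comments. The manifest is illustrative only (the real quota is managed via admin_ng) and the namespace name is an assumption.]

```yaml
# Illustrative only -- not the actual admin_ng-managed manifest.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-compute-resources
  namespace: eventgate-analytics   # assumption: namespace matches the service name
spec:
  hard:
    limits.cpu: "90"
# Quota math from the chat: 31 pods x 2.5 CPU in limits = 77.5 already counted.
# A rolling update surges extra pods (the Deployment default maxSurge is 25%),
# and terminating pods keep counting against the quota until they are fully
# gone, so the overlap can momentarily push past limits.cpu=90.
```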
[17:26:34] ottomata: easiest fix is probably to put more realistic limits
[17:28:07] 2 cpus seems like *way* more than it would actually ever use
[17:28:22] yeah, looking for a historical quantile for a max-ish cpu
[17:28:29] we just have average and top current in the dash
[17:28:42] my histogram prom fu is very poor...
[17:28:57] what is the max you see there?
[17:29:11] max pod current is 213ms
[17:30:17] past me and past eventgate benchmarking maxed around 1.2s
[17:30:42] https://phabricator.wikimedia.org/T220661#5118898
[17:31:00] maybe i'll just set to limit 1000m ?
[17:31:10] not m but yeah :)
[17:31:13] if we need more, we can have more replicas
[17:32:05] eventgate-logging-external sets cpu limits to 1
[17:32:19] i'll just make that the chart default, it isn't overridden anywhere else
[17:32:46] you wrote you measured 1.2s with 26k msg/s - but it does not say how many replicas
[17:33:27] i think i was just doing one......dunno why i would try to bench more than one at a time.
[17:33:29] but so long ago...
[17:33:42] maybe https://phabricator.wikimedia.org/T220661#5106266 indicates otherwise?
[17:34:42] https://phabricator.wikimedia.org/T220661#5116643
[17:34:50] has it at 1800 events per sec, which sounds more reasonable
[17:34:57] 26k for one replica...sounds like a lot
[17:35:03] given all of eventgate-analytics only does ~20k events/s ...yeah :)
[17:35:37] if one instance is able to cope with that you could also lower to 2-3 replicas :-p
[17:35:39] yeah, above comments were benching one replica. that one i linked was prod bench after changing settings
[17:35:49] okay anyway, cpu: 1 should be fine
[17:35:53] that sounds fine
[17:36:43] but please double check if you see increased throttling after updating the limit
[17:37:03] k
[17:37:39] 1.5 should fit the namespace limit as well if you want to play safe
[17:38:25] i think i'd rather go with 1 in case we add more replicas one day
[17:40:45] ack. but be aware that changing the default will change the limit for all eventgate instances
[17:41:06] yes
[17:41:20] eventgate-analytics is busiest, will keep an eye out
[17:44:46] ack
[17:54:14] the configmap "eventgate-production-tls-proxy-certs" not found messages are from the old pods - probably because helm already deleted it (the configmap)
[17:55:03] makes sense
[17:55:43] yargh, deploy is taking a while...maybe cuz of those timeouts? shouldn't be more than like 13 seconds per pod (group?) tho. i have to run! ahhh
[17:55:57] will be back in 1.5 hours...
[17:56:14] it's still hitting that limit ... strange
[17:56:16] i hope i'm not leaving this in a bad state....it seems fine.
[17:56:17] oh?
[17:56:26] oh it is!
[17:56:26] hm
[17:56:33] it will probably roll back in a bit
[17:56:44] (5 or 10 min timeout, I don't recall)
[17:57:01] okay, i'll come back to this later this afternoon, you are probably almost done with your workday. jayme if you have any tips for me to try when I get back leave em here :)
[17:57:12] thank you for your help!
[17:57:16] ttyl
[17:57:20] o/
[18:08:07] I don't see why that should not fit. While quota errors were printed there were 17 pods in the old state (42.5 cpus) plus 7 in the new (10.5 cpus) - but I'm clearly missing something as the error counts limits.cpu=89500m ... 🤔
[18:08:39] unfortunately I gtg. but I can try to figure this out tomorrow my morning
[18:14:49] getting some errors deploying the new rdf-streaming-updater in staging...`User "rdf-streaming-updater-deploy" cannot create resource "flinkdeployments" in API group "flink.apache.org"` . Sounds like RBAC stuff, any ideas?
[18:18:31] inflatador: IIRC the rdf-streaming-updater deploy user has special permissions (a dedicated clusterrole) to allow for the pre-operator stuff to work
[18:18:50] that role might not have the same permissions as the usual deploy user
[18:19:06] check the helmfile_rbac.yaml in deployment-charts
[18:19:08] jayme interesting, I'm looking at admin_ng/helmfile_rbac.yaml ATM
[18:19:59] Cool, let me look a little closer, there are a few namespaces in staging where this actually works
[18:21:26] as said, the rdf user is special. it probably works everywhere else
[18:27:11] Y, my guess is that the flinkdeployment stuff might need to be explicitly added to https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/admin_ng/helmfile_rbac.yaml#L98 ...getting a patch up now
[18:31:15] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/972005 is up to explicitly add those perms
[18:34:00] lgtm - now I'm really off ;)
[18:34:14] jayme once again I am in your debt! Thanks ;P
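[For context, a minimal sketch of the kind of RBAC rule the patch above adds so the rdf-streaming-updater deploy user can manage FlinkDeployment objects. The role name is hypothetical and the actual structure of helmfile_rbac.yaml in deployment-charts may differ.]

```yaml
# Sketch of an extra ClusterRole rule granting access to the flink-kubernetes-operator CRD.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deploy-flinkdeployments   # hypothetical name
rules:
  - apiGroups: ["flink.apache.org"]
    resources: ["flinkdeployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```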
[20:40:11] jayme: for your investigations tomorrow: eventgate-analytics-external in eqiad (24 replicas) exceeded limits too, but eventually succeeded in spawning all new pods.
[20:42:19] same in codfw too: requested: limits.cpu=1500m, used: limits.cpu=89500m, limited: limits.cpu=90
[20:42:29] but eventually helmfile apply succeeded anyway
[20:43:08] proceeding with eventgate-main, it has fewer replicas anyway...
[21:14:51] jayme: and, indeed, throttling is back.
[21:15:54] latency does not seem affected tho.
[21:28:54] just did another eventgate-analytics-external deploy in eqiad, this one went 100% good! no limit quota issues
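[A hedged sketch of the throttling check suggested earlier ("double check if you see increased throttling"), using the standard cAdvisor CFS metrics. The group/record names and the namespace matcher are illustrative, not an existing rule, and label names depend on the Prometheus scrape config.]

```yaml
# Fraction of CFS periods in which each eventgate pod was throttled; a rising
# ratio after the limit change means the new cpu limit of 1 is too tight.
groups:
  - name: eventgate-cpu-throttling   # hypothetical group name
    rules:
      - record: namespace_pod:cpu_throttling:ratio
        expr: |
          sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total{namespace=~"eventgate-.*"}[5m]))
            /
          sum by (namespace, pod) (rate(container_cpu_cfs_periods_total{namespace=~"eventgate-.*"}[5m]))
```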