[08:09:13] hello folks, morning :)
[08:09:26] snapshot1017 has puppet disabled with a generic msg like "outage"
[08:09:59] IIRC something happened last week but afaict we can re-enable puppet right?
[08:23:36] (re-enabling, if it was a problem I would say that we didn't rely on puppet alone with that disable msg)
[08:35:48] elukey: err, I think that's the "dumps break production" issue?
[08:36:02] yep
[08:36:35] I thought puppet was intentionally disabled to stop it re-enabling dumps until we were happy that wouldn't break production?
[08:36:37] there are a couple of systemd timers that are failing, that's it
[08:36:48] where was it announced?
[08:37:19] I've been away, but the relevant tickets are T368289 and T368098
[08:37:19] T368289: Incredible amount of logs from Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks - https://phabricator.wikimedia.org/T368289
[08:37:20] T368098: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[08:38:10] I would suggest checking with e.g. Amir1 before re-enabling
[08:40:04] I have stopped all timers with "dumps" and stopped puppet as well, this time with a meaningful title
[08:40:41] I get that we were all busy but we cannot leave hosts for more than a week without puppet runs and without clear messages
[08:40:53] I'll write in the task as well
[08:42:32] (added a comment)
[09:40:38] for the on-callers: I am deploying Thumbor, Wikifeeds and API/Rest Gateway to pick up new Envoy images (based on Bookworm). Nothing should explode but if you see anything weird, you know who to blame :)
[09:44:18] exciting :D
[09:44:41] I'll be keeping an eye on the restgw/thumbor graphs
[09:50:57] hnowlan: api/rest gateway done, I don't see weird things from the graphs.. Proceeding with Thumbor, brace yourself :D
[09:52:38] looks grand
[09:52:40] mmm I see some rate limits for the eqiad api-gateway
[09:52:43] maybe temporary
[09:53:36] oh yeah, hmm sw.french saw that last reboot
[09:54:05] it goes away after a while - don't worry about it in relation to your change, but I'm gonna try and figure out what's going on
[09:54:13] * elukey nods
[10:06:43] hnowlan: thumbor done!
[10:07:35] looks good, thanks!
[10:07:55] the only things that are somewhat weird are related to haproxy backend timings and average pod queue time
[10:08:01] since it dropped to ~1ms
[10:08:35] like https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=1m&viewPanel=86
[10:09:15] but it goes up and down in the past 30d, so should be ok
[10:09:19] all right, all good!
[10:09:33] yeah that's normal enough
[10:14:45] ahahhaha I like the description
[10:37:30] elukey: I think the plan was to enable it and then another outage happened
[10:38:09] anyway. DE owns dumps, they should decide. I think there is an issue with our dumper + databases that's making it quite slow, causing issues
[10:40:49] Amir1: sure sure, my only point was that we should have action items after an outage to puppetize the status of a host, otherwise it will fall through the cracks (not assigning blame to anybody, it is just a follow-up)
[10:41:03] yeah
[13:28:26] elukey: Amir1: Sorry, I'll hold my hand up to allowing snapshot1017 to have disabled Puppet for too long. I'll check when the dumps might be re-enabled and make a patch to correct it if we're not able to switch it back on today.
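(Editor's aside: the 08:40 cleanup on snapshot1017 boils down to two host-level operations: stop every systemd timer whose unit name contains "dumps", and disable the Puppet agent with an explicit reason instead of a generic "outage". The sketch below is a hypothetical illustration of that pattern, not the tooling actually used; it shells out to plain `systemctl` and `puppet agent --disable`, and the unit-name filter and the reason string are made up for the example, since the log does not name the actual timer units.)

```python
#!/usr/bin/env python3
"""Hypothetical sketch: stop all systemd timers matching "dumps" and
disable the Puppet agent with a meaningful reason (plain systemctl /
puppet agent calls, no Wikimedia-specific wrappers assumed)."""

import subprocess

# Illustrative message; the real one should point at the relevant task.
REASON = "dumps generation disrupts production - see T368098"


def list_timers(substring: str) -> list[str]:
    """Return names of loaded timer units whose name contains `substring`."""
    out = subprocess.run(
        ["systemctl", "list-units", "--type=timer", "--all",
         "--plain", "--no-legend", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The first column of each line is the unit name, e.g. "something.timer".
    units = [line.split()[0] for line in out.splitlines() if line.strip()]
    return [u for u in units if substring in u]


def main() -> None:
    for unit in list_timers("dumps"):
        print(f"stopping {unit}")
        subprocess.run(["systemctl", "stop", unit], check=True)
    # The disable reason is stored in the agent lock file, so the next person
    # who runs puppet on the host sees why it was disabled.
    subprocess.run(["puppet", "agent", "--disable", REASON], check=True)


if __name__ == "__main__":
    main()
```

(Re-enabling would be the mirror image, `puppet agent --enable` plus starting the timers again, ideally captured in a Puppet patch like the one linked at 14:10 so the host's state doesn't drift while the agent is off.)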
[13:31:48] <3
[14:10:33] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052752
[14:11:20] I'm stepping afk for a little while now, but I can merge and deploy later.