[08:09:13] hello folks, morning :)
[08:09:26] snapshot1017 has puppet disabled with a generic msg like "outage"
[08:09:59] IIRC something happened last week but afaict we can re-enable puppet right?
[08:23:36] (re-enabling, if it was a problem I would say that we didn't rely on puppet alone with that disable msg)
[08:35:48] elukey: err, I think that's the "dumps break production" issue?
[08:36:02] yep
[08:36:35] I thought puppet was intentionally disabled to stop it re-enabling dumps until we were happy that wouldn't break production?
[08:36:37] there are a couple of systemd timers that are failing, that's it
[08:36:48] where was it announced?
[08:37:19] I've been away, but the relevant tickets are T368289 and T368098
[08:37:19] T368289: Incredible amount of logs from Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks - https://phabricator.wikimedia.org/T368289
[08:37:20] T368098: Dumps generation without prefetch cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[08:38:10] I would suggest checking with e.g. Amir1 before re-enabling
[08:40:04] I have stopped all timers with "dumps" and stopped puppet as well, this time with a meaningful title
[08:40:41] I get that we were all busy but we cannot leave hosts for more than a week without puppet runs and without clear messages
[08:40:53] I'll write in the task as well
[08:42:32] (added a comment)
[09:40:38] for the on-callers: I am deploying Thumbor, Wikifeeds and API/Rest Gateway to pick up new Envoy images (based on Bookworm). Nothing should explode but if you see anything weird, you know who to blame :)
[09:44:18] exciting :D
[09:44:41] I'll be keeping an eye on the restgw/thumbor graphs
[09:50:57] hnowlan: api/rest gateway done, I don't see weird things from the graphs.. Proceeding with Thumbor, brace yourself :D
[09:52:38] looks grand
[09:52:40] mmm I see some rate limits for the eqiad api-gateway
[09:52:43] maybe temporary
[09:53:36] oh yeah, hmm sw.french saw that last reboot
[09:54:05] it goes away after a while - don't worry about it in relation to your change, but I'm gonna try and figure out what's going on
[09:54:13] * elukey nods
[10:06:43] hnowlan: thumbor done!
[10:07:35] looks good, thanks!
[10:07:55] the only things that are somewhat weird are related to haproxy backend timings and average pod queue time
[10:08:01] since it dropped to ~1ms
[10:08:35] like https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=1m&viewPanel=86
[10:09:15] but it goes up and down in the past 30d, so should be ok
[10:09:19] all right, all good!
[10:09:33] yeah that's normal enough
[10:14:45] ahahhaha I like the description
[10:37:30] elukey: I think the plan was to enable it and then another outage happened
[10:38:09] anyway. DE owns dumps, they should decide. I think there is an issue with our dumper + databases that's making it quite slow, causing issues
[10:40:49] Amir1: sure sure, my only point was that we should have action items after an outage to puppetize the status of a host, otherwise it will fall through the cracks (not assigning blame to anybody, it is just a follow-up)
[10:41:03] yeah
[13:28:26] elukey: Amir1: Sorry, I'll hold my hand up to allowing snapshot1017 to have disabled Puppet for too long. I'll check when the dumps might be re-enabled and make a patch to correct it if we're not able to switch it back on today.
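(Editor's aside: the 08:40 cleanup on snapshot1017 boils down to two host-level operations: stop every systemd timer whose unit name contains "dumps", and disable the Puppet agent with an explicit reason instead of a generic "outage". The sketch below is a hypothetical illustration of that pattern, not the tooling actually used; it shells out to plain `systemctl` and `puppet agent --disable`, and the unit-name filter and the reason string are made up for the example, since the log does not name the actual timer units.)

```python
#!/usr/bin/env python3
"""Hypothetical sketch: stop all systemd timers matching "dumps" and
disable the Puppet agent with a meaningful reason (plain systemctl /
puppet agent calls, no Wikimedia-specific wrappers assumed)."""

import subprocess

# Illustrative message; the real one should point at the relevant task.
REASON = "dumps generation disrupts production - see T368098"


def list_timers(substring: str) -> list[str]:
    """Return names of loaded timer units whose name contains `substring`."""
    out = subprocess.run(
        ["systemctl", "list-units", "--type=timer", "--all",
         "--plain", "--no-legend", "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    # The first column of each line is the unit name, e.g. "something.timer".
    units = [line.split()[0] for line in out.splitlines() if line.strip()]
    return [u for u in units if substring in u]


def main() -> None:
    for unit in list_timers("dumps"):
        print(f"stopping {unit}")
        subprocess.run(["systemctl", "stop", unit], check=True)
    # The disable reason is stored in the agent lock file, so the next person
    # who runs puppet on the host sees why it was disabled.
    subprocess.run(["puppet", "agent", "--disable", REASON], check=True)


if __name__ == "__main__":
    main()
```

(Re-enabling would be the mirror image, `puppet agent --enable` plus starting the timers again, ideally captured in a Puppet patch like the one linked at 14:10 so the host's state doesn't drift while the agent is off.)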
[13:31:48] <3
[14:10:33] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052752
[14:11:20] I'm stepping afk for a little while now, but I can merge and deploy later.