[07:48:58] it seems like puppet 5 infra might have some issue?
[07:48:59] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: undefined method `content' for nil:NilClass
[07:49:35] I am also now getting this error when pushing to gerrit
[07:49:40] subsystem request failed on channel 0
[07:49:40] scp: Connection closed
[07:50:01] but only for the puppet repo :-/
[07:54:26] it just worked
[07:54:28] weird
[07:56:08] so the puppet 5 issue seems to be specific to puppetmaster1002
[08:33:29] we've had a _lot_ of PuppetZeroResources alerts overnight about db hosts
[08:43:25] Emperor: that was the same thing I was looking at earlier. restarting apache2 on puppetmaster1002 seems to have fixed it
[08:44:33] I see over on -private they're rebooting a puppetmaster right now
[08:45:29] I guess once that's done I'll see if the remaining alerts have cleared
[08:46:43] aha. I'm not on -private
[08:54:23] still two PuppetZeroResources errors, let's see
[08:56:38] moritzm: db1225 is still unhappy
[08:56:59] Warning: Unable to fetch my node definition, but the agent run will continue:
[08:57:09] Warning: Error 500 on SERVER: Server Error: Could not retrieve facts for db1225.
[08:57:09] eqiad.wmnet: Failed to find facts from PuppetDB at puppet:8140: undefined method
[08:57:09] `content' for nil:NilClass
[08:57:41] I'm retrying yet again; looks like it might be working this time
[08:58:24] moritzm: db1197 and db1247 are new failures
[08:58:58] jelto: https://gerrit.wikimedia.org/r/1005708
[08:59:00] (1225 has cleared)
[08:59:40] Emperor: the error persists; you'll see random recoveries and errors since the puppetmaster used by the agents is rotated
[09:00:17] moritzm: +1.
[09:00:24] moritzm: OK, thanks, I'll stop poking them for a bit then
[09:01:44] (+1 to removing the sad puppetmaster from me too, for what little that's worth :) )
[09:02:39] so the reboot did not help. I also still don't see the load (CPU, network) that dropped yesterday at 21:30: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=puppetmaster1002&var-datasource=thanos&var-cluster=puppet&from=now-24h&to=now
[09:04:45] I've merged r1005708 and ran puppet on puppetmaster1001
[09:05:07] I'm opening a task for further diagnosis
[09:05:11] I'll run puppet on the two currently-sad dp nodes
[09:08:21] Hm, still playing whack-a-mole but I've not had repeats yet
[09:08:29] great, thanks!
[09:08:29] moritz: looking at the metrics, it seems your apache restart lowered the puppet failures a bit and then it got worse again (due to the reboot): https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=now-12h&to=now&viewPanel=6
[09:08:29] Not sure if those metrics make sense when puppet is disabled, but probably something to investigate (restarting apache vs restarting the VM)
[09:10:10] does alertmanager have the equivalent of icinga's "force a recheck of these alerts on this host"?
[09:11:04] jelto: thanks, could you please add this to https://phabricator.wikimedia.org/T358187 as well? I had just opened it before I saw your followup
[09:11:48] keeping an eye on https://puppetboard.wikimedia.org/nodes?status=failed, but things should be recovering
[09:12:04] and on the upside I'll use this opportunity to discuss next steps with DBAs to move forward with moving servers to Puppet 7 :-)
[09:15:16] moritzm: we can move misc servers I think
[09:16:48] yes, puppet failures are going down: https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&from=now-6h&to=now&viewPanel=6
[09:21:33] marostegui: ack, I'll send a mail to sre-data-persistence later, then we can sort out the details
[09:24:20] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=team%3Ddata-persistence still has one left, but that node ran puppet successfully 14 minutes ago, so I don't know why the alerts haven't caught up :(
[09:28:24] I just clicked the link and it shows up empty now
[09:32:39] I have https://gerrit.wikimedia.org/r/c/operations/alerts/+/1005712 to move the spammy alert to warning, if anyone can take a look
[09:34:04] puppet failures look recovered (below 1% in eqiad already)
[09:39:36] cheers Emperor
[09:39:55] should be better going forward
[09:40:31] 👍
[19:59:37] We got communication of MaxMind changing their URLs: T358268. DPE SRE should be able to take care of that, but if you know of specific places where we download MaxMind that are non-obvious, please let us know on the ticket.
[19:59:38] T358268: Update maxmind download to pull databases from new url - https://phabricator.wikimedia.org/T358268
[20:30:43] sre-collab: It looks like all of our gitlab wmf-debci pipelines are broken when they previously worked. Is there something we need to change? E.g. https://gitlab.wikimedia.org/repos/sre/fifo-log-demux/-/ci/editor?branch_name=bookworm-wikimedia
[22:30:04] brett: I think the issue is that `.build_wmf_deb` isn't defined anywhere. Is it possible that you're looking for `.build_ci_deb`?
[22:33:33] Hm, but that might not be right either.
[22:44:18] Thanks for taking a look. It's puzzling
[22:46:40] Yeah, it's strange. Mostly strange that it worked before...
[23:20:56] brett: I'll take a further look in the morning, not making any progress now.
[23:27:50] I suspect you may want to include includes.yml, not builddebs.yml; sorry, this got shuffled around with https://gitlab.wikimedia.org/repos/sre/wmf-debci/-/merge_requests/5 and following changes.
[23:28:27] "document the changes that result" is on my TODO list and I've not quite got to it; it Just Worked for people who were just using builddebs.yml, but not for folks doing includes of it. Apologies
[23:29:39] but note that includes.yml has its own build_ci_deb job now.
[23:31:56] it looks like you might just be able to wholesale use builddebs.yml (i.e. just set it as your CI/CD file); you don't look to be doing anything different in that CI file you link
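
(For reference, a minimal .gitlab-ci.yml along the lines discussed at [23:27:50] might look like the sketch below. It is only an illustration of the suggested fix, not the documented wmf-debci setup: the ref and the assumption that includes.yml sits at the repo root are guesses, so check the wmf-debci repo for the current layout and job names before using it.)

    # hypothetical .gitlab-ci.yml for a package repo such as fifo-log-demux;
    # pulls in the shared wmf-debci job definitions via includes.yml,
    # which (per the discussion above) now provides the build_ci_deb job
    include:
      - project: 'repos/sre/wmf-debci'
        ref: main            # assumption: default branch name
        file: 'includes.yml' # assumption: file lives at the repo root

(The alternative mentioned at [23:31:56] is to skip the include entirely and point the project's CI/CD configuration file setting directly at builddebs.yml in wmf-debci, if the repo needs nothing beyond the stock build jobs.)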