[02:33:49] FIRING: [3x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:03:49] FIRING: [3x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:04:04] FIRING: [3x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:05:09] ^ this should recover soon as I fixed it at https://phabricator.wikimedia.org/T393034#10784948
[06:05:15] Puppet runs well already on db1155
[06:08:49] FIRING: [3x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:13:49] RESOLVED: [2x] PuppetFailure: Puppet has failed on db2186:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:24:17] I wonder if those alerts are old, because all those hosts have been fixed and puppet runs totally fine
[06:24:18] 2025-05-02T06:01:57.013609+00:00 db2186 puppet-agent[1520548]: Using environment 'production'
[06:24:19] 2025-05-02T06:01:57.161014+00:00 db2186 puppet-agent[1520548]: Retrieving pluginfacts
[06:24:19] 2025-05-02T06:01:57.235297+00:00 db2186 puppet-agent[1520548]: Retrieving plugin
[06:24:19] 2025-05-02T06:01:57.909810+00:00 db2186 puppet-agent[1520548]: Loading facts
[06:24:19] 2025-05-02T06:02:22.997443+00:00 db2186 puppet-agent[1520548]: Caching catalog for db2186.codfw.wmnet
[06:24:19] 2025-05-02T06:02:23.542734+00:00 db2186 puppet-agent[1520548]: Applying configuration version '(7f090974cb) Manuel Arostegui - installserver: Format es2047|es2048'
[06:24:19] 2025-05-02T06:02:41.684447+00:00 db2186 puppet-agent[1520548]: Applied catalog in 18.41 seconds
[06:24:27] There is absolutely 0 errors on that run
[06:24:40] In fact the puppet board do not show them as failed
[06:25:10] https://puppetboard.wikimedia.org/node/db2186.codfw.wmnet
[06:25:32] All green since 07:53
[07:17:09] marostegui: yeah, looking at puppetboard all the hosts I'd flagged yesterday are now OK
[08:27:47] hi
[08:31:40] hi jynus! Hope you're feeling better :)
[08:32:55] Emperor: thank you, much less pain indeed
[09:07:34] Hi folks, could I get a review of installserver setup for new thanos backends, please? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140644
[09:59:09] Emperor: do you happen to know if backups alerted/created many issues in the last month?
[10:02:23] not asking for a formal update, just in case you remember someone asking for urgent help at this channel
[10:10:49] jynus: I don't think so, no. There's a bunch of reboots that need doing, though.
[10:11:50] T392804 (NDA)
[10:11:53] I would guess, there is a lot of pending upgrades that were fully paused
[12:15:30] jynus: No, no issues with backups
[12:16:13] nice
[15:59:13] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140752 for adding two new ms proxies if anyone feels like a late-Friday code review (I won't merge until next week)
[16:03:49] checking
[16:05:33] My only criticism would be if you can silence some of their services on alertmanager, I think they were alerting while setup was in progress a few hours ago
[16:07:24] https://i.imgflip.com/9so9mb.jpg
[16:24:11] I can downtime the hosts 'til Monday
[16:24:30] <3
[16:24:37] though they don't appear on alerts.wikimedia.org
[16:24:52] weird, they showed up to me before
[16:25:23] I filter on @state=active, @cluster=wikimedia.org, team=data-persistence
[16:25:23] What about https://alerts.wikimedia.org/?q=alertname%3DSwift%20https%20backend&q=team%3Dsre&q=%40receiver%3Dirc-spam ?
[16:25:55] I use "team=~(sre|data-persistence)" as things are sometimes weird
[16:26:11] ah, they're counting against sre not data-persistence
[16:26:14] + "alertname!=SystemdUnitFailed"
[16:27:03] thank you, it is not a big deal, it is just that they add to the noise when something actually important or new happens
[16:28:31] [downtimed]
[16:29:14] and to be fair, the dual stack icinga/alertmanager & automation is not helping either
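
The filter link pasted at 16:25:23 carries one percent-encoded `q=` parameter per filter expression. A minimal sketch of building such a link from plain filter strings follows; the function name `filter_url` is hypothetical, and it is only an assumption that the dashboard accepts the regex and negated filters quoted in the chat in the same `q=` form.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the link pasted in the log.
BASE_URL = "https://alerts.wikimedia.org/"


def filter_url(filters, base=BASE_URL):
    """Return a dashboard link with one q= parameter per filter expression.

    Hypothetical helper for illustration; it only mirrors the shape of the
    URL that appears in the log.
    """
    # quote_via=quote encodes spaces as %20, as in the pasted link
    # (urlencode's default, quote_plus, would emit '+').
    query = urlencode([("q", f) for f in filters], quote_via=quote)
    return f"{base}?{query}"


# Rebuilds the link from the log out of its three plain-text filters.
print(filter_url(["alertname=Swift https backend", "team=sre", "@receiver=irc-spam"]))

# The combined filter described in the chat (assumed to be accepted as-is).
print(filter_url(["team=~(sre|data-persistence)", "alertname!=SystemdUnitFailed"]))
```

Keeping the filters as a plain list makes it easy to add or drop one expression, such as the `alertname!=SystemdUnitFailed` exclusion, without hand-editing the encoded URL.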