[00:23:58] Mail, Infrastructure-Foundations, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), User-notice: Stop sending change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T356984#9553943 (Tacsipacsi) Since the train arrived, I’ve missed * Wikidata constraint report updates that...
[06:03:54] netops, DBA, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9554047 (Marostegui)
[07:25:17] netops, DC-Ops, Data-Persistence, Infrastructure-Foundations, and 5 others: Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547#9554129 (Marostegui)
[09:20:27] Mail, Infrastructure-Foundations, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), User-notice: Stop sending change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T356984#9554287 (TheDJ) >>! In T356984#9553943, @Tacsipacsi wrote: > All this because of obscure reasons li...
[10:14:26] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9554373 (taavi)
[10:20:18] Puppet, Infrastructure-Foundations: os-reports: KeyError: 'apt2001.wikimedia.org' - https://phabricator.wikimedia.org/T357884#9554395 (taavi)
[10:31:49] (PuppetZeroResources) firing: Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:34:49] (PuppetZeroResources) firing: Puppet has failed generate resources on ganeti2030:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:36:53] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:39:53] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on ganeti2018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:41:53] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:44:58] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on ganeti2018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:46:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:47:49] (PuppetZeroResources) firing: Puppet has failed generate resources on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:49:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on ganeti2018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:51:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:54:53] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on ganeti2018:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:56:57] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:57:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:59:54] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on ganeti2014:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:01:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:04:49] (PuppetZeroResources) firing: (9) Puppet has failed generate resources on ganeti1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:06:48] (PuppetZeroResources) firing: Puppet has failed generate resources on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:06:53] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:08:09] godog: FYI those alerts are quite confusing, my understanding is that they are per cluster and team but the message reports just one instance (while they affect many) and don't report cluster or team.
[11:09:53] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on ganeti1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:10:42] and there are so many of them that it is unclear which ones are alerting and which ones are recovering
[11:11:29] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9554542 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1032.eqiad.wmnet with OS...
[11:11:49] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:12:22] volans: indeed, I'm opening a task to tweak the alert so at least it doesn't spam as much. Re: multiple hosts, when the instance name is in the "summary" field, only one from the group alert is reported on irc whereas alerts.w.o has all
[11:13:29] yes but without the name of the cluster/team how can I know if the firing on foo1001 and the resolved on bar1001 was actually the same alert or 2 different ones?
[11:14:49] (PuppetZeroResources) firing: (8) Puppet has failed generate resources on ganeti1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:15:06] godog: ^^^ (forgot to mention)
[11:16:54] (PuppetZeroResources) firing: (6) Puppet has failed generate resources on idm2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:19:49] (PuppetZeroResources) firing: (9) Puppet has failed generate resources on ganeti1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:22:05] volans: I'm not sure I'm understanding what you'd like to see on IRC as the end result in this case, I'm running shortly to lunch and I'm available to chat later
[11:24:10] godog: if we fire multiple alerts for PuppetZeroResources for each cluster and team I would expect them to be mentioned, something like: firing: (6) Puppet has failed generate resources for hosts in cluster FOO and team BAR (idm2001)...
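(Editor's sketch of the message format proposed at 11:24:10. This is not the actual alerting or IRC relay code; the `Alert` class, the `irc_messages` function, and the `cluster` / `team` label names and values are all hypothetical and only illustrate the idea of grouping by cluster and team, then naming both next to one sample host.)

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical, simplified alert record; field names are assumptions."""
    alertname: str
    summary: str
    cluster: str   # assumed grouping label
    team: str      # assumed grouping label
    instance: str  # e.g. "idm2001:9100"
    status: str    # "firing" or "resolved"


def irc_messages(alerts):
    """Build one IRC line per (alertname, cluster, team, status) group."""
    groups = defaultdict(list)
    for alert in alerts:
        key = (alert.alertname, alert.cluster, alert.team, alert.status)
        groups[key].append(alert)

    messages = []
    for (alertname, cluster, team, status), members in sorted(groups.items()):
        # Strip the exporter port and sort so the sample host is stable.
        hosts = sorted(a.instance.split(":")[0] for a in members)
        count = f"({len(hosts)}) " if len(hosts) > 1 else ""
        messages.append(
            f"({alertname}) {status}: {count}{members[0].summary} "
            f"for hosts in cluster {cluster} and team {team} ({hosts[0]})"
        )
    return messages


if __name__ == "__main__":
    summary = "Puppet has failed generate resources"
    sample = [
        Alert("PuppetZeroResources", summary, "FOO", "BAR", "idm2001:9100", "firing"),
        Alert("PuppetZeroResources", summary, "FOO", "BAR", "idm2002:9100", "firing"),
    ]
    for line in irc_messages(sample):
        print(line)
```

(With cluster and team in every line, a "firing" on one host and a later "resolved" on another can be matched by their shared group rather than by the sample hostname, which is the concern raised in the surrounding messages.)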
[11:25:31] but in general, for any case in which we might fire the same alert multiple times, if the hostname reported can change at any re-fire/resolve, we need additional data in the message that allows an operator to recognize whether it is the same alert
[11:25:45] or a different one
[11:28:23] volans: ok thank you now I get it, I'll think on how to best address that
[11:29:31] thanks
[12:01:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on testvm2005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[12:04:00] netops, Infrastructure-Foundations, SRE, cloud-services-team, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9554726 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1032.eqiad.wmnet with OS book...
[12:11:10] created https://phabricator.wikimedia.org/T357900 for alerting on /run/puppetserver/restart_required
[12:16:44] moritzm: do you know why we do have this machinery in the first place?
[12:17:47] I've only learned about this mechanism an hour ago :-)
[12:18:00] but most certainly as a safety measure
[12:18:35] so that errors in a puppet config don't have an immediate blast radius via restarts of puppetserver.service triggered by Puppet itself
[12:18:44] but because we don't want to have puppet restart puppetserver?
[12:18:51] some sort of chicken and egg?
[12:18:58] ok
[12:20:56] yeah, I'm 99% sure that's the reason
[12:21:28] the mechanism makes perfect sense. _If_ one knows about it :-)
[12:22:28] ehehehe
[12:22:45] so probably unrelated to today's issue right?
[12:24:09] I think so, yes. figuring out about the other one is still TBD
[12:24:37] sorry for not having left one failing host running
[12:24:38] but if the change to the puppet ca would have actual effect, we would have seen similar issues (current patch was NOP)
[12:24:56] all good, I'll poke around
[12:25:14] and there will surely be gnutls updates again so can just as well try to repro with the next one
[12:25:38] eheheh
[13:53:02] Puppet, Infrastructure-Foundations: os-reports: KeyError: 'apt2001.wikimedia.org' - https://phabricator.wikimedia.org/T357884#9555018 (MoritzMuehlenhoff) Thanks! There is a pre-existing task, I'll merge that in.
[13:53:24] Puppet, Infrastructure-Foundations: os-reports: KeyError: 'apt2001.wikimedia.org' - https://phabricator.wikimedia.org/T357884#9555020 (MoritzMuehlenhoff)
[14:22:30] CAS-SSO, Infrastructure-Foundations, SRE: CAS-based services (?) lose the session after an hour - https://phabricator.wikimedia.org/T268233#9555141 (fgiunchedi)
[15:18:02] CAS-SSO, Infrastructure-Foundations: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748#9555268 (MoritzMuehlenhoff) p:Triage→Medium
[15:18:11] CAS-SSO, Infrastructure-Foundations: Move CAS to Java 17 - https://phabricator.wikimedia.org/T357749#9555281 (MoritzMuehlenhoff) p:Triage→Medium
[16:05:18] Mail, Infrastructure-Foundations, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), User-notice: Stop sending change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T356984#9555513 (Tacsipacsi) >>! In T356984#9554287, @TheDJ wrote: > "Why can’t those reasons be stated pub...
[16:08:53] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9555525 (cmooney) p:Triage→Low
[16:09:08] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9555526 (cmooney) p:Triage→Low
[16:09:21] netbox, Infrastructure-Foundations: Evaluate usage of Kubernetes/Wikikube Tags in netbox and replace them with something if possible - https://phabricator.wikimedia.org/T354169#9555527 (joanna_borun) p:Triage→Low
[16:12:43] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr1-drmrs) - https://phabricator.wikimedia.org/T357389#9555538 (ayounsi) p:Triage→Low a:ayounsi
[16:24:27] SRE-tools, Infrastructure-Foundations: Decommission cookbook: lock per switch - https://phabricator.wikimedia.org/T353513#9555609 (ayounsi) p:Triage→Medium
[17:16:26] netops, Infrastructure-Foundations, SRE: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9555767 (cmooney) p:Triage→Medium
[17:16:34] netops, Infrastructure-Foundations, SRE: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9555777 (cmooney)
[17:16:42] netops, Infrastructure-Foundations, SRE: BGP peering from LSW to K8s hosts using loopback IP not IRB - https://phabricator.wikimedia.org/T357619#9555778 (cmooney)
[17:16:50] netops, Infrastructure-Foundations, SRE: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9555767 (cmooney)
[17:23:59] netops, Infrastructure-Foundations, SRE: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9555808 (cmooney)
[20:31:49] netops, Infrastructure-Foundations, SRE: Update K8S BGP groups eqiad row e-f - https://phabricator.wikimedia.org/T357924#9556205 (cmooney) > We will need to arrange a window to push this out to the devices with service ops. I will discuss with them, but I think the easiest way forward may be to make...
[21:27:47] Mail, Infrastructure-Foundations, MW-1.42-notes (1.42.0-wmf.18; 2024-02-13), User-notice: Stop sending change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T356984#9556248 (AlbanGeller) The issue I've had is that, whenever a bot edits a page or file, I don't rece...