[02:18:21] (SystemdUnitFailed) firing: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:21] (SystemdUnitFailed) firing: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:40] jbond: pcc-worker1004 fails to run Puppet due to `ca-certificates-java` 20230103 failing to install for some reason :]
[06:48:55] then I guess you are already aware and it is in progress
[06:52:06] which looks like a circular dependency issue between Debian packages ( https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129#33 )
[08:03:45] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442 (Joe)
[08:10:37] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442 (Joe) Things that I don't think we have to create such a cookbook: * programmatic way to merge changes in gerrit. I'm not sure if this could have some...
[08:52:34] hashar: thanks, yes it seems to be a bug upstream which i haven't had toime to look into yet
[08:53:04] there seems to be a dependency loop in upstream package
[08:53:17] yes i know
[08:56:35] I think ca-certificates-java 20230710 will end up being backport for stable/oldstable via point releases, but I need to sort out the finer details
[08:56:49] if this breaks things in the interim we can also do a similar build for apt.wikimedia.org?
[08:57:54] thanks moritzm ill take a look at this today so should have a better idea latet
[08:58:06] *later
[08:58:15] ack, sounds good
[09:10:24] netops, DC-Ops, Infrastructure-Foundations, SRE: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (cmooney) p:Triage→High
[09:20:36] netops, DC-Ops, Infrastructure-Foundations, SRE, Traffic: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (Volans) Adding #traffic for awareness.
[09:20:41] netops, DC-Ops, Infrastructure-Foundations, SRE, Traffic: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (Volans)
[09:38:31] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:21] (SystemdUnitFailed) firing: (5) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:41:42] netops, Infrastructure-Foundations, SRE: BFD flapping from cloudsw1-c8-eqiad (QFX5100) - https://phabricator.wikimedia.org/T341466 (cmooney) p:Triage→Medium
[12:54:34] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (RobH) Order Number - 1-228138359365 entered for remote hands to power cycle the device and reply back to the ticket to let us kno...
[13:23:32] (SystemdUnitFailed) firing: (2) netbox_ganeti_eqiad_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:13:28] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (cmooney) Equinix came back and said they rebooted. Device is reachable again: ` cmooney@mr1-eqsin> show system uptime Current t...
[14:59:04] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Management LAN in eqsin offline due to failure of mr1-eqsin - https://phabricator.wikimedia.org/T341447 (cmooney) p:High→Medium Device remains healthy after over an hour. In terms of what caused the initial problem the logs a...
[16:02:26] netops, Infrastructure-Foundations, SRE: BFD flapping from cloudsw1-c8-eqiad (QFX5100) - https://phabricator.wikimedia.org/T341466 (cmooney) Open→Resolved Session to cloudlb1001 is stable after over an hour so think this is good to close now with the fix of using longer timers ` cmooney@cloud...
[16:31:33] CAS-SSO, Infrastructure-Foundations: Better handling of memcached service - https://phabricator.wikimedia.org/T255132 (jbond)
[16:35:07] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (Andrew) I'm fine with making things more verbose for now, then we can trim out things that...
[16:48:05] SRE-tools, Infrastructure-Foundations, SRE, Puppet (Puppet 7.0): Cumin: update config to use new puppet7 infrastructre - https://phabricator.wikimedia.org/T341497 (jbond) p:Triage→Medium
[17:24:19] (SystemdUnitFailed) firing: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:32:12] SRE-tools, Infrastructure-Foundations, Goal, cloud-services-team (FY2022/2023-Q4): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri)
[17:32:19] SRE-tools, Infrastructure-Foundations, Goal, cloud-services-team (FY2022/2023-Q4): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri)
[18:35:49] SRE-tools, Infrastructure-Foundations: Add --depool-sleep runtime argument when using SRELBBatchRunner class - https://phabricator.wikimedia.org/T339151 (BCornwall) Open→Stalled a:BBlack This was under the request of @BBlack - I believe the intention was that this would be "good enough" for t...
[21:28:21] (SystemdUnitFailed) firing: upload_puppet_facts.service Failed on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed