[00:00:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[00:38:22] (SystemdUnitFailed) firing: (4) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:00:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:38:22] (SystemdUnitFailed) firing: (4) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:36] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[07:02:05] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[07:50:13] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (ayounsi) a:ayounsi Emailed the 2 networks again. I'll delete the sessions if they don't reply or fix them.
[08:38:22] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:52:46] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (fgiunchedi)
[09:58:22] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:04:59] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (ayounsi) With the addition of `L3` switches it makes sens to not only take into consideration OSPF or `L2` vlans. For unicast "regular" external...
[10:50:22] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (jbond) @fgiunchedi what is the probing software? we do have a b...
[10:58:22] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:06] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (klausman)
[11:28:35] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (klausman)
[11:48:43] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (fgiunchedi) The software in this case is prometheus blackbox ex...
[12:33:50] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (jbond) >>! In T351624#9344407, @fgiunchedi wrote: > The softwar...
[13:44:49] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:21:24] reminder to update the pad for the meeting ;)
[14:44:49] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:13:27] (SystemdUnitFailed) firing: (2) production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:14:49] (SystemdUnitFailed) firing: (3) prometheus_puppet_agent_stats.timer Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:22] (SystemdUnitFailed) firing: (3) prometheus_puppet_agent_stats.timer Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:49] (SystemdUnitFailed) firing: (4) prometheus_puppet_agent_stats.timer Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:28:23] (SystemdUnitFailed) firing: (6) prometheus_puppet_agent_stats.timer Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:33:23] (SystemdUnitFailed) firing: (8) prometheus_puppet_agent_stats.timer Failed on bast2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:34:49] (SystemdUnitFailed) firing: (11) prometheus_puppet_agent_stats.timer Failed on bast3007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:38:23] (SystemdUnitFailed) firing: (12) prometheus_puppet_agent_stats.timer Failed on bast3007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:39:49] (SystemdUnitFailed) firing: (13) prometheus_puppet_agent_stats.timer Failed on bast3007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:43:25] (SystemdUnitFailed) firing: (14) prometheus_puppet_agent_stats.timer Failed on bast3007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:52:22] Puppet, netbox, Infrastructure-Foundations, SRE, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (joanna_borun) a:jbond→cmooney
[15:52:48] SRE-tools, Infrastructure-Foundations: redfish: minimum version support - https://phabricator.wikimedia.org/T328593 (joanna_borun) a:jbond→Volans
[15:54:23] SRE-tools, Infrastructure-Foundations: Fix autorestart and debclient dependency - https://phabricator.wikimedia.org/T324229 (jbond) Open→Declined not enough information
[15:55:41] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase - https://phabricator.wikimedia.org/T318787 (joanna_borun) a:jbond→Volans
[16:09:33] CAS-SSO, Gerrit, Infrastructure-Foundations, SRE, and 3 others: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (jbond) a:jbond→None
[16:11:51] CFSSL-PKI, Infrastructure-Foundations, SRE, User-jbond: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (jbond) Open→Resolved
[16:12:43] SRE-tools, DC-Ops, Infrastructure-Foundations, SRE: Allow idrac ftp fetching of firmware updates (either to existing ftp or new solution) - https://phabricator.wikimedia.org/T283771 (jbond) Open→Resolved a:jbond→None @RobH closing this as we now have the upgrade-firmware cookbook...
[16:21:39] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (jbond)
[16:25:00] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Probes for centrallog hosts fail to validate with "x509: issuer name does not match subject from issuing certificate" - https://phabricator.wikimedia.org/T351624 (LSobanski) Removing #collaboration-services as I don't see any...
[18:18:48] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[18:35:49] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[19:44:49] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:15:55] topranks: are you working on cloudcephosd1003-4? there are pending changes in netbox for the dns (see icinga)
[22:19:29] volans: my bad yes, I added the hostnames to clear the netbox report
[22:19:36] I did try to run the cookbook but it failed earlier
[22:19:40] https://www.irccloud.com/pastebin/6BxSQ30e/
[22:19:53] running right now and seems ok... sorry we had a plumbing incident here which dragged me away
[22:20:14] ouch sorry to hear that, plumbing is never fun
[22:20:30] it wasn't too bad to begin with... I made it a whole lot worse of course :)
[22:20:37] that error was a while I didn't see it in the past was just netbox doing something weird with its cache
[22:20:41] ahahahah lol
[22:20:41] place is mostly dry again and the water is working so yay :P
[22:20:56] usually retrying in few minutes "fixes" it
[22:21:36] ok right, yeah I decided to do that and then never did it sry
[23:44:49] (SystemdUnitFailed) firing: production-images-weekly-rebuild.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed