[04:01:28] CFSSL-PKI, Infrastructure-Foundations, SRE, Patch-For-Review, Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (jhathaway) @jbond I spent quite a bit of time on this Friday, but also came up empty handed. I suspected some str...
[08:55:35] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (Volans) @fnegri thanks for the work on this! I think that as an interim workaround this is...
[09:29:46] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:35] moritzm: FYI I've downtimed debmonitor2003 for a week starting now
[10:07:09] ack
[10:29:00] topranks: Have you seen the "Idle BGP sessions" from Meta?
[10:32:08] slyngs: thanks yes, it's related to an issue we are having at DE-CIX in Dallas, our port is down in general (so not just Meta affected)
[10:32:11] thanks for the heads up!
[10:32:25] I'll follow up with dc-ops a little later when they're online
[10:33:11] Cool, I'll leave the email in the ops-maintenance inbox for now
[10:33:24] Well, not cool that it's down, but that we're aware ...
[10:39:25] slyngs: are you on clinic duty? I think it can be marked no action needed
[10:39:43] I'm filling in :-)
[10:39:43] it's an automated one so no need really to reply to them, hopefully we get it sorted soon
[10:40:06] good stuff :)
[11:09:03] netops, Infrastructure-Foundations, SRE: Connection errors from users on Vodafone DE (AS3209) [28.06.2023] - https://phabricator.wikimedia.org/T340670 (cmooney) Open→Resolved a:cmooney Session is re-established ~20 mins now and there has been no increase in NELs for this ASN. Marking as...
[11:30:00] moritzm: ok to delete /root/cookbooks_testing on cumin2002?
[11:30:09] AFAICT all the patches tested there were yours
[11:31:18] sure, I was about to myself later, but please go ahead!
[11:34:46] (SystemdUnitFailed) firing: uwsgi-puppetboard.service Failed on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:35:29] {done}
[11:44:46] (SystemdUnitFailed) firing: (2) uwsgi-puppetboard.service Failed on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:47:17] Puppet, SRE, Patch-For-Review, User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (LSobanski) @jbond could you confirm that this is still a valid request? If yes, do you think there is a better match for it than #infrastru...
[12:02:03] Puppet, Observability-Alerting, SRE, Patch-For-Review, User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (jbond)
[12:04:25] Puppet, Observability-Alerting, SRE, Patch-For-Review, User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (jbond) @LSobanski i have added observability alerting. Im not real sure if that's the best group but it is abo...
[12:39:31] CFSSL-PKI, Infrastructure-Foundations, SRE, Patch-For-Review, Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - https://phabricator.wikimedia.org/T340557 (jbond) Thanks for taking a look at this I ended up creating a [[ https://gist.github.com/b4ldr/6822facfe4454c9bf6...
[14:26:56] Puppet, Infrastructure-Foundations, SRE, LDAP: Should puppet auto-restart slapd? - https://phabricator.wikimedia.org/T171191 (jbond) i see the following in the puppet manifest so i think this has been fixed in the mean time File['/etc/ldap/slapd.conf'] ~> Service['slapd']
[14:27:29] Puppet, Infrastructure-Foundations, SRE, LDAP: Should puppet auto-restart slapd? - https://phabricator.wikimedia.org/T171191 (jbond) Open→Resolved a:jbond
[14:28:53] Puppet, Beta-Cluster-Infrastructure, Release-Engineering-Team (Seen), User-Joe: Re-think puppet management for deployment-prep - https://phabricator.wikimedia.org/T161675 (joanna_borun)
[14:29:33] Puppet, Infrastructure-Foundations, Puppet-Core, SRE, Technical-Debt: Uniform cluster nomenclature across puppet - https://phabricator.wikimedia.org/T159411 (joanna_borun)
[14:30:31] Puppet, Data-Engineering-Icebox, observability, User-Elukey: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (joanna_borun)
[14:31:57] Puppet, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (jbond)
[14:59:29] Puppet, Infrastructure-Foundations, User-jbond: Puppet Improvements 2021/2022 - https://phabricator.wikimedia.org/T294906 (jbond)
[14:59:44] Puppet, Infrastructure-Foundations, SRE, User-jbond: puppetmaster - ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB - https://phabricator.wikimedia.org/T255667 (jbond) Open→Declined This is a by product of having binary objects sent in the catalogue.
[15:01:35] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (jbond)
[15:16:11] CAS-SSO, Puppet, Infrastructure-Foundations, Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (joanna_borun) It's going to be fixed with puppet 7 upgrade.
[15:16:37] puppet-compiler, Infrastructure-Foundations: investigate state of puppet 7 - https://phabricator.wikimedia.org/T313387 (joanna_borun)
[15:16:42] CAS-SSO, Puppet, Infrastructure-Foundations, Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (joanna_borun) Open→Declined
[15:17:55] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (jbond)
[15:22:44] Puppet, Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (jbond)
[15:34:24] Puppet, Infrastructure-Foundations, SRE, User-jbond: Review puppetmaster SSL configuration - https://phabricator.wikimedia.org/T268040 (jbond) Open→Resolved a:jbond This will all change in puppet7
[15:36:06] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (jbond)
[15:40:03] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Work required to prepare for puppet 7 - https://phabricator.wikimedia.org/T265138 (jbond)
[15:44:46] (SystemdUnitFailed) firing: (2) uwsgi-puppetboard.service Failed on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:43] Puppet, Infrastructure-Foundations, SRE: empty hiera yaml file makes lookup fail - https://phabricator.wikimedia.org/T89957 (jbond) Open→Resolved a:jbond closing this, im guessing this has been fixed upstream in the mean time as we currently have [[ https://github.com/wikimedia/operations...
[15:56:28] Puppet, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (jbond)
[15:57:06] Puppet, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (jbond)
[15:57:37] Puppet, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (jbond)
[16:00:05] Puppet, Infrastructure-Foundations, Puppet-Core, SRE, and 2 others: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (jbond) I have updated the description i believe the first point was to ensure we had no nodes like the following ` lang=p...
[16:14:46] (SystemdUnitFailed) firing: (3) uwsgi-puppetboard.service Failed on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:14:46] (SystemdUnitFailed) firing: (3) uwsgi-puppetboard.service Failed on puppetboard1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed