[08:01:15] there are some puppet certs about to expire https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DPuppetCertificateAboutToExpire
[08:01:29] some of them are the puppetmaster ones, should we run the cookbook?
[08:03:06] yeah, these will still be used for some months sadly
[08:04:28] all right will run it later on
[08:04:38] there is also performance.discovery.wmnet in the list
[08:08:24] but afaics it is an old cert, we already moved to CFSSL in there
[08:09:04] yeah, it uses cfssl as the cert provider, not sure why the alert is still there for the old one
[08:09:36] netops, Infrastructure-Foundations: Enable and scrape gNMIc api Prometheus endpoint - https://phabricator.wikimedia.org/T375361 (ayounsi) NEW
[08:13:50] maybe we forgot to clear the cert on the puppetmaster side
[08:16:17] yeah done
[08:24:04] ack, nice
[08:28:47] for puppetmaster1001 the following was a bit unexpected
[08:28:48] removed '/var/lib/puppet/ssl/private_keys/sessionstore.discovery.wmnet.key'
[08:31:40] aside from that, all puppetmaster nodes renewed
[08:31:55] and now that I think about it, probably sre.puppet.renew-cert needs to be adapted for puppetserver
[08:34:54] oh good point, all those certs are so recent that we never used it for P7 hosts for sure
[08:35:23] same for decommission, it seems to work only on p5 atm
[08:39:05] netops, Infrastructure-Foundations, SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166669 (ayounsi) a:ayounsi
[08:54:10] netops, Infrastructure-Foundations, SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10166734 (ayounsi) Opened high priority JTAC case 2024-0923-266479 and attached logs/debug output.
[09:34:41] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#10166787 (MoritzMuehlenhoff)
[10:27:41] SRE-tools, Infrastructure-Foundations, SRE: debmonitor-client: Warning printed with su from buster - https://phabricator.wikimedia.org/T216832#10166945 (MoritzMuehlenhoff) Open→Resolved Updated deb has been rolled out fleetwide, closing.
[10:31:45] CFSSL-PKI, netops, Infrastructure-Foundations: sre.network.tls cookbook - CFSSL error: bad request - https://phabricator.wikimedia.org/T375179#10166957 (ayounsi) Open→Resolved
[11:05:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on ganeti2011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:10:49] FIRING: [8x] PuppetZeroResources: Puppet has failed generate resources on ganeti2011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:15:48] FIRING: [10x] PuppetZeroResources: Puppet has failed generate resources on ganeti2010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:20:48] FIRING: [14x] PuppetZeroResources: Puppet has failed generate resources on ganeti2010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:21:51] ^ just merged a fix, should recover soon
[11:25:48] FIRING: [20x] PuppetZeroResources: Puppet has failed generate resources on ganeti2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:30:48] FIRING: [23x] PuppetZeroResources: Puppet has failed generate resources on ganeti2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:35:48] FIRING: [24x] PuppetZeroResources: Puppet has failed generate resources on ganeti2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:40:48] FIRING: [24x] PuppetZeroResources: Puppet has failed generate resources on ganeti2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:45:48] FIRING: [22x] PuppetZeroResources: Puppet has failed generate resources on ganeti2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:50:48] FIRING: [21x] PuppetZeroResources: Puppet has failed generate resources on ganeti2009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[11:55:48] RESOLVED: [13x] PuppetZeroResources: Puppet has failed generate resources on ganeti2010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[12:17:25] FIRING: SystemdUnitFailed: apache2.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:27:25] RESOLVED: SystemdUnitFailed: apache2.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:16] SRE-tools, Infrastructure-Foundations, Spicerack: Support listing pooled / active authdns hosts (rather than all) - https://phabricator.wikimedia.org/T375014#10167416 (Volans) Thanks for the summary @ssingh. I have a local proposal that will send out when ready. There is one main point to decide and...
[13:28:35] netops, Infrastructure-Foundations: Enable and scrape gNMIc api Prometheus endpoint - https://phabricator.wikimedia.org/T375361#10167428 (ayounsi) Open→Resolved a:ayounsi Basic demo dashboard : https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?orgId=1&var-site=All
[13:31:04] netops, Infrastructure-Foundations, SRE: cr3-ulsfo incident 22 Sep 2024 - https://phabricator.wikimedia.org/T375345#10167444 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a9eff4bb-15d3-41a4-8dd6-65ccc0663c06) set by ayounsi@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their serv...
[18:03:25] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:49:19] netops, DC-Ops, Infrastructure-Foundations, ops-codfw: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418 (Papaul) NEW
[18:49:25] netops, DC-Ops, Infrastructure-Foundations, ops-codfw: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10168978 (Papaul) p:Triage→Medium
[18:50:15] netops, DC-Ops, Infrastructure-Foundations, ops-codfw: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419 (Papaul) NEW
[18:50:31] netops, DC-Ops, Infrastructure-Foundations, ops-codfw: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10168991 (Papaul) p:Triage→Medium
[22:03:25] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:20:05] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-c-codfw switch stack - https://phabricator.wikimedia.org/T375418#10169564 (Papaul)
[22:21:21] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10169567 (Papaul)
[23:51:43] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-d-codfw switch stack - https://phabricator.wikimedia.org/T375419#10169742 (Papaul)