[02:03:17] (PKICertificateExpiry) firing: (84) A certificate in the trust chain for aux_front_proxy expires in 11d 11h 43m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[02:53:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:08] (PKICertificateExpiry) firing: (84) A certificate in the trust chain for aux_front_proxy expires in 11d 7h 41m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[06:53:29] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:35:19] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9579221 (MoritzMuehlenhoff)
[09:11:24] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9579490 (MoritzMuehlenhoff)
[09:17:44] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9579500 (cmooney) Open→Resolved Patch tested again and still working consistently, I think the initial prob...
[09:18:17] (PKICertificateExpiry) firing: (84) A certificate in the trust chain for aux_front_proxy expires in 11d 4h 29m 25s - https://wikitech.wikimedia.org/wiki/PKI/CA_Operations - TODO - https://alerts.wikimedia.org/?q=alertname%3DPKICertificateExpiry
[10:29:15] is this cert expiration tracked somewhere, and is someone working on it? ^^^
[10:53:29] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:26] volans: I've silenced the alert. It's the new one I implemented last week, but it's triggering on the wrong certificates
[11:20:03] slyngs: but is that certificate actually expiring? do we need to renew it?
[11:20:07] the intermediate one
[11:20:35] No, I checked them manually and they are completely fine. The Icinga alert is also still in place and not triggering
[11:23:15] ok, so I'm lost on where the alert gets the "in 11d 4h 29m 25s" part
[11:23:18] :D
[11:23:38] is that the short term TTL of the final cert that gets auto-renewed?
[11:24:56] I'll just check, I thought it was 14 days, but apparently not. We do some seven day certs
[11:29:57] Default renewal is 11 days
[11:30:24] Meaning that my check for 14 days just triggers constantly
[11:30:46] Good job me
[11:33:17] To be precise, it is 952200 seconds, so a little more than 11 days
[11:34:24] lol
[11:34:40] Yeaah... We're redoing that one :-)
[12:02:14] CAS-SSO, Infrastructure-Foundations: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748#9580029 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host idp-test1003.wikimedia.org with OS bookworm
[13:44:01] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: get rid of WMF Production Patches - https://phabricator.wikimedia.org/T310717#9580347 (ayounsi) as well as https://github.com/dennisv/django-storage-swift/pull/113
[14:33:56] netops, Infrastructure-Foundations, SRE: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9580482 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2004.codfw.wmnet with...
[14:53:29] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:53:30] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9580655 (cmooney) >>! In T345803#9479281, @Papaul wrote: > @cmooney can we get those 2 hosts back in decom? Thanks I'm done with sretes...
[15:00:26] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803#9580694 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1002 for hosts: `sretest2004.codfw.wmnet` - sretes...
[15:12:08] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9580720 (cmooney)
[15:24:37] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9580752 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1002 for hosts: `testvm2001.codfw.wmnet` - testvm2001.codf...
[15:55:39] netops, DBA, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9580837 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=94e7352f-26c7-48ff-b2c5-61b1faed7b5a) set by cmooney@cumin1002 fo...
[15:56:52] netops, DBA, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9580845 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4a16f229-e545-4883-81ab-3b2ddd2d7636) set by cmooney@cumin1002 fo...
[16:15:48] netops, DBA, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9580894 (cmooney) All moves complete, everything looking good and back responding to ping :)
[16:17:08] netops, DBA, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9580897 (ABran-WMF) thanks! will repool!
[16:28:01] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review, cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9580929 (fnegri) Open→Stalled
[18:53:29] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:45:09] slyngs: do you happen to have a developer account handy that isn't in any groups or anything? would it be okay for me to create one for myself?
[20:52:48] cdanis: I have https://ldap.toolforge.org/user/majavah-test as mostly unprivileged for tests, feel free to use that if you want
[20:52:57] or just create an additional account
[20:53:46] thanks taavi sent a dm :)
[21:18:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:23:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed