[01:12:13] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:52:03] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:16:58] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:51] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9636601 (ayounsi) FYI it's alerting for one of its PSU being down, but we don't really care anymore : > asw-a-codfw> show system alarms > 1 alarms currently active > Ala...
[09:17:13] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:30:43] SRE-tools, cloud-services-team, Infrastructure-Foundations, Spicerack: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093#9636927 (aborrero) I was bitten by this recently. I think the proposal made to show at least _something_ in the logs with...
[10:53:27] SRE-tools, Spicerack, Puppet (Puppet 7.0): Spicerack puppetserver.destroy() raises an exception when certificate does not exist - https://phabricator.wikimedia.org/T360293 (taavi) NEW
[10:56:48] Mail, Infrastructure-Foundations, User-notice-archive: Stop sending change notification email if edit is done by a bot - https://phabricator.wikimedia.org/T356984#9637027 (Ladsgroup) >>! In T356984#9608335, @Tacsipacsi wrote: >>>! In T356984#9606145, @Ladsgroup wrote: >> Yeah, the second p...
[11:17:58] SRE-tools, Infrastructure-Foundations, Spicerack, Puppet (Puppet 7.0): Spicerack puppetserver.destroy() raises an exception when certificate does not exist - https://phabricator.wikimedia.org/T360293#9637119 (Volans) p:Triage→Medium That's indeed the current behaviour and clearly an error...
[11:31:27] netbox, netops, Infrastructure-Foundations, Patch-For-Review: Netbox: bug preventing removing a parent bridge in custom script automation - https://phabricator.wikimedia.org/T359629#9637159 (Aklapper)
[11:46:18] XioNoX, topranks: FYI I've opened a task with the list of 10Gb NICs we discussed last week; for now it's private out of an abundance of caution, feel free to open it if you feel it's ok
[11:46:38] volans: thx, yeah nothing private in there
[12:10:38] netops, DC-Ops, Infrastructure-Foundations: Take advantage of 10Gb NICs in the new network stack - https://phabricator.wikimedia.org/T360297#9637345 (ayounsi) p:Triage→Low Thanks for the task, nothing private in there. I think we should : 1/ filter out the hosts that are due for a refresh if...
[12:11:06] volans: commented, the main step is to write doc :)
[12:30:16] thanks
[13:12:17] moritzm: I'm reimaging idp-test1003, it's a bit weird to debug with two versions of Tomcat
[13:13:21] ack, but let me quickly check
[13:13:51] there was one issue with the patch you merged on the Friday before the Summit
[13:17:13] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:17:58] moritzm: Ah,... It's currently reimaging :-)
[13:18:06] left comments on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009709
[13:18:18] I think it's best if we sort this out and then re-re-reimage
[13:40:27] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748#9637912 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host idp-test1003.wikimedia.org with OS bookworm completed: - idp-test1003 (...
[13:58:55] moritzm: Something wonky is going on. The existing idp servers have server_enable_ssl = false, yet they have a keystore, which CAS thinks it needs even if it's not configured
[14:01:42] I'll have a look in a bit
[14:03:34] We also need to rebuild the CAS 6.6 package for Bookworm: it currently depends on Tomcat 10, so it pulls that dependency in and puts the WAR file in /var/lib/tomcat10
[14:06:49] I'll simply push reverts for the last three commits to the overlay repo, then we should be back to a clean tomcat9-preferring state
[14:15:06] I'll just try to create an empty keystore; I think it's OIDC that needs it
[14:21:58] Okay, that's better. There's some permission stuff that needs sorting. I'll look at that a bit later
[14:59:23] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9638240 (Papaul)
[15:02:16] SRE-tools, Infrastructure-Foundations, Spicerack, Puppet (Puppet 7.0): Spicerack puppetserver.destroy() raises an exception when certificate does not exist - https://phabricator.wikimedia.org/T360293#9638262 (taavi) If there would be a method to check whether a certificate exists or not we could...
[15:22:07] SRE-tools, Infrastructure-Foundations, Spicerack, Puppet (Puppet 7.0): Spicerack puppetserver.destroy() raises an exception when certificate does not exist - https://phabricator.wikimedia.org/T360293#9638308 (Volans) We do have `get_certificate_metadata()` that raises `spicerack.puppet.PuppetServ...
[15:29:32] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9638343 (Papaul)
[15:57:53] slyngs: I've uploaded a new cas build which defaults to tomcat 9 again
[16:11:58] (SystemdUnitFailed) firing: (2) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:14:46] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9638559 (Papaul)
[16:18:26] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9638580 (Papaul)
[16:18:39] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9638581 (Papaul)
[16:19:43] netops, Infrastructure-Foundations, ops-codfw, SRE, Patch-For-Review: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9638602 (Papaul)
[17:54:23] netops, Infrastructure-Foundations, ops-codfw, SRE: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9639269 (Papaul) Zeroize done on asw-a1 setups: - delete the member from the master - Disconnect both cable going to asw-a2 and asw-a7 - while login into to console r...
[19:21:58] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:46:58] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:21:58] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:46:58] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
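The T360293 thread in this log (puppetserver.destroy() raising when the certificate does not exist, and the suggestion to check existence first via get_certificate_metadata()) describes an "only destroy if present" pattern. Below is a minimal, self-contained sketch of that pattern. It is not Spicerack code: FakePuppetServer, CertificateNotFoundError and destroy_if_exists are hypothetical stand-ins. The real exception class name is truncated in the log, so it is deliberately not guessed. Only the method names destroy() and get_certificate_metadata() come from the log.

```python
"""Sketch of an idempotent certificate destroy (check first, then destroy).

FakePuppetServer, CertificateNotFoundError and destroy_if_exists are
HYPOTHETICAL stand-ins; they only mirror the method names mentioned in
T360293 and are not Spicerack's actual API.
"""


class CertificateNotFoundError(Exception):
    """Hypothetical error raised when a host has no certificate."""


class FakePuppetServer:
    """In-memory stand-in for a Puppet server's certificate store."""

    def __init__(self, hostnames):
        self._certs = {name: {"state": "signed"} for name in hostnames}

    def get_certificate_metadata(self, hostname):
        # Mimics a lookup that raises if the certificate is missing.
        try:
            return self._certs[hostname]
        except KeyError:
            raise CertificateNotFoundError(hostname) from None

    def destroy(self, hostname):
        # Mimics the reported behaviour: fails when there is nothing to destroy.
        if hostname not in self._certs:
            raise CertificateNotFoundError(hostname)
        del self._certs[hostname]


def destroy_if_exists(server, hostname):
    """Destroy the certificate only if it exists; return True if one was removed."""
    try:
        server.get_certificate_metadata(hostname)
    except CertificateNotFoundError:
        return False  # Nothing to clean up, so treat it as success.
    server.destroy(hostname)
    return True


if __name__ == "__main__":
    server = FakePuppetServer(["idp-test1003.wikimedia.org"])
    print(destroy_if_exists(server, "idp-test1003.wikimedia.org"))  # True
    print(destroy_if_exists(server, "idp-test1003.wikimedia.org"))  # False
```

Check-then-destroy leaves a small window in which the certificate could vanish between the two calls. An alternative is to call destroy() directly and catch the "not found" error it raises, if the real API distinguishes that case from other failures.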