[00:32:25] FIRING: SystemdUnitFailed: prometheus-nft-throttling-denylist.service on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:32:40] FIRING: SystemdUnitFailed: prometheus-nft-throttling-denylist.service on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:39:10] Traffic, MobileFrontend, MediaWiki-Platform-Team (Radar), Patch-For-Review: MobileFrontend should declare "X-Subdomain" variance via "Vary" response header - https://phabricator.wikimedia.org/T390929#10930411 (Krinkle)
[05:36:58] <_joe_> looks like ocsp stapling doesn't work
[07:21:59] _joe_: hmmm care to provide more details?
[07:22:02] DC?
[07:22:51] if it's a US DC using Let's Encrypt that's more than expected, Let's Encrypt decommissioned their OCSP infrastructure a few months ago, see https://letsencrypt.org/2024/12/05/ending-ocsp/
[07:29:06] and in non-US DCs where we are using Google Trust Services OCSP stapling is working as expected: https://www.irccloud.com/pastebin/LNAaTkdM/
[07:45:39] oh, it triggered a new alert during the night
[07:56:50] <_joe_> vgutierrez: yes
[08:26:44] I got a CR ready to address it, I'll wait till sukhe gets online to review it
[08:26:50] fab.fur is OoO today :)
[08:27:11] sukhe: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1161397 that's for you :)
[08:28:46] and I'll simplify all the certificate picking logic soon given that we are getting rid of digicert so all the certs will be managed by acmechief
[08:32:40] FIRING: SystemdUnitFailed: prometheus-nft-throttling-denylist.service on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:36] no need to wait for sukhe, I reviewed it (plumbers finished and I'm waiting for the dentist appointment)
[08:49:33] fabfur: thx <3
[08:49:45] plumbers and dentist on the same day...
[08:49:49] that's torture
[08:52:00] Traffic, SRE, Patch-For-Review: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10931172 (Vgutierrez) In progress→Resolved a:Vgutierrez
[08:59:43] vgutierrez: the important thing is that the plumber doesn't put its hands in my mouth
[12:32:40] FIRING: SystemdUnitFailed: prometheus-nft-throttling-denylist.service on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:55:17] ^ this is insetup so not worried but will check shortly on the why
[13:06:03] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10932093 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0aa85ab2-b396-4052-aaf9-110edaba6d78) set by vgutierrez@cumin1002 for 1 day, 0:00:00 on 1 ho...
[13:30:31] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10932183 (Vgutierrez)
[13:38:20] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10932228 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=400f81d0-c245-4849-9749-f6118bd2595b) set by vgutierrez@cumin1002 for 1 day, 0:00:00 on 1 ho...
[13:55:19] Traffic: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 (ssingh) NEW
[13:55:29] Traffic: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456#10932312 (ssingh)
[13:55:37] Traffic: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456#10932314 (ssingh) p:Triage→Medium
[14:10:21] Traffic: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456#10932376 (ssingh)
[14:23:44] Traffic, Liberica, Patch-For-Review: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561#10932413 (Vgutierrez)
[15:07:25] FIRING: [4x] SystemdUnitFailed: prometheus-nft-throttling-denylist.service on durum7003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:12] sigh
[15:26:22] fixing it
[15:59:33] HTTPS, Traffic, SRE, Traffic-Icebox, and 2 others: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378#10932632 (ssingh) The ECH experiment has been reverted as of today.
[16:02:25] RESOLVED: [3x] SystemdUnitFailed: prometheus_nic_queue_cpu_eno12399np0.service on lvs3008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on lvs6002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:15:55] ^ known
[16:20:48] FIRING: [2x] PuppetZeroResources: Puppet has failed generate resources on lvs6002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:30:48] FIRING: [3x] PuppetZeroResources: Puppet has failed generate resources on lvs6001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[16:38:57] ^ this should be resolved
[16:38:59] artifact
[16:40:48] RESOLVED: [3x] PuppetZeroResources: Puppet has failed generate resources on lvs6001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:05:25] FIRING: [6x] SystemdUnitFailed: prometheus_nic_queue_cpu_ens3f0np0.service on lvs6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:40] looking ^
[17:06:33] hmm
[17:12:07] Failed to locate executable /usr/local/bin/prometheus-nic-queue-cpu:
[17:12:11] this is true though, now looking why not
[17:12:25] it's also the right iface, so that's not the issue
[17:28:57] oh lol... i messed up that if clause
[17:29:02] should be if !defined(File....)
[17:29:49] yep fixed in https://gerrit.wikimedia.org/r/1161576
[17:30:25] (I also missed it in the review and since PCC ran, well...)
[17:40:25] RESOLVED: [6x] SystemdUnitFailed: prometheus_nic_queue_cpu_ens3f0np0.service on lvs6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed