[00:02:16] (03CR) 10Meno25: "Thank you in advance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [00:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:39:16] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910539 [00:39:20] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910539 (owner: 10TrainBranchBot) [00:55:43] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910539 (owner: 10TrainBranchBot) [01:52:34] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [01:54:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:09] (03CR) 10Jon Harald Søby: [C: 04-1] Add `fat` to InterwikiSortOrders (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [02:12:32] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [02:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:15:32] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [02:15:47] (03CR) 10Jon Harald Søby: [C: 04-1] "The previous update to this file was by me, adding 'guc' and 'gur'. Since then, we've also had a Wikipedia in 'anp' and Wiktionaries in 'c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [02:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:22:35] (03CR) 10Jon Harald Søby: [C: 04-1] Add `fat` to InterwikiSortOrders (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:33] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [02:33:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [04:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:20:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:57:18] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [05:59:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [06:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:17:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [06:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:20:17] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [06:37:18] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [06:38:19] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230423T0700) [08:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:47:49] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:54:01] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:55:11] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:57:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [09:59:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [10:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:17:32] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [10:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:20:32] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [10:37:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [10:38:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [10:54:31] (03PS1) 10Krinkle: webperf: enable libapache2-mod-php7.3 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [10:55:33] (03CR) 10CI reject: [V: 04-1] webperf: enable libapache2-mod-php7.3 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [10:57:43] (03PS1) 10Superpes15: [fywiki] Add portal and portal talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910857 (https://phabricator.wikimedia.org/T334807) [11:01:36] (03PS2) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [11:02:00] (03CR) 10CI reject: [V: 04-1] webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [11:02:59] (03PS3) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [11:55:56] (03PS3) 10Meno25: Add `fat` to InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) [12:04:13] (03PS4) 10Meno25: Update InterwikiSortOrders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) [12:13:01] (03PS4) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [12:15:01] (03CR) 10CI reject: [V: 04-1] webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [12:21:32] (03CR) 10Meno25: Update InterwikiSortOrders (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [12:24:19] (03CR) 10Meno25: Update InterwikiSortOrders (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [12:25:03] (03CR) 10Krinkle: [C: 03+1] "LGTM. Let's deploy this next week. I've created a calendar invite for shortly after the Tuesday afternoon backport window at https://wikit" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910427 (https://phabricator.wikimedia.org/T334550) (owner: 10Barakat Ajadi) [12:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:46:39] (03PS5) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [13:01:17] (03PS6) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [13:03:52] (03PS7) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [13:10:17] (03PS2) 10Krinkle: Define dummy pass for passwords::excimer_ui_server [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) [13:49:17] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:55:11] (03PS8) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [13:56:58] (03PS9) 10Krinkle: webperf: enable libapache2-mod-php7.4 on profile::webperf::site [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) [14:01:05] (03CR) 10Jon Harald Søby: [C: 03+1] "Thank you very much!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910723 (https://phabricator.wikimedia.org/T335019) (owner: 10Meno25) [14:02:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [14:04:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [14:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:22:16] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [14:25:17] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [14:28:29] (03CR) 10Krinkle: "This is cherry-picked on Beta Cluster deployment-puppetmaster04 together with https://gerrit.wikimedia.org/r/c/performance/docroot/+/91086" [puppet] - 10https://gerrit.wikimedia.org/r/910856 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle) [14:41:44] 10SRE: service::uwsgi should mask uwsgi.service - https://phabricator.wikimedia.org/T222874 (10Krinkle) [14:42:00] (03Abandoned) 10Reedy: Add WikiMirror [extensions/TrustedXFF] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/909266 (owner: 10Reedy) [14:42:03] (03Abandoned) 10Reedy: Generate readable PHP diffs [extensions/TrustedXFF] (wmf/1.41.0-wmf.4) - 10https://gerrit.wikimedia.org/r/908965 (owner: 10Reedy) [14:42:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [14:43:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [15:04:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:23:57] (03PS1) 10Krinkle: coal: Uninstall from webperf role and start decom [puppet] - 10https://gerrit.wikimedia.org/r/910889 (https://phabricator.wikimedia.org/T335242) [16:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:49:17] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:02:33] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [18:04:32] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [18:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:22:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [18:25:32] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [18:42:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [18:43:31] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [19:53:47] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:55:29] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:55:55] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:57:33] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:58:13] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:59:53] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:21] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:08:43] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:09:53] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:15:19] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:17:09] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:18:11] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:18:37] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:20:27] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:29:07] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:30:01] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:31:39] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:32:23] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:17:03] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1003.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service,rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:49:17] (NodeTextfileStale) firing: Stale textfile for sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:56:31] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:58:11] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:58:17] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:59:57] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:07:18] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [22:09:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [22:13:35] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:27:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [22:29:59] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:30:17] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [22:31:39] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:42:57] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:44:35] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:47:18] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [22:48:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [22:56:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:58:01] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:59:41] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:19] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:33:55] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:34:35] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:37:13] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:37:53] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:50:37] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:50:41] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down