[00:03:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [00:39:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910537 [00:39:29] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910537 (owner: 10TrainBranchBot) [00:57:11] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/910537 (owner: 10TrainBranchBot) [01:12:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [01:34:59] RECOVERY - Check systemd state on cloudbackup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:32] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [01:39:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [01:57:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [02:00:32] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:17:33] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [02:18:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [02:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:26:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:00] (PowerSupply) resolved: (2) Power Supply - PS Redundancy - issue on ms-be2043:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=ms-be2043 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [02:46:17] RECOVERY - IPMI Sensor Status on ms-be2043 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [04:01:33] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:03] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:33:45] !log restart haproxy on cp1087 - T334448 [04:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:33:52] T334448: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448 [04:58:53] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:03:11] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:10:14] !log sudo cumin -b 1 -s 20 'A:swift-fe-codfw' 'systemctl restart swift-proxy.service' [05:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:02] !log hashar@deploy2002 Started deploy [integration/docroot@b816911]: Update Grafana URL [05:15:13] !log hashar@deploy2002 Finished deploy [integration/docroot@b816911]: Update Grafana URL (duration: 00m 11s) [05:39:19] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [05:39:21] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [05:39:29] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync [05:40:26] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [05:40:47] PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [05:41:47] !log $ helmfile --state-values-set roll_restart=1 -e codfw sync [05:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [05:45:35] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [06:02:17] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [06:05:17] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [06:09:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [06:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [06:22:19] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [06:23:18] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230422T0700) [08:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:42:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [09:44:31] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [10:02:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [10:05:32] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [10:12:49] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [10:21:53] (03PS1) 10Gergő Tisza: OAuth: Do not require approval for read-only grants on public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) [10:22:33] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [10:22:34] (03CR) 10CI reject: [V: 04-1] OAuth: Do not require approval for read-only grants on public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) (owner: 10Gergő Tisza) [10:23:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [10:30:17] (03PS2) 10Gergő Tisza: OAuth: Do not require approval for read-only grants on public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) [11:41:45] (03CR) 10Mainframe98: [C: 04-1] OAuth: Do not require approval for read-only grants on public wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) (owner: 10Gergő Tisza) [11:44:53] (03PS3) 10Gergő Tisza: OAuth: Do not require approval for read-only grants on public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) [11:58:31] (03PS1) 10Gerrit maintenance bot: Add btm to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/910538 (https://phabricator.wikimedia.org/T335216) [12:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:37:58] (03CR) 10Gergő Tisza: OAuth: Do not require approval for read-only grants on public wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/910815 (https://phabricator.wikimedia.org/T67750) (owner: 10Gergő Tisza) [13:47:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [13:49:22] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [14:07:18] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [14:10:17] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [14:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [14:27:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [14:28:18] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [14:51:33] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:53:13] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:53:31] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:54:53] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:58:11] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:00:07] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:08:41] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:10:19] RECOVERY - BFD status on cr2-eqsin is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:59:51] (03CR) 10Krinkle: [C: 03+1] Enable /srv/mediawiki symlink on prod deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/909302 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [17:47:32] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [17:49:32] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [18:07:31] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [18:10:31] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [18:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:20:15] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [18:24:27] (03CR) 10Zabe: [C: 03+1] Add btm to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/910538 (https://phabricator.wikimedia.org/T335216) (owner: 10Gerrit maintenance bot) [18:27:33] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [18:28:32] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [20:33:16] (NodeTextfileStale) firing: Stale textfile for labstore1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:36:11] (03PS1) 10Krinkle: Define dummy pass for passwords::excimer_ui_server [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) [21:52:17] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [21:54:16] 10ops-ulsfo: ManagementSSHDown - https://phabricator.wikimedia.org/T334782 (10phaultfinder) [21:56:43] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:57:33] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:08:53] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:09:41] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:12:18] 10ops-drmrs: ManagementSSHDown - https://phabricator.wikimedia.org/T334784 (10phaultfinder) [22:14:17] (PuppetCertificateAboutToExpire) firing: Puppet CA certificate labtest-puppetmaster.wikimedia.org is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:15:17] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T334785 (10phaultfinder) [22:20:16] (PowerSupply) firing: (2) Power Supply - PS Redundancy - issue on elastic2050:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Power_Supply_Failures - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&var-Sensor=Power%20Supply&var-server=elastic2050 - https://alerts.wikimedia.org/?q=alertname%3DPowerSupply [22:21:07] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:21:59] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:32:18] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder) [22:33:17] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T334665 (10phaultfinder)