[00:01:09] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS trixie [00:01:19] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5024.eqsin.wmnet with OS trixie [00:05:22] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5024.eqsin.wmnet with OS trixie [00:11:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 898.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:34:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 1.389% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:37:37] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage [00:38:13] !log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5019.eqsin.wmnet with reason: host reimage [00:39:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 0.2137% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:44:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [00:44:38] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5024.eqsin.wmnet with reason: host reimage [00:44:54] uh oh [00:45:18] cjd91: fyi we're probably about to get paged :) [00:45:47] Thanks for the heads 
up [00:47:33] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5019 is CRITICAL: connect to address 10.132.0.19 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [00:47:33] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp5019 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [00:48:48] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5019.eqsin.wmnet with reason: host reimage [00:59:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [00:59:44] Deployment mw-web.codfw.main in mw-web at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.codfw.main - ... 
[00:59:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [01:00:05] PROBLEM - Wikidough DoH Check -IPv6- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:01:57] RECOVERY - Wikidough DoH Check -IPv6- on doh3006 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:05:25] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [01:06:51] PROBLEM - SSH on tcp-proxy3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:07:33] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5019 is OK: HTTP OK: HTTP/1.1 200 OK - 47863 bytes in 0.942 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [01:07:41] RECOVERY - SSH on tcp-proxy3002 is OK: SSH OK - OpenSSH_10.0p2 Debian-7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:09:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 4.309% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:09:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [01:11:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web releases routed via main (k8s) 1.021s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:13:50] FIRING: ProbeDown: Service ganeti3006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:14:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 10.54% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [01:14:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [01:14:44] Deployment mw-web.codfw.main in mw-web at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=mw-web&var-deployment=mw-web.codfw.main - ... 
[01:14:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [01:15:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:16:05] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [01:16:07] PROBLEM - Wikidough DoH Check -IPv4- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:16:57] RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 80.18 ms [01:17:00] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5024.eqsin.wmnet with OS trixie [01:17:37] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:17:46] FIRING: GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [01:17:51] RECOVERY - Host ncredir3005 is UP: PING WARNING - Packet loss = 80%, RTA = 80.72 ms [01:17:57] RECOVERY - Wikidough DoH Check -IPv4- on doh3006 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:19:07] PROBLEM - Wikidough DoH Check -IPv6- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:19:17] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:19:55] RECOVERY - Host ncredir3005 is UP: PING WARNING - Packet loss 
= 33%, RTA = 80.64 ms [01:20:17] PROBLEM - Host tcp-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [01:20:37] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp5019 is OK: SSL OK - Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2026-05-13 04:44:41 +0000 (expires in 54 days) https://wikitech.wikimedia.org/wiki/HTTPS [01:20:53] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:20:55] RECOVERY - Host tcp-proxy3002 is UP: PING WARNING - Packet loss = 75%, RTA = 80.67 ms [01:21:07] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [01:21:12] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5019.eqsin.wmnet with OS trixie [01:21:21] RECOVERY - Host ncredir3005 is UP: PING OK - Packet loss = 0%, RTA = 80.61 ms [01:21:31] RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 80.22 ms [01:22:46] RESOLVED: GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [01:22:59] RECOVERY - Wikidough DoH Check -IPv6- on doh3006 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:23:53] PROBLEM - SSH on tcp-proxy3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:24:09] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [01:24:13] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:26:15] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [01:26:41] PROBLEM - Host tcp-proxy3002 is DOWN: PING 
CRITICAL - Packet loss = 100% [01:27:05] RECOVERY - Host doh3006 is UP: PING WARNING - Packet loss = 77%, RTA = 80.62 ms [01:27:05] RECOVERY - Host ncredir3005 is UP: PING WARNING - Packet loss = 77%, RTA = 80.64 ms [01:28:05] RECOVERY - Host ganeti3006 is UP: PING WARNING - Packet loss = 77%, RTA = 80.14 ms [01:28:53] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:29:01] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [01:29:07] RECOVERY - Host doh3006 is UP: PING OK - Packet loss = 0%, RTA = 80.52 ms [01:29:21] RECOVERY - Host ncredir3005 is UP: PING WARNING - Packet loss = 75%, RTA = 80.65 ms [01:30:53] RECOVERY - SSH on tcp-proxy3002 is OK: SSH OK - OpenSSH_10.0p2 Debian-7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:31:01] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:31:17] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [01:31:53] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [01:32:33] RECOVERY - Host ncredir3005 is UP: PING WARNING - Packet loss = 33%, RTA = 80.70 ms [01:33:50] FIRING: [2x] ProbeDown: Service ganeti3006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:35:19] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:35:43] PROBLEM - Host ganeti3007 is DOWN: PING CRITICAL - Packet loss = 100% [01:36:03] RECOVERY - Host ganeti3007 is UP: PING OK - Packet loss = 0%, RTA = 80.14 ms [01:36:57] PROBLEM - Wikidough DoT Check -IPv4- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:37:01] RECOVERY - Host doh3006 is UP: PING WARNING - Packet loss = 71%, RTA = 80.54 ms [01:37:07] PROBLEM - 
Wikidough DoH Check -IPv4- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:37:59] PROBLEM - SSH on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:38:27] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [01:38:41] FIRING: [2x] JobUnavailable: Reduced availability for job tcp_proxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:38:50] FIRING: [3x] ProbeDown: Service ganeti3006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:39:07] PROBLEM - Wikidough DoH Check -IPv6- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:39:13] RECOVERY - Host ncredir3005 is UP: PING WARNING - Packet loss = 90%, RTA = 80.64 ms [01:39:59] PROBLEM - Wikidough DoT Check -IPv4- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:39:59] PROBLEM - Wikidough DoT Check -IPv6- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:40:01] FIRING: [2x] JobUnavailable: Reduced availability for job tcp_proxy in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:40:13] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - 
Packet loss = 100% [01:40:49] RECOVERY - Wikidough DoT Check -IPv4- on doh3005 is OK: TCP OK - 3.200 second response time on 185.15.59.98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:40:59] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:41:09] RECOVERY - Host doh3005 is UP: PING WARNING - Packet loss = 66%, RTA = 80.49 ms [01:41:49] RECOVERY - Wikidough DoT Check -IPv6- on doh3006 is OK: TCP OK - 0.169 second response time on 2a02:ec80:300:3:185:15:59:100 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:41:59] RECOVERY - Host hcaptcha-proxy3002 is UP: PING WARNING - Packet loss = 33%, RTA = 80.52 ms [01:41:59] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:42:47] RECOVERY - Host doh3005 is UP: PING WARNING - Packet loss = 66%, RTA = 80.62 ms [01:43:25] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [01:43:50] FIRING: [4x] ProbeDown: Service ganeti3006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:45:05] PROBLEM - Wikidough DoH Check -IPv6- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:45:11] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:27] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:41] RECOVERY - Host hcaptcha-proxy3002 is UP: PING WARNING - Packet loss = 33%, RTA = 80.77 ms [01:46:57] PROBLEM - Wikidough DoT Check -IPv4- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:46:57] FIRING: GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) 
gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [01:47:21] PROBLEM - Host install3004 is DOWN: PING CRITICAL - Packet loss = 100% [01:47:39] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 80.65 ms [01:47:49] PROBLEM - Host ganeti3007 is DOWN: PING CRITICAL - Packet loss = 100% [01:48:09] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:48:19] RECOVERY - Host doh3005 is UP: PING OK - Packet loss = 0%, RTA = 84.61 ms [01:48:35] RECOVERY - Host doh3006 is UP: PING WARNING - Packet loss = 77%, RTA = 80.58 ms [01:48:50] FIRING: [6x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:48:59] PROBLEM - Wikidough DoT Check -IPv6- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:49:49] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [01:49:49] RECOVERY - Wikidough DoT Check -IPv6- on doh3005 is OK: TCP OK - 0.171 second response time on 2a02:ec80:300:3:185:15:59:98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:50:01] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:50:23] RECOVERY - Host doh3005 is UP: PING WARNING - Packet loss = 60%, RTA = 80.42 ms [01:50:23] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [01:50:51] RESOLVED: GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - 
https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [01:50:57] RECOVERY - Wikidough DoH Check -IPv6- on doh3005 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 1.367 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:50:59] PROBLEM - Wikidough DoT Check -IPv6- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:51:05] PROBLEM - Wikidough DoH Check -IPv4- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:51:59] RECOVERY - Host ganeti3007 is UP: PING WARNING - Packet loss = 60%, RTA = 80.16 ms [01:52:57] RECOVERY - Wikidough DoH Check -IPv4- on doh3005 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:53:59] PROBLEM - Wikidough DoT Check -IPv6- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:54:59] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [01:54:59] PROBLEM - SSH on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:55:01] FIRING: [2x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:03] PROBLEM - SSH on ganeti3007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:55:29] RECOVERY - Host bast3007 is UP: PING WARNING - Packet loss = 60%, RTA = 87.47 ms [01:56:05] PROBLEM - Wikidough DoH Check -IPv6- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:56:05] PROBLEM - Wikidough DoH Check -IPv4- on doh3005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [01:57:47] PROBLEM - SSH on bast3007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:58:37] RECOVERY - SSH on bast3007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:58:41] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:58:43] PROBLEM - Host doh3005 is DOWN: PING CRITICAL - Packet loss = 100% [01:59:27] RECOVERY - Host ncredir3005 is UP: PING WARNING - Packet loss = 90%, RTA = 80.96 ms [01:59:49] RECOVERY - Wikidough DoT Check -IPv6- on doh3005 is OK: TCP OK - 0.170 second response time on 2a02:ec80:300:3:185:15:59:98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:00:39] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [02:00:51] RECOVERY - Wikidough DoT Check -IPv6- on doh3006 is OK: TCP OK - 1.197 second response time on 2a02:ec80:300:3:185:15:59:100 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:00:51] RECOVERY - SSH on doh3006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:01:03] PROBLEM - Host 
ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [02:02:27] RECOVERY - Host ganeti3006 is UP: PING WARNING - Packet loss = 90%, RTA = 80.19 ms [02:02:57] RECOVERY - Wikidough DoH Check -IPv4- on doh3005 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.331 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:03:37] RECOVERY - Host ncredir3005 is UP: PING WARNING - Packet loss = 90%, RTA = 80.62 ms [02:04:57] RECOVERY - Wikidough DoT Check -IPv4- on doh3005 is OK: TCP OK - 7.243 second response time on 185.15.59.98 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:05:29] PROBLEM - Host ganeti3006 is DOWN: PING CRITICAL - Packet loss = 100% [02:05:31] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [02:06:03] FIRING: [3x] JobUnavailable: Reduced availability for job nginx in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:41] FIRING: [5x] JobUnavailable: Reduced availability for job nginx in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:57] RECOVERY - Wikidough DoH Check -IPv6- on doh3005 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.332 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:09:49] RECOVERY - SSH on doh3005 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:10:43] PROBLEM - Host ganeti3007 is DOWN: PING CRITICAL - Packet loss = 100% [02:12:53] RECOVERY - SSH on ganeti3007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:13:41] FIRING: [6x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:59] RECOVERY - Wikidough DoH Check -IPv6- on doh3006 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.333 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:15:47] RECOVERY - Host ganeti3007 is UP: PING WARNING - Packet loss = 50%, RTA = 80.18 ms [02:16:59] RECOVERY - Wikidough DoH Check -IPv4- on doh3006 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 0.335 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:17:05] RECOVERY - Host doh3006 is UP: PING WARNING - Packet loss = 60%, RTA = 80.62 ms [02:19:33] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [02:20:27] PROBLEM - Host ganeti3007 is DOWN: PING CRITICAL - Packet loss = 100% [02:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:23:41] FIRING: [6x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.43% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:33:41] FIRING: [6x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:01] FIRING: [6x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:03] RECOVERY - Host doh3006 is UP: PING WARNING - Packet loss = 71%, RTA = 80.67 ms [02:35:59] PROBLEM - SSH on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:36:09] PROBLEM - Wikidough DoH Check -IPv4- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:37:09] PROBLEM - Wikidough DoH Check -IPv6- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.75% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:37:59] PROBLEM - Wikidough DoT Check -IPv6- on doh3006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:41:27] PROBLEM - Host doh3006 is DOWN: PING CRITICAL - Packet loss = 100% [02:41:28] not sure why this is flapping but it's too late to think clearly so I will just downtime. [02:42:50] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on doh3006.wikimedia.org with reason: alerting is flapping [02:43:12] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on doh3005.wikimedia.org with reason: alerting is flapping [02:43:41] FIRING: [6x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:48:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:48:41] FIRING: [6x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:59] RECOVERY - Wikidough 
DoH Check -IPv4- on doh3006 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 1.358 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:50:24] (PS1) Fabian Kaelin: vector-search: add initial deployment chart [deployment-charts] - https://gerrit.wikimedia.org/r/1255948 (https://phabricator.wikimedia.org/T420379) [02:52:51] RECOVERY - Wikidough DoT Check -IPv4- on doh3006 is OK: TCP OK - 0.172 second response time on 185.15.59.100 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [02:53:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 24.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:53:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:53:56] FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [02:58:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:07:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.22% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:08:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.93% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [03:12:33] RECOVERY - Host doh3006 is UP: PING WARNING - Packet loss = 60%, RTA = 80.75 ms [03:13:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:13:46] FIRING: GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [03:18:46] RESOLVED: GerritHAProxyBackendUnavailable: Gerrit backend is unavilable for tcp-proxy (HAProxy) gerrit_ssh - https://wikitech.wikimedia.org/wiki/Gerrit/Operations#GerritHAProxyBackendUnavailable - 
grafana.wikimedia.org/d/459365f6-df37-48d6-8142-82b22c1875e7/gerrit-tcp-proxy?viewPanel=panel-15 - https://alerts.wikimedia.org/?q=alertname%3DGerritHAProxyBackendUnavailable [03:20:01] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:22:43] RECOVERY - Host ncredir3005 is UP: PING WARNING - Packet loss = 80%, RTA = 80.63 ms [03:24:27] PROBLEM - Host ncredir3005 is DOWN: PING CRITICAL - Packet loss = 100% [03:26:51] RECOVERY - SSH on doh3006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:33:27] RECOVERY - Host hcaptcha-proxy3002 is UP: PING WARNING - Packet loss = 75%, RTA = 80.81 ms [03:35:51] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [03:53:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:58:13] RECOVERY - Host doh3005 is UP: PING WARNING - Packet loss = 75%, RTA = 80.55 ms [03:58:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:03:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:08:41] FIRING: [4x] JobUnavailable: Reduced availability 
for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:15:01] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:15:47] RECOVERY - Host bast3007 is UP: PING WARNING - Packet loss = 90%, RTA = 80.56 ms [04:17:47] PROBLEM - SSH on bast3007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:21:09] RECOVERY - Wikidough DoH Check -IPv6- on doh3006 is OK: HTTP OK: HTTP/1.1 200 OK - 595 bytes in 7.485 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [04:22:11] PROBLEM - Host bast3007 is DOWN: PING CRITICAL - Packet loss = 100% [04:33:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:38:41] FIRING: [3x] JobUnavailable: Reduced availability for job nginx in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:40:01] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:42:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:48:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:53:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:05:25] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [05:09:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [05:13:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:15:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:23:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:30:16] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp5019.eqsin.wmnet [reason: trixie reimaging] [05:30:27] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp5024.eqsin.wmnet [reason: trixie reimaging] [05:49:05] FIRING: [6x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260320T0600) [06:08:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:15:01] FIRING: [3x] JobUnavailable: Reduced availability for job nginx in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:18:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:23:39] RECOVERY - SSH on bast3007 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:23:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:28:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:33:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:53:56] FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260320T0700) [07:08:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:15:01] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:29:30] SRE-tools, Infrastructure-Foundations, serviceops-radar: Add --min-uptime to cookbooks - https://phabricator.wikimedia.org/T419967#11731264 (Ajuanca) `start-datetime` flag of [367592](https://phabricator.wikimedia.org/T367592) acts exactly like the one we're discussing. IMHO, `--not-rebooted-since` i... [07:31:38] (PS2) Muehlenhoff: Apply installserver role to install4004 [puppet] - https://gerrit.wikimedia.org/r/1255582 (https://phabricator.wikimedia.org/T418993) [07:33:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:35:54] (CR) Muehlenhoff: [C:+2] Apply installserver role to install4004 [puppet] - https://gerrit.wikimedia.org/r/1255582 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff) [07:38:41] FIRING: [3x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:40:07] RECOVERY - Host hcaptcha-proxy3002 is UP: PING WARNING - 
Packet loss = 60%, RTA = 80.82 ms [07:41:18] (PS1) Tiziano Fogli: alertmanager/o11y: adjust YAML indentation for receiver key [puppet] - https://gerrit.wikimedia.org/r/1256076 (https://phabricator.wikimedia.org/T415317) [07:41:31] PROBLEM - SSH on hcaptcha-proxy3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:43:41] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:43:41] FIRING: [4x] JobUnavailable: Reduced availability for job mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:43:50] FIRING: [7x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:44:38] (CR) Tiziano Fogli: [C:+2] alertmanager/o11y: adjust YAML indentation for receiver key [puppet] - https://gerrit.wikimedia.org/r/1256076 (https://phabricator.wikimedia.org/T415317) (owner: Tiziano Fogli) [07:46:31] PROBLEM - Host hcaptcha-proxy3002 is DOWN: PING CRITICAL - Packet loss = 100% [07:47:33] (PS1) Jcrespo: mediabackup: Add missing parameter VGW_REGION corresponding to site [puppet] - https://gerrit.wikimedia.org/r/1256082 (https://phabricator.wikimedia.org/T420506) [07:48:05] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256082 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo) [07:49:54] RESOLVED: ProbeDown: Service centrallog1002:6514 has 
failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:54:16] (PS2) Jcrespo: mediabackup: Add missing parameter VGW_REGION corresponding to site [puppet] - https://gerrit.wikimedia.org/r/1256082 (https://phabricator.wikimedia.org/T420506) [07:54:18] FIRING: [7x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:54:30] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256082 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo) [07:54:54] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:58:52] (CR) Jcrespo: [C:+2] mediabackup: Add missing parameter VGW_REGION corresponding to site [puppet] - https://gerrit.wikimedia.org/r/1256082 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo) [08:00:27] ops-codfw, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup2005 power supplies fried or overvoltage - https://phabricator.wikimedia.org/T419970#11731315 (jcrespo) [08:00:32] ops-codfw, SRE, DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420613#11731317 (jcrespo) →Duplicate dup:T419970 [08:00:46] ops-codfw, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup2005 power supplies fried or overvoltage - 
https://phabricator.wikimedia.org/T419970#11731321 (jcrespo) Any update? [08:04:57] (Abandoned) A smart kitten: Revert "Delete old notifications of users" [extensions/Echo] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254156 (https://phabricator.wikimedia.org/T383948) (owner: A smart kitten) [08:05:21] RECOVERY - SSH on hcaptcha-proxy3002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:05:35] (Abandoned) A smart kitten: Revert "Create tests for NotificationMapper::deleteByUserAndAge" [extensions/Echo] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1254155 (https://phabricator.wikimedia.org/T383948) (owner: A smart kitten) [08:06:31] (PS1) Jcrespo: mediabackup: Followup bug fix to b3afe9d, fix syntax error [puppet] - https://gerrit.wikimedia.org/r/1256100 (https://phabricator.wikimedia.org/T420506) [08:06:44] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256100 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo) [08:06:53] (PS2) Jcrespo: mediabackup: Followup bug fix to b3afe9d, fix syntax error [puppet] - https://gerrit.wikimedia.org/r/1256100 (https://phabricator.wikimedia.org/T420506) [08:06:56] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256100 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo) [08:08:09] RECOVERY - Host ganeti3006 is UP: PING OK - Packet loss = 0%, RTA = 80.16 ms [08:08:13] RECOVERY - Host ncredir3005 is UP: PING OK - Packet loss = 0%, RTA = 80.60 ms [08:08:51] RECOVERY - Wikidough DoT Check -IPv6- on doh3006 is OK: TCP OK - 0.178 second response time on 2a02:ec80:300:3:185:15:59:100 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [08:08:53] RECOVERY - Host tcp-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.59 ms [08:09:18] FIRING: [4x] JobUnavailable: Reduced availability for job 
mtail in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:09:18] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:09:54] FIRING: [2x] MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [08:10:10] FIRING: [6x] ProbeDown: Ripe Atlas anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:10:11] RECOVERY - Host ganeti3007 is UP: PING OK - Packet loss = 0%, RTA = 80.10 ms [08:10:11] RECOVERY - Host hcaptcha-proxy3002 is UP: PING OK - Packet loss = 0%, RTA = 80.36 ms [08:10:19] RECOVERY - Host install3004 is UP: PING OK - Packet loss = 0%, RTA = 80.67 ms [08:10:39] RECOVERY - Host bast3007 is UP: PING OK - Packet loss = 0%, RTA = 80.70 ms [08:11:40] (CR) Jcrespo: [C:+2] mediabackup: Followup bug fix to b3afe9d, fix syntax error [puppet] - https://gerrit.wikimedia.org/r/1256100 (https://phabricator.wikimedia.org/T420506) (owner: Jcrespo) [08:14:18] RESOLVED: [3x] JobUnavailable: Reduced availability for job nginx in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:14:18] RESOLVED: [6x] ProbeDown: Ripe Atlas 
anchor atlas3001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:14:18] RESOLVED: [2x] MaxConntrack: Elevated conntrack usage on ganeti3006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [08:14:54] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:22:12] SRE, LDAP-Access-Requests: Grant Access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11731370 (Sarmbruster) [08:22:43] SRE, LDAP-Access-Requests: Grant Access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11731371 (Sarmbruster) @Scott_French Thanks, I've updated the requested parts. 
[08:29:18] FIRING: [2x] JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:34:18] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:39:18] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:44:18] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:57:25] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-codfw [09:07:26] SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11731433 (MPostoronca-WMF) Thank you, I've joined the `ops-l` mailing list and asked access for `spiderpig` [09:09:17] (PS1) JMeybohm: sre.k8s: Add cookbook to print network topology details of nodes [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) [09:09:18] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:12:44] SRE, Data-Persistence, Data-Persistence-Backup, Infrastructure Security, 
media-backups: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11731439 (jcrespo) After fixing some authentication and some region configuration... [09:13:40] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11731441 (Gehel) Resolved→Open Re-opening and moving to in progress to finalize rei... [09:15:01] SRE, Infrastructure-Foundations, netops: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW - https://phabricator.wikimedia.org/T420706 (cmooney) NEW p:Triage→High [09:15:25] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11731457 (BTullis) a:Jclark-ctr→BTullis [09:15:27] SRE, Infrastructure-Foundations, netops: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs - https://phabricator.wikimedia.org/T405562#11731460 (cmooney) [09:15:28] SRE, Infrastructure-Foundations, netops: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW - https://phabricator.wikimedia.org/T420706#11731459 (cmooney) [09:15:40] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:15:56] (PS2) JMeybohm: sre.k8s: Add cookbook to print network topology details of nodes [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) [09:16:43] SRE, Infrastructure-Foundations, netops: Eqiad: move row-wide vlan gateways to Nokia switches - https://phabricator.wikimedia.org/T416872#11731464 (cmooney) Unfortunately we hit another blocker with this so we will have to review the way forward. See T420706. 
[09:17:33] SRE, Infrastructure Security: Sensible updates of java.security properties - https://phabricator.wikimedia.org/T282545#11731483 (MoritzMuehlenhoff) Open→Declined Starting with Java 21 we've stopped using a hardened java.security file (since the settings we've initially disabled have now becom... [09:17:49] SRE, Data-Persistence, Data-Persistence-Backup, Infrastructure Security, media-backups: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11731485 (jcrespo) The trend is clear here: while old objects had some average si... [09:17:57] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [09:18:05] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11731487 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@c... [09:18:22] !log jayme@cumin1003 START - Cookbook sre.k8s.print-network-topology [09:18:25] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:18:45] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.print-network-topology (exit_code=0) [09:19:18] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:22] ops-codfw, DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420708 (phaultfinder) NEW [09:19:49] (PS3) JMeybohm: sre.k8s: Add cookbook to print network topology details of nodes [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) [09:19:54] ops-codfw, DC-Ops: Unresponsive management for backup2005.mgmt:22 - https://phabricator.wikimedia.org/T420708#11731519 (jcrespo) :'-( [09:19:56] !log jayme@cumin1003 START - Cookbook sre.k8s.print-network-topology [09:19:56] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:20:18] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.print-network-topology (exit_code=0) [09:22:30] (CR) Btullis: [C:+1] "Nice." [puppet] - https://gerrit.wikimedia.org/r/1255887 (https://phabricator.wikimedia.org/T419041) (owner: Ryan Kemper) [09:23:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:24:25] RECOVERY - Host an-worker1172 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [09:24:48] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:25:47] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[09:26:51] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1172.eqiad.wmnet with OS bullseye [09:27:00] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11731556 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin... [09:31:48] (PS1) Btullis: Temporarily set an-worker1172 into insetup mode [puppet] - https://gerrit.wikimedia.org/r/1256270 (https://phabricator.wikimedia.org/T420416) [09:32:32] (CR) Brouberol: [C:+2] kafka-mirrormaker: add the mirror_name pod label [deployment-charts] - https://gerrit.wikimedia.org/r/1255799 (https://phabricator.wikimedia.org/T417407) (owner: Brouberol) [09:33:30] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:33:52] (PS4) JMeybohm: sre.k8s: Add cookbook to print network topology details of nodes [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) [09:33:59] !log jayme@cumin1003 START - Cookbook sre.k8s.print-network-topology [09:34:10] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:34:26] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.print-network-topology (exit_code=0) [09:34:32] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), and 2 others: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11731582 (BTullis) The first reimage failed because of a partman issue. {F73241550} I'll put the... 
[09:35:34] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:35:48] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:36:29] (PS1) Kevin Bazira: ml-services: update gpt isvc image to one that supports configurable max_num_seqs [deployment-charts] - https://gerrit.wikimedia.org/r/1256273 (https://phabricator.wikimedia.org/T418350) [09:36:42] !log jayme@cumin1003 START - Cookbook sre.k8s.print-network-topology [09:36:56] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:37:02] (CR) Gehel: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1256270 (https://phabricator.wikimedia.org/T420416) (owner: Btullis) [09:37:04] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.print-network-topology (exit_code=0) [09:37:08] (PS5) JMeybohm: sre.k8s: Add cookbook to print network topology details of nodes [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) [09:37:10] !log brouberol@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/kafka-mirrormaker: apply [09:37:17] !log jayme@cumin1003 START - Cookbook sre.k8s.print-network-topology [09:37:41] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.print-network-topology (exit_code=0) [09:37:58] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:38:00] SRE, Data-Platform-SRE (2026-03-06 - 2026-03-27): Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11731588 (BTullis) a:BTullis [09:38:41] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[09:40:03] (CR) Ozge: [C:+1] ml-services: update gpt isvc image to one that supports configurable max_num_seqs [deployment-charts] - https://gerrit.wikimedia.org/r/1256273 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira) [09:42:34] (CR) Kevin Bazira: [C:+2] ml-services: update gpt isvc image to one that supports configurable max_num_seqs [deployment-charts] - https://gerrit.wikimedia.org/r/1256273 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira) [09:42:37] SRE, Infrastructure-Foundations, netops: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW - https://phabricator.wikimedia.org/T420706#11731593 (cmooney) Ticket 05547487 opened with Nokia. [09:44:27] (Merged) jenkins-bot: ml-services: update gpt isvc image to one that supports configurable max_num_seqs [deployment-charts] - https://gerrit.wikimedia.org/r/1256273 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira) [09:45:28] (PS6) JMeybohm: sre.k8s: Add cookbook to print network topology details of nodes [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) [09:45:34] !log jayme@cumin1003 START - Cookbook sre.k8s.print-network-topology [09:45:40] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:45:55] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.print-network-topology (exit_code=0) [09:46:21] !log jayme@cumin1003 START - Cookbook sre.k8s.print-network-topology [09:46:46] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.print-network-topology (exit_code=0) [09:47:54] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:49:23] FIRING: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:50:26] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:50:36] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:53:18] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:54:23] RESOLVED: HelmReleaseBadStatus: Helm release zarcillo/main on k8s-aux@eqiad in state pending-rollback - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-aux&var-namespace=zarcillo - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:55:55] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:56:12] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:56:34] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [09:57:20] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [09:57:55] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[09:58:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1255736 (https://phabricator.wikimedia.org/T420574) (owner: Kosta Harlan) [09:58:39] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:00:12] (CR) Blake: [C:+1] sre.k8s: Add cookbook to print network topology details of nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) (owner: JMeybohm) [10:00:18] (PS1) Btullis: Route dse-k8s API blackbox checks to team-data-platform [puppet] - https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) [10:00:40] (CR) Btullis: [C:+2] Temporarily set an-worker1172 into insetup mode [puppet] - https://gerrit.wikimedia.org/r/1256270 (https://phabricator.wikimedia.org/T420416) (owner: Btullis) [10:02:10] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:02:15] (PS1) JMeybohm: k8s.pool-depool-node: Use common network topology functions [cookbooks] - https://gerrit.wikimedia.org/r/1256288 (https://phabricator.wikimedia.org/T418142) [10:03:12] FIRING: HelmReleaseBadStatus: Helm release kserve/kserve-resources on k8s-mlstaging@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:04:55] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:05:07] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:05:47] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:08:12] RESOLVED: HelmReleaseBadStatus: Helm release kserve/kserve-resources on k8s-mlstaging@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=kserve - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:10:29] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:12:09] (PS7) JMeybohm: sre.k8s: Add cookbook to print network topology details of nodes [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) [10:12:09] (PS2) JMeybohm: k8s.pool-depool-node: Use common network topology functions [cookbooks] - https://gerrit.wikimedia.org/r/1256288 (https://phabricator.wikimedia.org/T418142) [10:12:46] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:13:02] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:17:53] (CR) JMeybohm: sre.k8s: Add cookbook to print network topology details of nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) (owner: JMeybohm) [10:19:02] (CR) Blake: sre.k8s: Add cookbook to print network topology details of nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) (owner: JMeybohm) [10:26:24] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-codfw [10:29:45] (CR) Blake: [C:+1] k8s.pool-depool-node: Use common network topology functions [cookbooks] - https://gerrit.wikimedia.org/r/1256288 (https://phabricator.wikimedia.org/T418142) (owner: JMeybohm) [10:30:43] RECOVERY - MariaDB Replica IO: s3 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:30:43] RECOVERY - MariaDB Replica SQL: s3 on clouddb1013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:33:09] (CR) JMeybohm: [C:+2] k8s.pool-depool-node: Use common network topology functions [cookbooks] - https://gerrit.wikimedia.org/r/1256288 (https://phabricator.wikimedia.org/T418142) (owner: JMeybohm) [10:33:13] (CR) JMeybohm: [C:+2] sre.k8s: Add cookbook to print network topology details of nodes [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) (owner: JMeybohm) [10:36:44] SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716 (Daria-WMDE) NEW [10:37:32] SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11731765 
(Daria-WMDE) @katiamusiolekwmde @WMDE-leszek hey, could you please approve the task and let me know if any additional information is needed from m... [10:38:04] (Merged) jenkins-bot: sre.k8s: Add cookbook to print network topology details of nodes [cookbooks] - https://gerrit.wikimedia.org/r/1256260 (https://phabricator.wikimedia.org/T418142) (owner: JMeybohm) [10:38:44] (Merged) jenkins-bot: k8s.pool-depool-node: Use common network topology functions [cookbooks] - https://gerrit.wikimedia.org/r/1256288 (https://phabricator.wikimedia.org/T418142) (owner: JMeybohm) [10:39:43] RECOVERY - MariaDB Replica Lag: s3 on clouddb1013 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:42:39] SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11731769 (WMDE-leszek) I approve this request on WMDE's end. Thank you. Should that not be clear from the request template: @Daria-WMDE is requesting "ana... [10:47:30] (CR) Hnowlan: [C:+1] rest-gateway: Add core API support (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1255725 (https://phabricator.wikimedia.org/T418146) (owner: Clément Goubert) [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260320T0700) [11:00:05] jelto, arnoldokoth, mutante, and arnaudb: May I have your attention please! GitLab version upgrades. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260320T1100) [11:06:41] ops-eqiad, SRE, DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T420634#11731792 (Jclark-ctr) a:Jclark-ctr @cmooney would you be availabe to assist with this id like to clean fiber a... [11:09:16] SRE, SRE-swift-storage, Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11731797 (tappof) To incorporate @herron’s comment, we’re exploring a couple of ideas to keep fresh blocks (60-90 days) from Prometheus instances in an SSD-backed bucket. The... [11:16:03] ops-eqiad, SRE, DC-Ops: Alert for device ps1-e1-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T420645#11731819 (Jclark-ctr) Open→Resolved a:Jclark-ctr rebalanced pdu shifting device off L3/L1 onto L2/L3 Sensor: Line, BA:L1, Current Value: 12.17 A...
[11:17:43] (PS1) Muehlenhoff: Point proxy in ulsfo to install4004 [dns] - https://gerrit.wikimedia.org/r/1256324 (https://phabricator.wikimedia.org/T418993) [11:24:25] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-codfw [11:25:22] (PS2) Btullis: Add analytics-fr-tech system user and corresponding groups [puppet] - https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) [11:26:47] (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1251146 (https://phabricator.wikimedia.org/T417213) (owner: Btullis) [11:27:32] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [11:27:40] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11731845 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@c... [11:27:59] (PS1) Muehlenhoff: Update DHCP server in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1256335 (https://phabricator.wikimedia.org/T418993) [11:28:25] (CR) CI reject: [V:-1] Update DHCP server in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1256335 (https://phabricator.wikimedia.org/T418993) (owner: Muehlenhoff) [11:33:05] SRE, Data-Persistence, Data-Persistence-Backup, Infrastructure Security, media-backups: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11731866 (jcrespo) Backups are slowly flowing on eqiad, too: ` db1204.eqiad.wmnet...
[11:35:45] (PS2) Muehlenhoff: Update DHCP server in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1256335 (https://phabricator.wikimedia.org/T418993) [11:39:18] FIRING: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:42:06] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1172.eqiad.wmnet with OS bullseye [11:42:18] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11731877 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin... [11:43:21] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [11:43:30] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11731880 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@c... [11:44:18] RESOLVED: [8x] BFDdown: BFD session down between cr3-ulsfo and 198.35.26.14 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [11:47:50] (PS1) Muehlenhoff: Record LDAP access for kineticpelagic [puppet] - https://gerrit.wikimedia.org/r/1256353 [11:53:48] (CR) Muehlenhoff: [C:+2] Record LDAP access for kineticpelagic [puppet] - https://gerrit.wikimedia.org/r/1256353 (owner: Muehlenhoff) [11:54:15] SRE, SRE-swift-storage, Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11731933 (MatthewVernon) What sort of storage volume are we talking about here? 
The thanos-swift cluster has some lowlatency storage, which is largely unused; each server has... [12:01:01] (PS1) Kevin Bazira: ml-services: lower parallel prefilling and concurrent decoding to decrease gpt isvc latency [deployment-charts] - https://gerrit.wikimedia.org/r/1256363 (https://phabricator.wikimedia.org/T418350) [12:19:37] SRE, Infrastructure-Foundations, LDAP: Migrate the r/w LDAP servers to Trixie and MDB storage (and private IPs) - https://phabricator.wikimedia.org/T331699#11731962 (MoritzMuehlenhoff) [12:20:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org [12:24:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org [12:25:18] (PS1) Effie Mouzeli: memcached: add correct hiera values for nftables [puppet] - https://gerrit.wikimedia.org/r/1256370 [12:25:37] (CR) CI reject: [V:-1] memcached: add correct hiera values for nftables [puppet] - https://gerrit.wikimedia.org/r/1256370 (owner: Effie Mouzeli) [12:25:46] (PS1) Muehlenhoff: Switch our servers to use deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/1256371 (https://phabricator.wikimedia.org/T416707) [12:28:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org [12:28:27] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256371 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff) [12:28:38] (Abandoned) Effie Mouzeli: memcached: add correct hiera values for nftables [puppet] - https://gerrit.wikimedia.org/r/1256370 (owner: Effie Mouzeli) [12:31:07] (PS1) Effie Mouzeli: memcached: fix hieradata key for nftables in eqiad [puppet] - https://gerrit.wikimedia.org/r/1256372 [12:31:38] (CR) CI reject: [V:-1] memcached: fix hieradata key for nftables in eqiad [puppet] - https://gerrit.wikimedia.org/r/1256372 (owner: Effie Mouzeli)
[12:31:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org [12:32:13] btullis@cumin1003 reimage (PID 935368) is awaiting input [12:32:25] (PS2) Effie Mouzeli: memcached: fix hieradata key for nftables in eqiad [puppet] - https://gerrit.wikimedia.org/r/1256372 [12:32:55] (CR) CI reject: [V:-1] memcached: fix hieradata key for nftables in eqiad [puppet] - https://gerrit.wikimedia.org/r/1256372 (owner: Effie Mouzeli) [12:34:14] (PS3) Effie Mouzeli: memcached: fix hieradata key for nftables in eqiad [puppet] - https://gerrit.wikimedia.org/r/1256372 [12:34:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1005.eqiad.wmnet [12:34:17] (CR) Hnowlan: [C:+1] kafkamon: rename class [puppet] - https://gerrit.wikimedia.org/r/1253505 (https://phabricator.wikimedia.org/T418858) (owner: Herron) [12:35:34] !log jiji@cumin1003 END (ERROR) - Cookbook sre.memcached.roll-reboot-restart (exit_code=97) rolling reboot on A:memcached-codfw [12:40:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1005.eqiad.wmnet [12:41:34] (CR) JMeybohm: [C:+1] sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: Blake) [12:44:18] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:48] (CR) Ozge: [C:+1] ml-services: lower parallel prefilling and concurrent decoding to decrease gpt isvc latency [deployment-charts] - https://gerrit.wikimedia.org/r/1256363 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira) [12:45:02] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host sretest2001.codfw.wmnet [12:45:42] (PS4) Effie Mouzeli: memcached: fix hieradata key for nftables in eqiad [puppet] - https://gerrit.wikimedia.org/r/1256372 [12:50:13] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:50:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2001.codfw.wmnet [12:51:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest2006.codfw.wmnet [12:55:10] !log cparle@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [12:56:14] !log cparle@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [12:57:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2006.codfw.wmnet [12:58:04] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1256372 (owner: Effie Mouzeli) [12:58:35] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:58:53] (CR) Effie Mouzeli: [C:+2] memcached: fix hieradata key for nftables in eqiad [puppet] - https://gerrit.wikimedia.org/r/1256372 (owner: Effie Mouzeli) [12:59:44] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
[13:00:52] (CR) Kevin Bazira: [C:+2] ml-services: lower parallel prefilling and concurrent decoding to decrease gpt isvc latency [deployment-charts] - https://gerrit.wikimedia.org/r/1256363 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira) [13:02:52] (Merged) jenkins-bot: ml-services: lower parallel prefilling and concurrent decoding to decrease gpt isvc latency [deployment-charts] - https://gerrit.wikimedia.org/r/1256363 (https://phabricator.wikimedia.org/T418350) (owner: Kevin Bazira) [13:03:50] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:04:16] (PS1) Effie Mouzeli: memcached: fix hieradata key for nftables in codfw [puppet] - https://gerrit.wikimedia.org/r/1256380 [13:04:18] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:03] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:08:36] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-canary [13:09:19] SRE, Traffic: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11732131 (BBlack) TFO is still configured in our TLS terminators. We'll have to investigate to figure out what has gone wrong here. Possibly this is being stripped by our loadbalancers.
[13:10:00] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1256380 (owner: Effie Mouzeli) [13:14:09] !log sukhe@cumin1003 START - Cookbook sre.hosts.remove-downtime for doh[3005-3006].wikimedia.org [13:14:11] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for doh[3005-3006].wikimedia.org [13:14:35] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-canary [13:16:39] (PS2) JMeybohm: wikikube: Add wikikube-worker[1335-1349].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1255689 (https://phabricator.wikimedia.org/T418259) [13:17:31] (CR) JMeybohm: [C:+2] wikikube: Add wikikube-worker[1335-1349].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1255689 (https://phabricator.wikimedia.org/T418259) (owner: JMeybohm) [13:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:50] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:29:32] FIRING: [2x] KubernetesCalicoDown: wikikube-worker1341.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:29:43] that's me adding nodes - all good [13:30:14] (PS1) Jgreen: Switch fundraising bastion to codfw to move traffic for eqiad OS upgrade. 
[dns] - https://gerrit.wikimedia.org/r/1256383 [13:31:39] (CR) Brouberol: [C:-1] Route dse-k8s API blackbox checks to team-data-platform (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: Btullis) [13:32:18] (PS1) Kamila Součková: Temporarily add shellbox-icu to $wgShellboxUrls [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) [13:33:05] (CR) CI reject: [V:-1] Temporarily add shellbox-icu to $wgShellboxUrls [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková) [13:33:43] !log jiji@cumin1003 START - Cookbook sre.memcached.roll-reboot-restart rolling reboot on A:memcached-eqiad [13:37:18] (CR) CDanis: [C:+1] dse-k8s: Add CFSSL profile for longer-lived certificates (6 mo). [puppet] - https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: Bking) [13:39:22] !log bking@deploy2002 restarting opensearch-ipoid cluster to apply new certificates [13:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:33] FIRING: [5x] KubernetesCalicoDown: wikikube-worker1336.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:44:01] SRE, Data-Persistence, Data-Persistence-Backup, Infrastructure Security, media-backups: Unexpected media growth led to low disk resources on several media backup hosts - https://phabricator.wikimedia.org/T410028#11732227 (jcrespo) Open→Resolved a:jcrespo [13:44:32] FIRING: [8x] KubernetesCalicoDown: wikikube-worker1336.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:47:39] !log dpogorzelski@deploy2002 helmfile 
[ml-staging-codfw] START helmfile.d/admin 'sync'. [13:48:18] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:49:06] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:49:33] FIRING: [11x] KubernetesCalicoDown: wikikube-worker1336.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:52:10] (CR) Jgreen: [C:+2] Switch fundraising bastion to codfw to move traffic for eqiad OS upgrade. [dns] - https://gerrit.wikimedia.org/r/1256383 (owner: Jgreen) [13:52:35] !log jgreen@dns1004 START - running authdns-update [13:54:06] !log jgreen@dns1004 END - running authdns-update [13:54:33] RESOLVED: [11x] KubernetesCalicoDown: wikikube-worker1336.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:56:25] (CR) Effie Mouzeli: [C:+2] memcached: fix hieradata key for nftables in codfw [puppet] - https://gerrit.wikimedia.org/r/1256380 (owner: Effie Mouzeli) [13:58:30] (PS2) Kamila Součková: Temporarily add shellbox-icu to $wgShellboxUrls [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) [13:58:47] SRE, Data-Persistence, Data-Persistence-Backup, media-backups, Patch-For-Review: Evaluate garage as a replacement for an S3-compatible replacement for minio - https://phabricator.wikimedia.org/T410020#11732272 (jcrespo) Open→Resolved We evaluated Garage and while it is a nice clou...
[13:59:22] (CR) CI reject: [V:-1] Temporarily add shellbox-icu to $wgShellboxUrls [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková) [13:59:47] FIRING: [12x] KubernetesCalicoDown: wikikube-worker1335.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:01:05] (PS3) Kamila Součková: Temporarily add shellbox-icu to $wgShellboxUrls [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) [14:02:26] (PS10) Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) [14:04:22] (PS11) Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) [14:07:19] (PS12) Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) [14:08:10] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[56] - https://phabricator.wikimedia.org/T419892#11732318 (Andrew) [14:08:32] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet [14:08:58] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11732321 (Andrew) [14:09:38] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install cloudcephosd1054 - https://phabricator.wikimedia.org/T416395#11732334 (Andrew) [14:09:47] ops-eqiad, SRE, DC-Ops: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11732335 (Andrew) [14:10:12] ops-eqiad, SRE, DC-Ops, 
cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd105[56] - https://phabricator.wikimedia.org/T419892#11732340 (Andrew) [14:10:59] (CR) Kamila Součková: "Adding shellbox-icu globally for simplicity, will only configure `wgTempCategoryCollations` for (and run `updateCollation.php` against) th" [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková) [14:14:31] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet [14:16:24] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [14:16:29] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [14:18:23] (PS13) Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) [14:20:52] (PS1) Andrew Bogott: Initial entries for cloudcephosd105[3-6] [puppet] - https://gerrit.wikimedia.org/r/1256392 (https://phabricator.wikimedia.org/T416394) [14:21:10] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet [14:22:03] (PS14) Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) [14:23:34] (PS15) Blake: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) [14:24:24] ops-eqiad, SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install cloudcephosd1054 - https://phabricator.wikimedia.org/T416395#11732400 (Andrew) eqiad folks: these hosts are untested hardware with a novel drive configuration. I do not expect partman to work on the first go! The intended drive setu...
[14:24:31] ops-eqiad, SRE, DC-Ops, Patch-For-Review: Q3:rack/setup/install cloudcephosd1053 - https://phabricator.wikimedia.org/T416394#11732402 (Andrew) eqiad folks: these hosts are untested hardware with a novel drive configuration. I do not expect partman to work on the first go! The intended drive setu... [14:24:47] RESOLVED: KubernetesCalicoDown: wikikube-worker1335.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1335.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:48] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q3:rack/setup/install cloudcephosd105[56] - https://phabricator.wikimedia.org/T419892#11732403 (Andrew) eqiad folks: these hosts are untested hardware with a novel drive configuration. I do not expect partman to work on... [14:26:56] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet [14:27:49] !log jayme@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1335-1349].eqiad.wmnet [14:27:53] !log jayme@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1335-1349].eqiad.wmnet [14:28:34] SRE, LDAP-Access-Requests: Grant Access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11732407 (Scott_French) @Sarmbruster - Thank you! 
Apologies for the imprecision - for //Contract contact person// could you add the WMDE point of contact you'll be working with [14:29:43] !log blake@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on P{wikikube-worker[2001-2002].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [14:29:47] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2001.codfw.wmnet [14:30:21] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2001.codfw.wmnet [14:34:29] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [14:35:10] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:35:13] (PS1) Jforrester: Abstract Wikipedia: Fix API call to get page info [extensions/WikiLambda] (wmf/1.46.0-wmf.20) - https://gerrit.wikimedia.org/r/1256394 (https://phabricator.wikimedia.org/T420725) [14:35:49] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:36:49] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[14:37:05] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2001.codfw.wmnet [14:37:07] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2001.codfw.wmnet [14:37:12] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2002.codfw.wmnet [14:37:47] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker2002.codfw.wmnet [14:40:18] SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11732453 (Scott_French) [14:41:03] (CR) Neriah: "Ok, thank you for your response." [puppet] - https://gerrit.wikimedia.org/r/1253662 (https://phabricator.wikimedia.org/T419663) (owner: Neriah) [14:44:34] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:44:34] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [14:45:01] !log blake@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2002.codfw.wmnet [14:45:02] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2002.codfw.wmnet [14:45:03] !log blake@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on P{wikikube-worker[2001-2002].codfw.wmnet} and (A:wikikube-master-codfw or A:wikikube-worker-codfw) [14:45:13] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:45:38] (CR) JHathaway: [C:+1] "looks good, did folks create manual notifications, in-case, this is forgotten?" 
[puppet] - https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: Bking) [14:45:49] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [14:46:09] SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11732480 (Scott_French) Thanks, @Daria-WMDE and @WMDE-leszek. @Daria-WMDE - I don't see an Developer (LDAP) account associated with the email you provided... [14:47:05] (CR) JHathaway: [C:+1] Switch our servers to use deb.debian.org [puppet] - https://gerrit.wikimedia.org/r/1256371 (https://phabricator.wikimedia.org/T416707) (owner: Muehlenhoff) [14:47:20] SRE, Traffic: TCP FastOpen not working since at least December 2025 - https://phabricator.wikimedia.org/T415454#11732482 (BBlack) @ssingh figured out where we went wrong. At the TLS terminator level, it is enabled, but at the OS level (Linux sysctl settings), it is not. We did have it enabled at that l... [14:48:59] (PS1) Tiziano Fogli: titan/memcached: double memcached size [puppet] - https://gerrit.wikimedia.org/r/1256395 (https://phabricator.wikimedia.org/T417336) [14:49:06] SRE, SRE-Access-Requests, OKR-Work, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11732486 (Scott_French) @lerickson and @BTulli... 
[14:49:17] (PS1) Genoveva Galarza: Enable view urls in abstract.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) [14:50:49] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [14:55:58] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [14:56:15] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [14:57:30] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [14:57:31] (CR) Brouberol: [C:+1] "Looks good!" [dumps] - https://gerrit.wikimedia.org/r/1251169 (https://phabricator.wikimedia.org/T401296) (owner: WMDE-leszek) [14:57:32] (CR) Brouberol: [C:+2] Add output-dir option to specify target directory for JSON dumps [dumps] - https://gerrit.wikimedia.org/r/1251169 (https://phabricator.wikimedia.org/T401296) (owner: WMDE-leszek) [14:57:33] SRE, SRE-swift-storage: ms swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419577#11732518 (Ladsgroup) I have sped up the deletion of thumbnails, maybe that'll make a dent? 
let's see [14:58:04] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [14:58:14] (CR) JMeybohm: [C:+1] sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: Blake) [14:58:32] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [14:59:07] (CR) Blake: [C:+2] sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: Blake) [14:59:19] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [15:00:48] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [15:01:00] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/wdqs-queryhammer: apply [15:02:15] !log trueg@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/wdqs-queryhammer: apply [15:03:15] SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11732539 (Scott_French) Thanks, all. @MPostoronca-WMF - Just to confirm, given the access-reason you provided: As @thcipriani notes, while participating in backport deployments is /... 
[15:03:59] (Merged) jenkins-bot: sre.k8s: use SREBatchRunnerBase, rather than SRELBBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/1248486 (https://phabricator.wikimedia.org/T419032) (owner: Blake) [15:09:18] FIRING: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:38] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet [15:14:56] (CR) Brouberol: "You mean for certificate expiry? If so, we have monitors in place and some documentation about how to resolve a firing alert (https://wiki" [puppet] - https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: Bking) [15:15:04] SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11732617 (Scott_French) [15:16:38] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet [15:22:43] SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11732647 (Scott_French) [15:26:33] SRE, SRE-Access-Requests, OKR-Work, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11732668 (Dzahn) Technically this should have... 
[15:29:50] !log jiji@cumin1003 END (PASS) - Cookbook sre.memcached.roll-reboot-restart (exit_code=0) rolling reboot on A:memcached-eqiad [15:31:11] (CR) JHathaway: [C:+1] "ah I misunderstood, I thought you were going to put envoy in front first and fly blind" [puppet] - https://gerrit.wikimedia.org/r/1251117 (https://phabricator.wikimedia.org/T419289) (owner: Bking) [15:32:48] !log cparle@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [15:32:54] !log cparle@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [15:35:52] (CR) Dzahn: [C:+1] ats: add wmf-navigator entry [puppet] - https://gerrit.wikimedia.org/r/1255818 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth) [15:37:11] (CR) Dzahn: "oh wait, this looks good but does not exist in DNS yet" [puppet] - https://gerrit.wikimedia.org/r/1255818 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth) [15:37:33] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [15:38:46] ops-eqiad, SRE, DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T420634#11732693 (Jclark-ctr) Cleaned fiber and replaced optic on Spine [15:38:51] ops-eqiad, SRE, DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T420634#11732694 (Jclark-ctr) Open→Resolved [15:39:41] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11732701 (Jclark-ctr) a:Jclark-ctr→BTullis [15:40:02] SRE, SRE-swift-storage, Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11732705 (tappof) Blocks from January/February 2026 occupy roughly 50 TiB, as they haven’t 
been downsampled. [15:40:19] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11732707 (Jclark-ctr) @Jgreen I have finished setting these up password is set to Root / prod password. Reminder these are UEFI only No legacy option [15:40:34] ops-eqiad, SRE, DC-Ops, fundraising-tech-ops: Q3:rack/setup/install fransw100[23] - https://phabricator.wikimedia.org/T417295#11732709 (Jclark-ctr) a:BTullis→Jgreen [15:43:54] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [15:44:18] RESOLVED: JobUnavailable: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:45:38] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf2001.codfw.wmnet [15:45:47] !log dpogorzelski@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [15:46:48] (CR) Dzahn: "I am not entirely sure if this needs to point to the "rw" name, like os_reports, or "ro" like other services. If I compare to os_reports.." [dns] - https://gerrit.wikimedia.org/r/1255523 (https://phabricator.wikimedia.org/T414405) (owner: AOkoth) [15:47:06] (CR) Scott French: [C:+1] "Thanks, Raine!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1256384 (https://phabricator.wikimedia.org/T419049) (owner: Kamila Součková) [15:51:38] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2001.codfw.wmnet [15:52:56] (CR) Ssingh: "There is another option that I want to discuss and that we did in the IRC channel. 
acme-chief supports http-01 challenges, so in theory, c" [puppet] - https://gerrit.wikimedia.org/r/1242499 (https://phabricator.wikimedia.org/T419887) (owner: Cwhite) [15:54:24] (PS1) FNegri: conftool-data: move s3, x3 to new hosts [puppet] - https://gerrit.wikimedia.org/r/1256417 (https://phabricator.wikimedia.org/T409557) [15:56:59] SRE, SRE-Access-Requests, OKR-Work, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11732804 (lerickson) Hi! This was all a mixup... [15:58:10] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:58:30] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:00:43] SRE, SRE-Access-Requests, OKR-Work, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11732816 (Dzahn) The easiest thing is if we ju... [16:02:53] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf2002.codfw.wmnet [16:03:46] SRE, OKR-Work, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11732823 (Dzahn) Done! This lets you keep this open as long as you want... 
[16:07:35] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11732857 (Dzahn) For WMDE staff; the standard procedure is that after the NDA is complete they get added to 2 LDAP groups; the one called "nda" and the one... [16:08:52] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2002.codfw.wmnet [16:08:52] SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11732860 (Dzahn) For WMDE staff; the standard procedure is that after the NDA is complete they get added to 2 LDAP groups; the one called "nda" and the one... [16:09:13] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1172.eqiad.wmnet with OS bullseye [16:09:18] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:21] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11732873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin... 
[16:10:30] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:10:42] (PS2) Btullis: Route dse-k8s API blackbox checks to team-data-platform [puppet] - https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) [16:11:10] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:14:49] (PS1) Btullis: Update the partman recipe for an-worker1172 [puppet] - https://gerrit.wikimedia.org/r/1256426 (https://phabricator.wikimedia.org/T420416) [16:15:55] (PS2) Btullis: Update the partman recipe for an-worker1172 [puppet] - https://gerrit.wikimedia.org/r/1256426 (https://phabricator.wikimedia.org/T420416) [16:18:21] (CR) Btullis: [C:+2] Update the partman recipe for an-worker1172 [puppet] - https://gerrit.wikimedia.org/r/1256426 (https://phabricator.wikimedia.org/T420416) (owner: Btullis) [16:22:25] SRE, OKR-Work, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency / Add lerickson and trueg to analytics-wikidata-users - https://phabricator.wikimedia.org/T418723#11732920 (lerickson) That sounds like the perfect solution. Thank you! [16:22:34] SRE, SRE-swift-storage, Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11732921 (MatthewVernon) Right, then the existing thanos-swift infrastructure has no-where near the SSD capacity to support that use case. To do this, we'd need around 200 TB... 
[16:22:50] SRE, SRE-Access-Requests: Requesting access to uperset for alice.moutinho - https://phabricator.wikimedia.org/T420751 (Alice.moutinho) NEW [16:23:52] (PS1) Jforrester: Wikifunctions: Switch cache from mcrouter-wikifunctions to basic [mediawiki-config] - https://gerrit.wikimedia.org/r/1256432 (https://phabricator.wikimedia.org/T419666) [16:23:54] (PS1) Jforrester: [wikifunctions] Drop m.wikifunctions.org from lists, we've not used it for years [mediawiki-config] - https://gerrit.wikimedia.org/r/1256433 [16:24:56] SRE, SRE-Access-Requests: Requesting access to Superset for elishacohenwmde - https://phabricator.wikimedia.org/T404359#11732934 (Alice.moutinho) [16:29:18] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:32:08] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5021.eqsin.wmnet [reason: trixie reimaging] [16:32:11] SRE, SRE-Access-Requests: Requesting access to uperset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11732995 (Scott_French) @WMDE-leszek - Could you please provide approval? Thanks! @KFrancis - Could you please initiate the NDA process? Thanks! @Alice.moutinho - A couple of items: 1.... 
[16:32:53] SRE, SRE-Access-Requests: Requesting access to uperset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11732998 (Scott_French) [16:33:05] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5021.eqsin.wmnet with OS trixie [16:34:18] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:35:52] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11733040 (Scott_French) Thanks, @Dzahn - that's a good point, as the process around both is awkwardly disjoint at the moment. @AnnieKim_WMDE - Let me know... [16:38:16] SRE, SRE-Access-Requests: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11733050 (Pppery) [16:45:59] SRE, SRE-swift-storage, Observability-Metrics: thanos swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419713#11733072 (tappof) We will further discuss internally the option you suggested and the available ways to implement our idea with Thanos next Monday afternoon (European time) an... [16:46:27] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [16:46:35] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11733075 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@c... 
[16:52:36] (PS6) Arnaudb: gerrit: Wire mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1254940 (https://phabricator.wikimedia.org/T420189) [16:52:41] (PS1) Arnaudb: gerrit: Tune mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1256445 (https://phabricator.wikimedia.org/T420189) [16:52:45] (PS1) Arnaudb: gerrit: Tune mpm_event configuration to allow connection reuse on CDN [puppet] - https://gerrit.wikimedia.org/r/1256446 (https://phabricator.wikimedia.org/T420189) [16:52:57] (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256446 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb) [16:53:00] (CR) Arnaudb: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256445 (https://phabricator.wikimedia.org/T420189) (owner: Arnaudb) [16:54:14] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1172.eqiad.wmnet with OS bullseye [16:54:22] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11733105 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin... [16:54:32] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-worker1172.eqiad.wmnet with OS bullseye [16:54:43] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2026-03-06 - 2026-03-27), Essential-Work: hw troubleshooting: Comm Error: Backplane 0 on an-worker1172.eqiad.wmnet - https://phabricator.wikimedia.org/T420416#11733106 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@c... 
[17:00:23] (PS3) A smart kitten: phabricator: Set a custom default-mail-address for the test instance [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) [17:02:01] (CR) CI reject: [V:-1] phabricator: Set a custom default-mail-address for the test instance [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten) [17:08:27] btullis@cumin1003 reimage (PID 972881) is awaiting input [17:08:35] (PS1) Tiziano Fogli: prometheus: adjust join in PrometheusZombieSeriesDetected rule [alerts] - https://gerrit.wikimedia.org/r/1256451 (https://phabricator.wikimedia.org/T415317) [17:13:42] ops-codfw, DC-Ops: Power Supply - Status - issue on cirrussearch2080:9290 - https://phabricator.wikimedia.org/T420760 (phaultfinder) NEW [17:13:44] ops-codfw, DC-Ops: Power Supply - Status - issue on logstash2036:9290 - https://phabricator.wikimedia.org/T420761 (phaultfinder) NEW [17:13:45] ops-codfw, DC-Ops: Power Supply - Status - issue on cirrussearch2079:9290 - https://phabricator.wikimedia.org/T420762 (phaultfinder) NEW [17:14:09] (PS4) CDanis: phabricator: Set a custom default-mail-address for the test instance [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten) [17:16:12] (CR) CI reject: [V:-1] phabricator: Set a custom default-mail-address for the test instance [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten) [17:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:20:42] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users and SQL Lab 
for AnnieKim_WMDE - https://phabricator.wikimedia.org/T420500#11733206 (KFrancis) Hi all, the NDA has been signed. Thanks! [17:21:07] (PS5) A smart kitten: phabricator: Set a custom default-mail-address for the test instance [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) [17:25:08] (CR) A smart kitten: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten) [17:26:34] PROBLEM - jenkins_service_running on contint1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [17:32:49] (CR) A smart kitten: "Please review carefully -- judging by [the PCC outputs](https://puppet-compiler.wmflabs.org/output/1256301/6159/) [0] this _seems_ like it" [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten) [17:33:42] SRE, SRE-Access-Requests: Requesting access to data and Superset for Daria-WMDE (Daria Ammalainen (WMDE)) - https://phabricator.wikimedia.org/T420716#11733268 (KFrancis) Hi all, I've sent the NDA for signatures. I'll confirm when it's complete. Thanks! [17:39:48] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2014.codfw.wmnet [17:40:03] SRE, SRE-Access-Requests: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11733314 (KFrancis) Hi all, I have sent the NDA out for signatures. I'll confirm when it's complete. Thanks! 
[17:40:41] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on contint1003.wikimedia.org with reason: jenkins on java21 [17:48:57] (CR) Btullis: Route dse-k8s API blackbox checks to team-data-platform (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1256287 (https://phabricator.wikimedia.org/T420264) (owner: Btullis) [17:50:27] (CR) Ssingh: "test-run cookbook looks good, one comment for discussion:" [cookbooks] - https://gerrit.wikimedia.org/r/1254997 (owner: BCornwall) [17:51:26] (PS5) BCornwall: Add sre.cdn.roll-restart-reboot-tcp-proxy [cookbooks] - https://gerrit.wikimedia.org/r/1254997 [17:51:34] (CR) BCornwall: Add sre.cdn.roll-restart-reboot-tcp-proxy (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1254997 (owner: BCornwall) [17:51:44] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host lvs2014.codfw.wmnet [17:52:08] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:52:58] (CR) Ssingh: [C:+1] "Thanks for writing a cookbook right now vs doing this manually." [cookbooks] - https://gerrit.wikimedia.org/r/1254997 (owner: BCornwall) [17:53:43] (CR) Ssingh: [C:+1] "I think since we own this in Traffic, we can simply start the cookbook and forget about it vs doing the manual depool. 
So +1 for this to r" [cookbooks] - https://gerrit.wikimedia.org/r/1254997 (owner: BCornwall) [17:54:08] (CR) BCornwall: [C:+2] Add sre.cdn.roll-restart-reboot-tcp-proxy [cookbooks] - https://gerrit.wikimedia.org/r/1254997 (owner: BCornwall) [17:54:09] (CR) BCornwall: [V:+2 C:+2] Add sre.cdn.roll-restart-reboot-tcp-proxy [cookbooks] - https://gerrit.wikimedia.org/r/1254997 (owner: BCornwall) [17:54:29] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5021.eqsin.wmnet with OS trixie [17:59:00] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5021.eqsin.wmnet with OS trixie [18:03:25] (CR) Majavah: [C:-1] phabricator: Set a custom default-mail-address for the test instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten) [18:04:05] (CR) Majavah: [V:+1 C:-1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten) [18:10:19] PROBLEM - MariaDB Replica IO: s7 #page on db1253 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:10:37] PROBLEM - MariaDB Replica SQL: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:10:37] !ack [18:10:38] Could not ack the alert. Please check the parameters. [18:10:38] PROBLEM - MariaDB Replica Lag: s7 #page on db1253 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:11:35] o/ [18:11:43] o/ [18:11:54] expired downtime? 
[18:12:01] not sure [18:12:08] https://phabricator.wikimedia.org/T420041 [18:12:36] swfrench-wmf: I'm not sure who is handing out the prizes, but you win one! [18:12:49] two [18:12:55] lol [18:13:18] jhathaway: should I downtime it or are you? [18:13:35] I got it, I'll go for 14 days [18:14:39] !log jhathaway@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on db1253.eqiad.wmnet with reason: T420041 [18:14:43] T420041: db1253 depooled following host crash - https://phabricator.wikimedia.org/T420041 [18:14:45] thanks! [18:14:50] thanks y'all [18:15:10] * sukhe hopes all incidents are easy as this one [18:15:15] ^^^ [18:16:07] SRE, OKR-Work, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency - https://phabricator.wikimedia.org/T418723#11733538 (lerickson) [18:16:51] !log brett@cumin2002 START - Cookbook sre.cdn.roll-restart-reboot-tcp-proxy rolling reboot on A:tcpproxy and A:tcpproxy [18:18:34] RECOVERY - jenkins_service_running on contint1003 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [18:20:52] (PS1) Alex.sanford: Reduce reauth timeout for editing site JS to 10 minutes [mediawiki-config] - https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) [18:27:28] (PS6) A smart kitten: phabricator: Set a custom default-mail-address for the test instance [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) [18:28:25] !log cdobbins@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs2013.codfw.wmnet with reason: reboot [18:28:30] (CR) A smart kitten: phabricator: Set a custom default-mail-address for the test instance (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) (owner: A smart kitten) [18:39:43] 
!log cdobbins@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [18:40:36] (CR) BPirkle: [C:+1] "Code looks fine, had one optional suggestion on the comment." [mediawiki-config] - https://gerrit.wikimedia.org/r/1250113 (owner: Jforrester) [18:43:16] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [18:45:55] (PS1) Dzahn: ci::jenkins: add dependency of jenkins service on firewall [puppet] - https://gerrit.wikimedia.org/r/1256485 (https://phabricator.wikimedia.org/T418521) [18:49:18] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1256485/8311/contint1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1256485 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn) [18:52:00] (CR) Catrope: [C:-1] Reduce reauth timeout for editing site JS to 10 minutes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) (owner: Alex.sanford) [18:52:09] !log cdobbins@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs2013.codfw.wmnet with reason: reboot [18:56:17] (PS2) Alex.sanford: Reduce reauth timeout for editing site JS to 10 minutes [mediawiki-config] - https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) [18:56:50] (CR) Alex.sanford: Reduce reauth timeout for editing site JS to 10 minutes (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) (owner: Alex.sanford) [18:58:05] (CR) Catrope: [C:+1] Reduce reauth timeout for editing site JS to 10 minutes [mediawiki-config] - https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) (owner: Alex.sanford) [19:01:04] (CR) ScheduleDeploymentBot: "Scheduled for deployment in 
the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1256472 (https://phabricator.wikimedia.org/T419605) (owner: Alex.sanford) [19:02:29] SRE, SRE-Access-Requests: Requesting access to superset for alice.moutinho - https://phabricator.wikimedia.org/T420751#11733653 (WMDE-leszek) hello, I approve this request on WMDE's end. And yes, it is about "level 1" analytics_privatedata_users access. thank you! [19:03:39] SRE, LDAP-Access-Requests: Grant Access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11733659 (WMDE-leszek) [19:04:23] SRE, LDAP-Access-Requests: Grant Access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11733673 (WMDE-leszek) I've added us WMDE contact people in form of phabricator user links. Let me know if I should explicitly state email address or legal names there. [19:09:57] SRE, LDAP-Access-Requests: Grant Access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11733704 (Scott_French) @WMDE-leszek - Great, thank you! No, this is fine - I can convert them to email addresses via LDAP anyway. I'll get this rolling shortly. 
[19:12:04] (PS1) Scott French: admin: Record LDAP access for sarmbruster [puppet] - https://gerrit.wikimedia.org/r/1256490 (https://phabricator.wikimedia.org/T420410) [19:14:58] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5021.eqsin.wmnet with OS trixie [19:16:08] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp5021.eqsin.wmnet [reason: trixie reimaging] [19:16:17] !log cdobbins@cumin2002 conftool action : set/pooled=no; selector: name=cp5023.eqsin.wmnet [reason: trixie reimaging] [19:16:45] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5023.eqsin.wmnet with OS trixie [19:16:57] (CR) Ssingh: [C:+1] "verified name, contract end date and person" [puppet] - https://gerrit.wikimedia.org/r/1256490 (https://phabricator.wikimedia.org/T420410) (owner: Scott French) [19:21:39] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11733750 (Scott_French) @katiamusiolekwmde - A couple of items: 1. I don't see a Developer (LDAP) account associated with the email you provided. If you've not created... [19:21:40] SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11733752 (Dreamy_Jazz) Just to give some context There are database tables that are only on the production databases (like #checkuser tables and extension1 cluster etc) and Maxim has... [19:21:40] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-restart-reboot-tcp-proxy (exit_code=0) rolling reboot on A:tcpproxy and A:tcpproxy [19:22:09] (CR) RLazarus: "Please also add httpbb tests, so we can make sure this is working as intended." 
[puppet] - https://gerrit.wikimedia.org/r/1256396 (https://phabricator.wikimedia.org/T420666) (owner: Genoveva Galarza) [19:22:45] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11733760 (Scott_French) [19:30:44] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2013.codfw.wmnet [19:33:41] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2013.codfw.wmnet [19:33:59] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service is not active. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [19:34:17] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:34:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [19:34:44] SRE, SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11733809 (Scott_French) Thanks for the background context, @Dreamy_Jazz! Relative the number of folks with production shell access, it's still a rather unusual access justification t... [19:35:44] SRE, OKR-Work, Wikidata Platform Team (Sprint 03 (2026/03/03)): Materialize analytics queries to improve superset dashboard latency - https://phabricator.wikimedia.org/T418723#11733812 (lerickson) Project update: - We have a "wikidata" database - The user (and owner of the relevant HDFS directo... 
[19:36:17] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:37:56] (03CR) 10Scott French: [C:03+2] admin: Record LDAP access for sarmbruster [puppet] - 10https://gerrit.wikimedia.org/r/1256490 (https://phabricator.wikimedia.org/T420410) (owner: 10Scott French) [19:38:21] PROBLEM - Docker registry HTTPS interface on registry2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [19:38:29] PROBLEM - Docker registry HTTPS interface on registry2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [19:38:43] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.codfw.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9243): Max retries exceeded with url: /_cluster/health (Caused by ConnectTimeoutError(urllib3.connection.HTTPSConnection object at 0x7f5baf5ae8d0, Connection to search.svc.codfw.wmnet timed out. [19:38:43] t timeout=4))) https://wikitech.wikimedia.org/wiki/Search%23Administration [19:38:44] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.codfw.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by ConnectTimeoutError(urllib3.connection.HTTPSConnection object at 0x7f5980d0aa90, Connection to search.svc.codfw.wmnet timed out. 
[19:38:45] t timeout=4))) https://wikitech.wikimedia.org/wiki/Search%23Administration [19:38:46] PROBLEM - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.codfw.wmnet:9643/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.codfw.wmnet, port=9643): Max retries exceeded with url: /_cluster/health (Caused by ConnectTimeoutError(urllib3.connection.HTTPSConnection object at 0x7f4ed39cec10, Connection to search.svc.codfw.wmnet timed out. [19:38:47] t timeout=4))) https://wikitech.wikimedia.org/wiki/Search%23Administration [19:38:57] FIRING: [10x] ProbeDown: Service apus:443 has failed probes (http_apus_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:39:41] PROBLEM - Check unit status of ipip-multiqueue-optimizer on lvs2013 is CRITICAL: CRITICAL: Status of the systemd unit ipip-multiqueue-optimizer https://wikitech.wikimedia.org/wiki/LVS%23IPIP_encapsulation_experiments [19:39:50] o/ [19:40:04] codfw LVS? [19:40:21] feels that way yeah [19:40:22] Yeah, we rebooted it [19:40:26] o/ [19:40:35] seems ... 
load bearing [19:40:55] swfrench-wmf: your on a roll today [19:41:07] lol [19:41:12] stopping pybal on it, cc cjd91 [19:41:13] RECOVERY - Docker registry HTTPS interface on registry2005 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 1.001 second response time https://wikitech.wikimedia.org/wiki/Docker [19:41:19] RECOVERY - Docker registry HTTPS interface on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 3745 bytes in 0.245 second response time https://wikitech.wikimedia.org/wiki/Docker [19:41:35] RECOVERY - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-psi-codfw: cluster_name: production-search-psi-codfw, status: green, timed_out: False, number_of_nodes: 28, number_of_data_nodes: 28, discovered_master: True, active_primary_shards: 1726, active_shards: 5177, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, n [19:41:35] _pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:41:36] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-omega-codfw: cluster_name: production-search-omega-codfw, status: green, timed_out: False, number_of_nodes: 27, number_of_data_nodes: 27, discovered_master: True, active_primary_shards: 1729, active_shards: 5182, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: [19:41:37] r_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:41:38] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: cluster_name: production-search-codfw, status: 
green, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1299, active_shards: 3859, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of [19:41:39] _tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:41:56] (03PS1) 10Scardenasmolinar: PersonalDashboard: Add config for Active Discussions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1256498 (https://phabricator.wikimedia.org/T420785) [19:42:26] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [19:43:16] looks like everything is coming back to life [19:43:22] yeah [19:43:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:43:41] never mind [19:43:46] spoke too soon [19:43:59] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [19:44:08] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:44:18] forgot to disable puppet too, sorry [19:44:18] FIRING: [3x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:24] FIRING: [2x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:44:30] RESOLVED: [10x] ProbeDown: Service apus:443 has failed probes (http_apus_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:46:17] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:46:40] ^fine to ignore [19:47:26] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [19:47:51] so, is lvs2013 is "bad" somehow? (i.e., so whenever it's primary it doesn't work0 [19:48:04] Yeah, ipip-multiqueue-optimizer.service was unhappy after a reboot [19:48:46] * swfrench-wmf thumbs up [19:48:52] thanks, brett! 
[19:48:55] Sorry for the trouble [19:49:12] RESOLVED: [3x] SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:49:18] FIRING: [6x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:49:41] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [19:49:55] downtiming [19:50:05] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs2013.codfw.wmnet with reason: debugging ipip [19:50:18] * swfrench-wmf goes back to hand-editing LDAP groups [19:50:46] (03PS1) 10Dzahn: jenkins: ensure /srv/jenkins/builds exists [puppet] - 10https://gerrit.wikimedia.org/r/1256508 (https://phabricator.wikimedia.org/T418521) [19:50:49] lol [19:51:19] [ldap-maint1001:~] $ sudo modify-ldap-group wmde [19:52:20] (03PS7) 10A smart kitten: phabricator: Set a custom default-mail-address for the test instance [puppet] - 10https://gerrit.wikimedia.org/r/1256301 (https://phabricator.wikimedia.org/T388022) [19:52:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:53:08] hello `Action? 
[yYqQvVebB*rsf+?]` my old friend [19:53:21] best prompt ever [19:53:27] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for katiamusiolek - https://phabricator.wikimedia.org/T420459#11733847 (10KFrancis) Hi all, I have sent the NDA out for signatures. I'll confirm when it's complete. Thanks! [19:53:32] RESOLVED: [6x] SLOBudgetBurn: Search update lag is below 95% target in codfw - https://alerts.wikimedia.org/?q=alertname%3DSLOBudgetBurn [19:54:23] it's so good [19:54:25] swfrench-wmf: [lOL] [19:54:36] the delicate placement of the asterisk [19:54:38] chef's kiss [19:54:59] the question mark at the end as if you're being asked a question (you're not, the question mark is *also* one of your choices) [19:55:02] it's a work of art [19:55:29] :) [19:58:18] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to WMDE LDAP group for Sarmbruster - https://phabricator.wikimedia.org/T420410#11733872 (10Scott_French) 05Open→03Resolved a:03Scott_French Alright, this is now done - sarmbruster has been added to both `nda` and `wmde`. Thanks, all! [20:04:04] 10SRE-swift-storage, 06Commons, 07Wikimedia-production-error: uploadstash-exception: Could not store upload in the stash while uploading PDF file - https://phabricator.wikimedia.org/T420786#11733882 (10A_smart_kitten) [20:04:10] 06SRE, 10SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11733883 (10Scott_French) [20:05:47] 06SRE, 10SRE-Access-Requests: Requesting access to production for Mpostoronca-wmf - https://phabricator.wikimedia.org/T420458#11733886 (10MPostoronca-WMF) I confirm I'll consult with the DBA if I have to run something else than EXPLAINs [20:06:51] 06SRE, 10SRE-Access-Requests: Requesting access to SQL Lab for cohi - https://phabricator.wikimedia.org/T420578#11733891 (10Scott_French) Thanks, @Hany.elmokadem! 
@CorinnaHillebrand_WMDE - Since you already have an NDA on file from T234429, and based on T420578#11730191 I suspect the issue here is a missing `... [20:08:44] (PS2) Dzahn: jenkins: ensure /srv/jenkins/builds exists [puppet] - https://gerrit.wikimedia.org/r/1256508 (https://phabricator.wikimedia.org/T418521) [20:11:06] (CR) Dzahn: "see the code:" [puppet] - https://gerrit.wikimedia.org/r/1256508 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn) [20:11:25] (PS2) Dzahn: jenkins: remove httpd profile from role [puppet] - https://gerrit.wikimedia.org/r/1255139 (https://phabricator.wikimedia.org/T418521) [20:11:33] (PS3) Dzahn: jenkins: remove httpd profile from role [puppet] - https://gerrit.wikimedia.org/r/1255139 (https://phabricator.wikimedia.org/T418521) [20:11:38] (CR) CI reject: [V:-1] jenkins: remove httpd profile from role [puppet] - https://gerrit.wikimedia.org/r/1255139 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn) [20:13:03] (CR) Bearloga: [C:+1] Revert "growhbook: allow WMDE engineers to self-enroll" [deployment-charts] - https://gerrit.wikimedia.org/r/1247916 (owner: Brouberol) [20:13:34] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1255139 (https://phabricator.wikimedia.org/T418521) (owner: Dzahn) [20:18:07] (PS1) Scott French: admin: Add mpostoronca shell access and deployment membership [puppet] - https://gerrit.wikimedia.org/r/1256520 (https://phabricator.wikimedia.org/T420458) [20:18:39] (PS3) Jforrester: Move testwiki-only Attribution REST API definition to IS [mediawiki-config] - https://gerrit.wikimedia.org/r/1250113 [20:18:39] (CR) Jforrester: Move testwiki-only Attribution REST API definition to IS (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1250113 (owner: Jforrester) [20:20:00] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production for Mpostoronca-wmf - 
https://phabricator.wikimedia.org/T420458#11733922 (10Scott_French) Great, thank you @MPostoronca-WMF - I'll [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Deployment_Groups | also... [20:20:44] (03CR) 10Scott French: "SSH public key verified out of band (Slack)." [puppet] - 10https://gerrit.wikimedia.org/r/1256520 (https://phabricator.wikimedia.org/T420458) (owner: 10Scott French) [20:20:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1255847 (owner: 10Catrope) [20:23:49] !log sukhe@cumin1003 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on doh3005.wikimedia.org with reason: depooled host [20:24:27] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on doh3005.wikimedia.org with reason: depooled host [20:24:49] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on doh3006.wikimedia.org with reason: depooled host [20:29:54] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:33:39] (03CR) 10Brouberol: [C:03+2] Revert "growhbook: allow WMDE engineers to self-enroll" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1247916 (owner: 10Brouberol) [20:38:07] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp5023.eqsin.wmnet with OS trixie [20:38:41] (03CR) 10Dzahn: [C:03+2] jenkins: remove httpd profile from role [puppet] - 10https://gerrit.wikimedia.org/r/1255139 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [20:40:52] !log 
brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2013.codfw.wmnet [20:43:50] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2013.codfw.wmnet [20:44:17] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [20:44:31] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [20:45:08] !log contint1003/2003 apt remove --purge apache2* ; apt remove --purge php* | T418521 [20:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:13] T418521: setup 2 contint machines for jenkins - https://phabricator.wikimedia.org/T418521 [20:45:59] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is CRITICAL: CRITICAL: Service pybal.service is not active. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [20:46:00] !log cdobbins@cumin2002 START - Cookbook sre.hosts.reimage for host cp5023.eqsin.wmnet with OS trixie [20:53:19] (03CR) 10Hashar: jenkins: ensure /srv/jenkins/builds exists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1256508 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [20:57:58] (03PS3) 10Dzahn: jenkins: ensure /srv/jenkins/builds exists [puppet] - 10https://gerrit.wikimedia.org/r/1256508 (https://phabricator.wikimedia.org/T418521) [20:59:13] (03CR) 10Dzahn: jenkins: ensure /srv/jenkins/builds exists (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1256508 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [20:59:50] (03CR) 10Hashar: [C:03+1] "🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1256508 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:01:25] (03CR) 10Dzahn: [C:03+2] jenkins: ensure /srv/jenkins/builds exists [puppet] - 10https://gerrit.wikimedia.org/r/1256508 (https://phabricator.wikimedia.org/T418521) (owner: 10Dzahn) [21:04:51] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs2013.codfw.wmnet with reason: debugging ipip [21:18:41] (03PS3) 10SBassett: Allow-list some additional domains to the currently enforcing CSP [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu) [21:19:22] (03CR) 10SBassett: [C:04-1] "Hold for infra deployment window on 2026-03-23" [puppet] - 10https://gerrit.wikimedia.org/r/1255066 (https://phabricator.wikimedia.org/T420539) (owner: 10Sportzpikachu) [21:19:54] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:47] !log cdobbins@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage [21:25:06] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage [21:35:16] (03PS3) 10Pppery: Add warning of impending Etherpad deletion [puppet] - 10https://gerrit.wikimedia.org/r/1256544 (https://phabricator.wikimedia.org/T420793) [21:37:48] (03PS4) 10Pppery: Add warning of impending Etherpad deletion [puppet] - 10https://gerrit.wikimedia.org/r/1256544 (https://phabricator.wikimedia.org/T420793) [21:55:48] !log Upgrading CI Jenkins T420477 [21:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:23] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5023.eqsin.wmnet with OS trixie [21:57:31] !log cdobbins@cumin2002 conftool action : set/pooled=yes; selector: name=cp5023.eqsin.wmnet [reason: trixie reimaging] [22:02:19] (03PS4) 10Jdlrobson: Limit and standardize thumbnail options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) [22:02:19] (03CR) 10Jdlrobson: "> It also seems like itwikiquote should be supported in their decision not to have any thumbnail size preferenece at all" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) (owner: 10Jdlrobson) [22:03:01] (03PS5) 10Jdlrobson: Limit and standardize thumbnail options [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1251196 (https://phabricator.wikimedia.org/T376152) [22:11:55] (03PS1) 10BCornwall: lvs2013: Override txqlen for eno12399np0 [puppet] - 10https://gerrit.wikimedia.org/r/1256556 (https://phabricator.wikimedia.org/T420789) [22:13:57] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8313/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1256556 (https://phabricator.wikimedia.org/T420789) (owner: 10BCornwall) [22:16:03] (03CR) 10Vgutierrez: [C:03+1] "de-configured interfaces don't have a cable connected, so effectively this is a NOOP in terms of traffic:" [puppet] - 10https://gerrit.wikimedia.org/r/1256556 (https://phabricator.wikimedia.org/T420789) (owner: 10BCornwall) [22:16:25] (03CR) 10BCornwall: [V:03+1 C:03+2] lvs2013: Override txqlen for eno12399np0 [puppet] - 10https://gerrit.wikimedia.org/r/1256556 (https://phabricator.wikimedia.org/T420789) (owner: 10BCornwall) [22:18:27] (03CR) 10BCornwall: [V:03+1 C:03+2] "The commit summary isn't clear enough: This fixes an issue of configuring unlisted NICs by overriding `interface_tweaks`" [puppet] - 10https://gerrit.wikimedia.org/r/1256556 (https://phabricator.wikimedia.org/T420789) (owner: 10BCornwall) [22:19:17] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [22:19:41] RECOVERY - Check unit status of ipip-multiqueue-optimizer on lvs2013 is OK: OK: Status of the systemd unit ipip-multiqueue-optimizer https://wikitech.wikimedia.org/wiki/LVS%23IPIP_encapsulation_experiments [22:19:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:27:11] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs2013.codfw.wmnet [22:33:17] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2013 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [22:33:23] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 81 connections established with conf2004.codfw.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [22:34:08] !log Started pybal on lvs2013 [22:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:44] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host lvs2013.codfw.wmnet [22:34:53] ^successful, just ml-staging failing [22:59:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:19:58] 06SRE, 10SRE-swift-storage: ms swift capacity for FY 26/27 - https://phabricator.wikimedia.org/T419577#11734401 (10Ladsgroup) And another thing, I'm planning to shut down transcoding of videos that are not used anywhere in the projects, so that should reduce the size of transcode bucket by roughly 90%. I don't... [23:30:52] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs2013.codfw.wmnet [23:30:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2013.codfw.wmnet [23:31:21] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet, ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:31:27] ^known, cc dpogorzelski