[00:02:20] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:38:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1119589
[00:38:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1119589 (owner: 10TrainBranchBot)
[00:39:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551389 (10phaultfinder)
[00:48:50] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1119589 (owner: 10TrainBranchBot)
[00:58:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:59:31] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:00:02] (03Abandoned) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119242 (owner: 10TrainBranchBot)
[01:02:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[01:05:06] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 259, active_shards: 449, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 69, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_
[01:05:06] in_queue_millis: 0, active_shards_percent_as_number: 86.34615384615385 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:09:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119594
[01:09:02] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119594 (owner: 10TrainBranchBot)
[01:24:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:29:33] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1119594 (owner: 10TrainBranchBot)
[01:32:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551437 (10phaultfinder)
[01:34:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:46:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/73f591d58541387705fd1d183c307181331ba3441006a56c1782c186bb5d4095/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:47:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:06:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551478 (10phaultfinder)
[02:25:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:33:09] (03CR) 10Volans: k8s.wipe-cluster: Improvements for k8s 1.31 upgrade (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1115380 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm)
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:06] (03CR) 10Volans: "Which line length did you pass to black? This repo uses 120 as line length and from the changes black applied it looks like it split lines" [cookbooks] - 10https://gerrit.wikimedia.org/r/1118098 (owner: 10Federico Ceratto)
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:07:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:11:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 861ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:12:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[04:16:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 861ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:34:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551533 (10phaultfinder)
[04:47:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:55:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551535 (10phaultfinder)
[05:02:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:05:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551556 (10phaultfinder)
[05:34:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551557 (10phaultfinder)
[05:47:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 837.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:40:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 868.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250214T0700)
[07:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10551618 (10phaultfinder)
[07:43:34] (03PS1) 10Arnaudb: gitlab-runner: bumping to 17.7 [puppet] - 10https://gerrit.wikimedia.org/r/1119609 (https://phabricator.wikimedia.org/T386297)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250214T0800)
[08:55:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 1.389% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:57:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:00:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:02:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:02:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:02:54] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:04:31] FIRING: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:05:42] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:05:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:05:56] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:07:08] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:07:14] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.176 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:07:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:07:20] RESOLVED: ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:07:50] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:07:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:08:06] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:08:14] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:10:20] !incidents
[09:10:21] 5673 (UNACKED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[09:10:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:10:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:12:16] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:13:14] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.185 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:13:36] PROBLEM - Swift https frontend on ms-fe2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:13:38] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:14:08] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 2.050 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:14:16] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 2.577 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:14:16] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:14:31] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:14:50] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:15:29] around to help if needed claime just saw the page
[09:16:08] PROBLEM - Swift https frontend on ms-fe2014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.169 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:16:14] PROBLEM - Swift https frontend on ms-fe2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.174 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:16:16] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 508 bytes in 8.464 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:17:10] RECOVERY - Swift https frontend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 3.558 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:17:14] RECOVERY - Swift https frontend on ms-fe2011 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.969 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:17:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/canary (k8s) 2.222s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:17:20] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:17:34] RECOVERY - Swift https frontend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:17:42] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:17:50] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:17:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:17:57] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:18:06] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:19:31] FIRING: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:48] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:20:44] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:20:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[09:21:44] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:22:16] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/canary (k8s) 2.222s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:22:16] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[09:22:20] RESOLVED: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:22:42] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:23:06] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:27:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:27:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:27:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:30:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 16.67% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:30:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[09:45:54] !incidents
[09:45:55] 5674 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[09:45:55] 5673 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[09:47:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:56:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:01:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:01:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 5.556% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:05:42] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:06:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:06:50] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:07:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:08:44] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:11:26] PROBLEM - BGP status on cr2-eqsin is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:12:20] FIRING: [2x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:12:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[10:13:50] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.130 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:13:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:14:19] Here we go again
[10:14:46] PROBLEM - BGP status on cr2-eqsin is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 323. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:14:48] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.131 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:15:42] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:15:50] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:17:20] FIRING: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:17:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[10:18:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:19:31] FIRING: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:20:42] FIRING: [2x] JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:20:50] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:22:20] RESOLVED: [3x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:25:42] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:26:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:26:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:31:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 15.28% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:36:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:41:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 18.06% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:42:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:46:30] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:50:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:52:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 23.61% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:55:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:59:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:02:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 12.5% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:02:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:07:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 8.333% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:09:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:11:15] !incidents
[11:11:16] 5675 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[11:11:16] 5674 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[11:11:16] 5673 (RESOLVED) ProbeDown sre (10.2.1.27 ip4 swift-https:443 probes/service http_swift-https_ip4 codfw)
[11:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:20:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:25:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:27:05] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#10552135 (10Urbanecm) >>! In T351202...
[12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250214T0800)
[12:00:05] jelto, arnoldokoth, and mutante: May I have your attention please! GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250214T1200)
[12:57:20] FIRING: ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:59:31] RESOLVED: ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:15:02] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1119609 (https://phabricator.wikimedia.org/T386297) (owner: 10Arnaudb)
[13:15:11] (03CR) 10Arnaudb: [C:03+2] gitlab-runner: bumping to 17.7 [puppet] - 10https://gerrit.wikimedia.org/r/1119609 (https://phabricator.wikimedia.org/T386297) (owner: 10Arnaudb)
[13:20:34] (03CR) 10Ilias Sarantopoulos: [C:03+1] knserve-inference: add seccompProfile to the pod security context [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117939 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey)
[13:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10552359 (10phaultfinder)
[13:35:14] (03CR) 10Jforrester: Footer: Wikimedia icon should collapse at lower resolutions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson)
[13:37:29] (03CR) 10Michael Große: beta: A/B test setup for surfacing structured tasks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) (owner: 10Sergio Gimeno)
[13:46:48] (03CR) 10Michael Große: [C:03+1] [Growth] enwiki: Enable mentorship for 100% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119204 (https://phabricator.wikimedia.org/T384505) (owner: 10Urbanecm)
[13:47:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:40] (03PS1) 10Andrew Bogott: Openstack codfw1dev: enforce_policy_scope: true [puppet] - 10https://gerrit.wikimedia.org/r/1119715 (https://phabricator.wikimedia.org/T330759)
[13:56:49] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1119715 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[14:12:10] (03CR) 10Andrew Bogott: [C:03+2] Openstack codfw1dev: enforce_policy_scope: true [puppet] - 10https://gerrit.wikimedia.org/r/1119715 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott)
[14:18:37] !log arnaudb@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab
[14:18:37] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=93) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab
[14:19:10] !log arnaudb@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab
[14:23:28] !log arnaudb@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade gitlab
[14:24:25] arnaudb@cumin1002 arnaudb: The backup on gitlab1003 is complete, ready to proceed with upgrade.
[14:27:07] (03PS1) 10Jelto: aptrepo: update gitlab-runner Suite to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1119718 (https://phabricator.wikimedia.org/T386297)
[14:30:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade gitlab
[14:31:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade gitlab
[14:32:47] !log arnaudb@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade gitlab
[14:35:28] (03CR) 10Ladsgroup: Footer: Wikimedia icon should collapse at lower resolutions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson)
[14:36:10] (03PS1) 10Gerrit maintenance bot: Add syl to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1119722 (https://phabricator.wikimedia.org/T386441)
[14:45:57] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:46:02] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:46:11] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[14:57:35] 06SRE, 07LDAP: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472 (10Urbanecm) 03NEW
[15:01:44] (03Abandoned) 10Andrew Bogott: nova-compute: update live_migration_uri to use private cloud network [puppet] - 10https://gerrit.wikimedia.org/r/1115461 (https://phabricator.wikimedia.org/T355145) (owner: 10Andrew Bogott)
[15:06:24] 06SRE, 10Bitu, 06Infrastructure-Foundations, 07LDAP: jonkolbert is unable to reset their LDAP password - https://phabricator.wikimedia.org/T386473 (10Urbanecm) 03NEW
[15:06:54] 06SRE, 10Bitu, 06Infrastructure-Foundations, 07LDAP: jonkolbert is unable to reset their LDAP password - https://phabricator.wikimedia.org/T386473#10552673 (10Urbanecm) I can vouch for `jonkolbert`'s identity, he is sitting next to me as I'm filling this ticket.
[15:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:18:43] 06SRE, 07LDAP: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10552707 (10Urbanecm)
[15:29:05] 06SRE, 10Bitu, 06Infrastructure-Foundations, 07LDAP: jonkolbert is unable to reset their LDAP password - https://phabricator.wikimedia.org/T386473#10552754 (10Urbanecm)
[15:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10552757 (10phaultfinder)
[15:30:45] 06SRE, 10Bitu, 06Infrastructure-Foundations, 07LDAP: jonkolbert is unable to reset their LDAP password - https://phabricator.wikimedia.org/T386473#10552762 (10Ladsgroup) I don't know bitu code well enough to be sure but at least from exim4 logs, it's not sending any email: ` root@idm2001:/var/log/exim4# gr...
[15:36:42] 06SRE, 10Bitu, 06Infrastructure-Foundations, 07LDAP: jonkolbert is unable to reset their LDAP password - https://phabricator.wikimedia.org/T386473#10552767 (10Urbanecm) @Ladsgroup Note the email in description is a backup email (in case we are unable to send mail to gmx.com). Their current email is `jon.ko...
[15:38:57] 06SRE, 10Bitu, 06Infrastructure-Foundations, 07LDAP: jonkolbert is unable to reset their LDAP password - https://phabricator.wikimedia.org/T386473#10552781 (10Ladsgroup) Thanks but it's still the same thing. No match has been found.
[15:43:09] (03CR) 10Michael Große: beta: A/B test setup for surfacing structured tasks (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119537 (https://phabricator.wikimedia.org/T385903) (owner: 10Sergio Gimeno)
[15:46:43] 06SRE, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10552803 (10toni.stoev) Shall a committee be formed?
[15:55:49] !log Roses are red / Violets are blue / If you hack on MediaWiki / Wikimedians <3 you! #ilovefs #wmhack
[16:00:56] !log roll restart eventgate-main in codfw for T386138
[16:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:59] T386138: Intermittent JobQueueError due to "Unable to deliver all events: 500: Internal Server Error" - https://phabricator.wikimedia.org/T386138
[16:01:09] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[16:02:06] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[16:04:32] !log roll restart eventgate-main in codfw for T386138 -- the previous command roll restarted in eqiad.
[16:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:47] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[16:05:18] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[16:10:59] hah, who did that via logmsgbot :)
[16:19:57] (03CR) 10Ottomata: [C:03+2] mediawiki.org/beacon/event - don't raise error on failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115111 (https://phabricator.wikimedia.org/T383939) (owner: 10Ottomata)
[16:20:24] (03CR) 10Ottomata: [C:04-1] "Didn't meant to merge on a Friday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115111 (https://phabricator.wikimedia.org/T383939) (owner: 10Ottomata)
[16:34:30] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f36924ed1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech
[16:34:30] ia.org/wiki/Search%23Administration
[16:34:30] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f664971a1c0: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech
[16:34:30] ia.org/wiki/Search%23Administration
[16:35:08] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fae6903a280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech
[16:35:08] ia.org/wiki/Search%23Administration
[16:35:18] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:36:12] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on relforge1007 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:36:12] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on relforge1005 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:40:25] FIRING: [8x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:43:20] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on relforge1003 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:45:25] FIRING: [9x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:50:40] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:50:40] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:51:02] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:00:25] FIRING: [9x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:03:20] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on relforge1003 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:05:25] FIRING: [9x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:06:12] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on relforge1005 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:06:12] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on relforge1007 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:10:30] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:10:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.263 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:10:52] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:18:44] (03PS1) 10Krinkle: docroot: Add experimental assetlinks.json from and to various domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119739 (https://phabricator.wikimedia.org/T385520)
[17:20:10] (03CR) 10Krinkle: "I've tested the robots.txt changed via https://robotstxt.com/tester." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119739 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle)
[17:24:48] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on relforge1006 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:25:25] FIRING: [9x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:27:10] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fligh
[17:27:10] 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:27:10] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1006 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fligh
[17:27:10] 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:27:12] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on relforge1005 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:27:12] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on relforge1007 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:27:32] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1005 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fligh
[17:27:32] 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:27:32] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1007 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fligh
[17:27:32] 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:32:22] (03PS1) 10Clare Ming: Re-enable test experiment for testwiki for upcoming demos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119740 (https://phabricator.wikimedia.org/T383801)
[17:34:48] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on relforge1006 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:35:25] FIRING: [9x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:37:12] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on relforge1007 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:37:12] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on relforge1005 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:40:34] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:44:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3549 MB (3% inode=98%): /tmp 3549 MB (3% inode=98%): /var/tmp 3549 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[17:47:20] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:52:07] arnaudb@cumin1002 arnaudb: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[17:55:58] gitlab is about to be upgraded as previously warned, sorry for the inconvenience!
[18:00:12] (03CR) 10Brennen Bearnes: [C:03+1] "Tested, confirmed working." [puppet] - 10https://gerrit.wikimedia.org/r/1119202 (https://phabricator.wikimedia.org/T347064) (owner: 10Ahmon Dancy)
[18:03:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade gitlab
[18:03:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:07:29] FIRING: [4x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:08:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:08:56] (03CR) 10Gergő Tisza: [C:03+1] docroot: Add experimental assetlinks.json from and to various domains (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119739 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle)
[18:09:31] FIRING: [4x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10553375 (10phaultfinder)
[18:35:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 354MiB (2% inode=33%): /tmp 354MiB (2% inode=33%): /var/tmp 354MiB (2% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[18:44:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3460 MB (3% inode=98%): /tmp 3460 MB (3% inode=98%): /var/tmp 3460 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[18:45:53] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:46:06] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:14:36] PROBLEM -
Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 246397656 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:15:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [19:15:36] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 49416 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:46:34] PROBLEM - BFD status on cr2-magru is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:46:38] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:47:34] RECOVERY - BFD status on cr2-magru is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:47:38] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:15:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 350MiB (2% inode=33%): /tmp 350MiB (2% inode=33%): /var/tmp 350MiB (2% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [21:17:15] (03CR) 10Brennen Bearnes: [C:03+2] Remove a condition which always returns false [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1118472 (owner: 10Aklapper) [21:17:31] (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Remove a condition which always returns false [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1118472 (owner: 10Aklapper) [21:19:47] (03CR) 10Brennen Bearnes: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1101481 (https://phabricator.wikimedia.org/T309222) (owner: 10Aklapper) [21:35:25] FIRING: [6x] SystemdUnitFailed: nginx.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:43:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:44:44] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 4.695 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [21:46:38] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [21:47:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:48:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:48:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [21:52:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [21:53:20] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [21:55:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [21:57:46] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: 
HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 6.436 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [21:59:38] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [22:09:31] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:12:44] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 4.575 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [22:15:38] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [22:41:50] I just added a patch that changes CommonSettings.php; is that handled in the normal deployment windows, or is there something else/extra that I should be aware of? [22:45:58] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:46:06] FIRING: PuppetFailure: Puppet has failed on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [22:52:36] (03PS2) 10Krinkle: docroot: Add experimental assetlinks.json from and to various domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119739 (https://phabricator.wikimedia.org/T385520) [22:52:42] (03CR) 10Krinkle: docroot: Add experimental assetlinks.json from and to various domains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119739 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [22:55:23] (03CR) 10Krinkle: docroot: Add experimental assetlinks.json from and to various domains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119739 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [23:08:50] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [23:09:40] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [23:12:20] FIRING: [2x] HelmReleaseBadStatus: Helm release eventgate-analytics/canary on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventgate-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [23:14:31] FIRING: SystemdUnitFailed: systemd-timedated.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10554143 (10phaultfinder) [23:27:20] RESOLVED: SystemdUnitFailed: systemd-timedated.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:33:44] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.498 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [23:35:20] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 346MiB (2% inode=33%): /tmp 346MiB (2% inode=33%): /var/tmp 346MiB (2% inode=33%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [23:40:40] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.038 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [23:55:20] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [23:56:44] PROBLEM - graphite.wikimedia.org api on graphite1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 4.276 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [23:58:40] RECOVERY - graphite.wikimedia.org api on graphite1005 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting