[00:20:00] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:20:04] PROBLEM - Hadoop NodeManager on an-worker1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:38:12] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106059 [00:38:12] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106059 (owner: 10TrainBranchBot) [00:39:04] RECOVERY - Hadoop NodeManager on an-worker1155 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:40:18] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:40:22] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:45:00] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:53:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421234 (10phaultfinder) [00:56:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106059 (owner: 10TrainBranchBot) [01:08:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106060 [01:08:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106060 (owner: 10TrainBranchBot) [01:11:44] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:15:44] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:25:44] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106060 (owner: 10TrainBranchBot) [01:30:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421237 (10phaultfinder) [01:35:00] PROBLEM - Hadoop NodeManager on an-worker1169 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process 
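The NodeManager PROBLEM/RECOVERY churn above comes from a simple process-count check (0 java processes matching the NodeManager main class is CRITICAL, 1 is OK). A minimal sketch of an equivalent manual check on one of the an-worker hosts, assuming the stock check_procs plugin at its usual Debian path (the path and flags are assumptions, not taken from the actual alert definition):

```bash
# CRITICAL when fewer than one matching process exists, mirroring the alert text
# "0 processes with command name java, args org.apache.hadoop.yarn...NodeManager".
/usr/lib/nagios/plugins/check_procs -c 1: -C java \
    -a org.apache.hadoop.yarn.server.nodemanager.NodeManager

# Quick ad-hoc equivalent when logged in to the worker:
pgrep -af org.apache.hadoop.yarn.server.nodemanager.NodeManager \
    || echo "NodeManager is not running"
```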
[01:39:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:43:00] RECOVERY - Hadoop NodeManager on an-worker1169 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:44:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/3878c84bb2dfb4931b55fc34e4f45aac00439adac76341f14e50d49b0a6f56aa/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:04:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:19:32] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:38:42] FIRING: [2x] JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:22] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:47:50] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:49:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421291 (10phaultfinder) [03:03:42] FIRING: [2x] JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:29:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421292 (10phaultfinder) [05:10:07] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [05:15:07] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [05:19:53] 10SRE-swift-storage: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - 
https://phabricator.wikimedia.org/T382694#10421296 (10Pppery) [05:39:20] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:43:28] 06SRE, 10Dumps 2.0, 10Dumps-Generation: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10421298 (10Marostegui) Thanks Ben! >>! In T368098#10420583, @BTullis wrote: . > > > If we revert this change before the 14th of January, then the full dump wi... [06:19:32] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:37:30] (03PS1) 10Marostegui: instances.yaml: Remove db2230 [puppet] - 10https://gerrit.wikimedia.org/r/1106197 (https://phabricator.wikimedia.org/T373579) [06:38:49] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove db2230 [puppet] - 10https://gerrit.wikimedia.org/r/1106197 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [06:58:42] RESOLVED: JobUnavailable: Reduced availability for job jmx_idp in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:59:16] FIRING: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:02:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:02:54] PROBLEM - Swift https backend on ms-fe2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [07:03:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on dbproxy1028.eqiad.wmnet with reason: maintenance [07:03:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on dbproxy1028.eqiad.wmnet with reason: maintenance [07:03:31] RESOLVED: [2x] ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:03:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on dbproxy1029.eqiad.wmnet with reason: maintenance [07:03:44] RECOVERY - Swift https backend on ms-fe2009 is OK: HTTP OK: HTTP/1.1 200 OK - 506 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 10 days, 0:00:00 on dbproxy1029.eqiad.wmnet with reason: maintenance [07:12:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:13:12] !incidents [07:13:12] 5557 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:13:12] 5556 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:13:13] 5555 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:13:15] lovely morning [07:13:18] !ack 5557 [07:13:18] 5557 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:14:46] Same thing as yesterday possibly [07:16:18] increased 504 from swift in eqsin+ulsfo? [07:16:35] codfw too, it just hasn't paged yet [07:16:43] “11:52 !log restart swift-object on ms-be2082” [07:17:36] there were some similar spikes during the night, but apparently below the threshold [07:17:36] Is what fixed it [07:22:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:23:01] !incidents [07:23:01] 5557 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:23:02] 5556 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:23:02] 5555 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [07:23:56] tbh, I am not sure how to find the misbehaving backend [07:24:04] and if it is the same one or another one [07:24:23] I can see that multiple frontends return errors, so it's not a frontend that's at fault [07:25:20] Another quote from yesterday: "swift-recon -r was getting ECONNREFUSED from ms-be2082" [07:26:02] yeah, and nothing really stands out on https://grafana.wikimedia.org/d/ygUBo45Gk/swift-object-server [07:27:51] RESOLVED: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [07:32:45] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for Ammarpad - https://phabricator.wikimedia.org/T381851#10421320 (10Ammarpad) 05Open→03Resolved It works. Thank you all. [07:46:44] !log akosiaris@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker1290.eqiad.wmnet [07:46:47] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker1290.eqiad.wmnet [07:47:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421322 (10ops-monitoring-bot) depool host wikikube-worker1290.eqiad.wmnet by akosiaris@cumin1002 with reason: T... [07:47:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421323 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by akosiaris@cumin1002 depool... 
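On the "not sure how to find the misbehaving backend" question (07:23) and yesterday's "swift-recon -r was getting ECONNREFUSED from ms-be2082" quote, swift-recon is the usual first pass; a rough sketch, runnable from any swift host in the affected cluster (the exact output shape is described from memory, so treat it as approximate):

```bash
# Last-completed object replication per backend; a node that is refusing
# connections tends to show up as a connection error against its IP
# instead of a recent timestamp.
swift-recon object -r

# Wider sweep: disk usage, async pendings, quarantined objects, load, etc.
swift-recon object --all
```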
[07:50:11] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10421337 (10MoritzMuehlenhoff) [07:50:13] (03PS2) 10Alexandros Kosiaris: wikikube: Add wikikube-ctrl1004 [puppet] - 10https://gerrit.wikimedia.org/r/1092840 (https://phabricator.wikimedia.org/T379790) [07:50:32] (03CR) 10CI reject: [V:04-1] wikikube: Add wikikube-ctrl1004 [puppet] - 10https://gerrit.wikimedia.org/r/1092840 (https://phabricator.wikimedia.org/T379790) (owner: 10Alexandros Kosiaris) [07:53:07] !log akosiaris@cumin1002 START - Cookbook sre.hosts.rename from wikikube-worker1290 to wikikube-ctrl1004 [07:53:27] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [07:57:06] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming wikikube-worker1290 to wikikube-ctrl1004 - akosiaris@cumin1002" [07:59:04] should have mentioned it, but swift-recon -r (run on both a codfw ms-fe and a codfw ms-be returned successfully, nothing suspicious at the output) [08:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T0800) [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:02:08] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming wikikube-worker1290 to wikikube-ctrl1004 - akosiaris@cumin1002" [08:02:08] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:02:09] !log akosiaris@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1004 [08:02:20] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1004 [08:02:27] !log installing libxstream-java security updates [08:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:59] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from wikikube-worker1290 to wikikube-ctrl1004 [08:03:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421356 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by akosiaris@cumin1002 from wikikube-... [08:09:06] (03PS3) 10Alexandros Kosiaris: wikikube: Add wikikube-ctrl1004 [puppet] - 10https://gerrit.wikimedia.org/r/1092840 (https://phabricator.wikimedia.org/T379790) [08:10:32] (03CR) 10Alexandros Kosiaris: [C:03+2] wikikube: Add wikikube-ctrl1004 [puppet] - 10https://gerrit.wikimedia.org/r/1092840 (https://phabricator.wikimedia.org/T379790) (owner: 10Alexandros Kosiaris) [08:10:44] (03CR) 10Muehlenhoff: "Patch looks good, but we need approval from your manager and Tyler on T382616." 
[puppet] - 10https://gerrit.wikimedia.org/r/1105952 (https://phabricator.wikimedia.org/T382616) (owner: 10MSantos) [08:15:05] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [08:15:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421366 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... [08:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421372 (10phaultfinder) [08:25:20] I'm not really convinced by the misbehaving backend theory, TBPH [08:28:18] I also ran swift-recon --all on a few nodes, but no signs of obvious errors [08:37:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:38:54] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [08:39:07] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [08:39:15] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [08:39:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... [08:42:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [09:11:35] 10SRE-swift-storage, 06Commons: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694#10421396 (10MatthewVernon) Both swift clusters do indeed have an object there - in eqiad last modified 2021-06-30, in codfw last modified 2021-01-05, and both obj... [09:28:32] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10421415 (10Volans) [09:30:27] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10421423 (10Volans) > Name of approving party (manager for WMF/WMDE staff): I am the Product Manager for the MediaWiki release working group, not sure who should... [09:30:59] (03CR) 10Volans: "Thanks for preparing already the patch. One question inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/1105952 (https://phabricator.wikimedia.org/T382616) (owner: 10MSantos) [09:32:37] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10421428 (10MoritzMuehlenhoff) [09:38:04] !log installing gtk+3.0 security updates [09:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:54] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on P{ms-fe2009*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [09:44:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on P{ms-fe2009*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [09:45:14] !log depool ms-fe2010 to attempt swap clearance [09:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:41] !log repool ms-fe2010 [09:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:22] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet [10:04:32] FIRING: [2x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:04:34] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3066.esams.wmnet [10:09:32] RESOLVED: [2x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:05] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1012-1014].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [10:20:59] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1025-1027].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [10:21:47] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1012.eqiad.wmnet with OS bookworm [10:22:39] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1025.eqiad.wmnet with OS bookworm [10:28:45] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10421459 (10Yann) This happened again while trying to upload a new version o... 
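The depool/repool of ms-fe2010 (09:45-09:50) and the conftool actions on cp3066 (10:03-10:04) are the same mechanism; a hedged sketch of the underlying confctl calls for a single host (selector modelled on the ones logged above, FQDN illustrative):

```bash
# Take one frontend out of the pool, verify its state, then put it back.
sudo confctl select 'name=ms-fe2010.codfw.wmnet' set/pooled=no
sudo confctl select 'name=ms-fe2010.codfw.wmnet' get
sudo confctl select 'name=ms-fe2010.codfw.wmnet' set/pooled=yes
```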
[10:28:53] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10421460 (10Volans) p:05Triage→03Medium [10:38:12] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1012.eqiad.wmnet with reason: host reimage [10:41:22] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1025.eqiad.wmnet with reason: host reimage [10:41:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1012.eqiad.wmnet with reason: host reimage [10:45:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1025.eqiad.wmnet with reason: host reimage [10:49:13] 10SRE-swift-storage, 06Commons, 07SVG: Check and convert SVGs on commons to have a MIME-type of image/svg+xml - https://phabricator.wikimedia.org/T382445#10421468 (10TheDJ) For any sort of maintenance, we either have to reset the mime type of all svgs, or preferably, we need to list files on swift by a heade... [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T1100) [11:00:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1012.eqiad.wmnet with OS bookworm [11:02:10] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1013.eqiad.wmnet with OS bookworm [11:05:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1025.eqiad.wmnet with OS bookworm [11:07:24] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1026.eqiad.wmnet with OS bookworm [11:07:45] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [11:07:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421472 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [11:08:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421473 (10akosiaris) I 've had to enable PXE boot on the 10G card in the BIOS to get the server to PXE, proceed... [11:14:24] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [11:14:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421489 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... 
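The repeated "Cookbook sre.hosts.downtime for 2:00:00 ... with reason: host reimage" entries (10:38 onward) are the reimage workflow silencing alerts for the host it is working on; a sketch of invoking the same cookbook by hand, with the caveat that the flag names here are assumptions and the cookbook's own --help is authoritative:

```bash
# Downtime a single host for two hours during maintenance.
# Assumed flags: --hours for the duration, --reason for the SAL/Icinga note;
# the target is given as a Cumin-style host query.
sudo cookbook sre.hosts.downtime --hours 2 --reason "host reimage" \
    'wikikube-worker1028.eqiad.wmnet'
```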
[11:19:01] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10421505 (10MSantos) [11:22:14] !log akosiaris@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [11:23:12] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [11:23:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421518 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [11:23:29] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [11:23:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421519 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... [11:24:03] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1026.eqiad.wmnet with reason: host reimage [11:25:11] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [11:27:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1026.eqiad.wmnet with reason: host reimage [11:30:52] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10421520 (10MoritzMuehlenhoff) [11:36:17] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti test cluster to Bookworm - https://phabricator.wikimedia.org/T382515#10421521 (10Volans) p:05Triage→03Medium [11:36:25] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in drmrs to Bookworm - https://phabricator.wikimedia.org/T382513#10421522 (10Volans) p:05Triage→03Medium [11:36:33] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in eqsin to Bookworm - https://phabricator.wikimedia.org/T382512#10421523 (10Volans) p:05Triage→03Medium [11:36:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10421524 (10Volans) p:05Triage→03Medium [11:36:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in esams to Bookworm - https://phabricator.wikimedia.org/T382509#10421525 (10Volans) p:05Triage→03Medium [11:37:53] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10421526 (10Volans) p:05Triage→03Medium [11:37:53] !log roll restart of all swift fes in codfw. This seems to have fixed some higher than usual cache_upload error rates. Monitoring. 
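After a rolling restart like the 11:37 one, a quick fleet-wide sanity check over the same A:swift-fe-codfw alias the cookbook targets is the natural follow-up; a sketch in which the unit name and the local healthcheck URL are assumptions (swift's healthcheck middleware, if enabled, answers on /healthcheck):

```bash
# Is the proxy service active on every codfw frontend?
sudo cumin 'A:swift-fe-codfw' 'systemctl is-active swift-proxy.service'

# Does each frontend answer its local health check? (path and port assumed)
sudo cumin 'A:swift-fe-codfw' \
    'curl -sf -o /dev/null -w "%{http_code}\n" http://localhost/healthcheck'
```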
[11:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:00] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10421527 (10Volans) p:05Triage→03Medium [11:39:45] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [11:40:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421528 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [11:40:01] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: InterfaceSpeedError - https://phabricator.wikimedia.org/T382485#10421529 (10Volans) [11:40:03] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [11:40:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421530 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host... [11:46:15] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1026.eqiad.wmnet with OS bookworm [11:47:59] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1027.eqiad.wmnet with OS bookworm [11:55:33] FIRING: KubernetesCalicoDown: wikikube-worker1290.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1290.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:59:11] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1004.eqiad.wmnet with reason: host reimage [12:01:13] 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10421531 (10akosiaris) >>! In T382404#10412551, @Urbanecm_WMF wrote: >> Since I am... 
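For the KubernetesCalicoDown alert at 11:55 (expected here, given wikikube-worker1290 is mid-rename/reimage), the usual confirmation is to look at the calico-node DaemonSet pod bound to that node; a sketch assuming calico runs in kube-system with the conventional k8s-app=calico-node label (both assumptions):

```bash
# Is there a calico-node pod scheduled on the node, and is it Ready?
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide \
    --field-selector spec.nodeName=wikikube-worker1290.eqiad.wmnet

# If it is missing or crash-looping, recent DaemonSet events usually say why.
kubectl -n kube-system describe daemonset calico-node | tail -n 20
```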
[12:02:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:02:54] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1004.eqiad.wmnet with reason: host reimage [12:03:19] !incidents [12:03:20] 5558 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [12:03:20] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [12:03:49] !log 5558 [12:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:52] !ack 5558 [12:03:53] 5558 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [12:04:47] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1027.eqiad.wmnet with reason: host reimage [12:05:13] here [12:05:45] we had the same about five hours ago [12:06:05] yeah and despite a roll restart of swift fes we are in the same situation [12:06:25] (and a similar incident on Sunday) [12:07:41] ok I have on ms-be host that isn't happy [12:07:42] says [12:07:48] [Mon Dec 23 05:03:21 2024] sd 0:0:8:0: Power-on or device reset occurred [12:07:51] a lot lot lot [12:08:01] which host is that? [12:08:07] ms-be2075.codfw.wmnet [12:08:32] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1027.eqiad.wmnet with reason: host reimage [12:08:36] that's on dmesg btw [12:09:18] oh, a ps auxw in that host is very interesting, I can see the matrix [12:09:34] rsync in DN state [12:10:03] should we remove it from the rotation ? [12:10:03] ah yes [12:10:26] SEL is empty otherwise [12:11:24] I think the dmesg error aligns with a faulty cable, let me doublecheck [12:12:51] Should we remove it first? [12:13:57] +1 [12:14:45] Not sure how though. The wikitech docs are more about decom [12:15:14] https://wikitech.wikimedia.org/wiki/Swift/Ring_Management [12:15:14] Yeah, was in the same boat [12:15:35] More specifically https://wikitech.wikimedia.org/wiki/Swift/Ring_Management#Removing_a_host [12:15:36] https://wikitech.wikimedia.org/wiki/Swift/Ring_Management#Removing_a_host [12:15:51] but not sure if there is also a more temporary CLI method [12:16:15] I'll prep a patch [12:16:38] Yeah I reach the same part of the same page, my only question is now, do we do the drain thing? [12:16:50] ack [12:17:21] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1106302 (owner: 10L10n-bot) [12:17:31] (03PS1) 10BCornwall: Swift: Remove ms-be2075 from prod hosts [puppet] - 10https://gerrit.wikimedia.org/r/1106303 [12:18:35] akosiaris: Do you know if it's also doing useful things atm too? 
[12:19:16] It does emit errors and there a task where a user complained about receiving an error to upload a video [12:19:39] My gut feeling says that error rates are low enough to allow draining [12:19:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421576 (10phaultfinder) [12:19:48] But I could be wrong [12:20:00] the reset messages are happening a few times a minute [12:20:29] Pretty worrying to say the least [12:20:37] akosiaris: Let's do that, then. I'll prep another patch to drain [12:20:44] Thanks [12:21:50] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/1106305 (owner: 10L10n-bot) [12:23:21] IIRC draining a node is rather a matter of 1-2 weeks, though? and the nodes are redundant by itself, so my understanding is that dropping it right away should be perfectly fine. we had this in the past as well, when swift be nodes failed to come up with a hw error after reboots [12:23:21] (03PS1) 10BCornwall: Swift: Drain ms-be2075 from codfw prod hosts [puppet] - 10https://gerrit.wikimedia.org/r/1106306 [12:23:49] moritzm: I defer to your judgement :) [12:24:19] moritzm: I do too, I don't have a handy memory of having to remove a node from the ring [12:25:00] (03CR) 10Alexandros Kosiaris: [C:03+1] Swift: Remove ms-be2075 from prod hosts [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (owner: 10BCornwall) [12:25:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (owner: 10BCornwall) [12:25:23] my reading of wikitech is that: [12:25:23] ok, +1ed the remove change and we can ask Matthew whether we did well in the new year [12:25:41] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1106303 shouild do the trick given that it mentions [12:26:00] "Simply removing the host entry from hosts.yaml will cause it to be immediately removed from the rings; it is generally better to drain it first (i.e. gradually remove weight from all of its devices)" [12:27:04] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4729/console" [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (owner: 10BCornwall) [12:27:29] looking at git log for the hiera file, there's now a whole lot of changes, there must also be some other way to kick a node wihout a Hiera change [12:27:43] I meant "not", not "now" [12:27:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1027.eqiad.wmnet with OS bookworm [12:27:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1025-1027].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [12:28:35] What do we risk by taking it off suddenly? [12:29:31] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1028-1030].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [12:29:36] Matthew isn't listed as on vacation in the gcal, we can also wait and doublecheck when he's back from lunch? [12:30:00] I was just checking that, no mention of it in the team meeting doc either [12:31:00] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1013.eqiad.wmnet with reason: host reimage [12:31:07] Emperor: Around? 
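The drain-versus-remove trade-off weighed above (12:23-12:26) corresponds to two different swift-ring-builder operations that the hosts.yaml-driven ring manager performs on the operators' behalf; a hedged sketch of what happens under the hood, with a placeholder IP and option spellings taken from upstream swift-ring-builder help rather than this cluster's tooling:

```bash
# Drain: lower the weight of the host's devices (the managed workflow does this
# in steps over days/weeks); data migrates gradually as rebalances happen.
swift-ring-builder object.builder set_weight --ip 10.192.0.99 0   # placeholder IP
swift-ring-builder object.builder rebalance

# Remove: drop the host's devices outright; swift then rebuilds the missing
# replicas elsewhere in one larger burst of backend I/O.
swift-ring-builder object.builder remove --ip 10.192.0.99
swift-ring-builder object.builder rebalance
```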
[12:31:12] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1028.eqiad.wmnet with OS bookworm [12:32:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [12:34:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1013.eqiad.wmnet with reason: host reimage [12:34:30] Dropped a note on -data-persistence too [12:36:41] PROBLEM - Hadoop NodeManager on an-worker1120 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:36:46] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503 for swift uploads - https://phabricator.wikimedia.org/T382705 (10TheDJ) 03NEW [12:36:55] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503 for swift uploads - https://phabricator.wikimedia.org/T382705#10421628 (10TheDJ) p:05Triage→03High [12:37:51] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10421630 (10TheDJ) Thank you for reporting @yann. I created T382705 for this... [12:37:53] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503 for swift uploads - https://phabricator.wikimedia.org/T382705#10421632 (10TheDJ) [12:38:04] Since the alert has resolved, I'd assume we're in an acceptable amount of errors and we can wait for an answer [12:38:27] If it fires again, though, I guess we can just push that patch? [12:40:18] +1, we should definitely get that node out of active duty before the holiday period, but let's wait for Matthew to get back to figure out hwo to best do that [12:40:30] Thanks for the help :) [12:41:44] akosiaris: how did you identify 2075, sudo dmesg on the ms-be hosts or something more elegant? [12:41:58] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503 for swift uploads - https://phabricator.wikimedia.org/T382705#10421633 (10TheDJ) p:05High→03Unbreak! 
[12:42:18] cumin dmesg I meant [12:42:41] RECOVERY - Hadoop NodeManager on an-worker1120 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:42:44] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503 for swift uploads - https://phabricator.wikimedia.org/T382705#10421634 (10BCornwall) 05Open→03In progress [12:43:36] (03PS2) 10BCornwall: Swift: Remove ms-be2075 from prod hosts [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) [12:44:27] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading, 13Patch-For-Review: High amount of 503 for swift uploads - https://phabricator.wikimedia.org/T382705#10421637 (10TheDJ) oh, wrong link, and wrong screenshot, I copied from the wrong browser tab :D This is the Grafana log https://grafana.wikim... [12:45:05] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading, 13Patch-For-Review: High amount of 503 for swift uploads - https://phabricator.wikimedia.org/T382705#10421639 (10BCornwall) ms-be2075 has a data link reset a few times a minute: ` [...] Dec 23 12:42:54 ms-be2075 kernel: sd 0:0:24:0: Power-on... [12:47:30] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading, 13Patch-For-Review: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705#10421640 (10TheDJ) [12:53:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1013.eqiad.wmnet with OS bookworm [12:54:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421644 (10phaultfinder) [12:54:51] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1014.eqiad.wmnet with OS bookworm [12:55:15] * Emperor back from lunch, checking scroll [12:56:08] wb :) [12:57:00] can we do this on #wikimedia-data-persistence, it's less noisy there? [12:57:10] how about _security? [12:58:12] meh, OK, if data-persistence is too obscure, we can use this channel, there's no need for a private one [12:59:03] Yeah, OK, that node does look quite sad (good spot). As pointed out above, a drain takes O(weeks), which if the node is bad enough to be causing problems is probably not what we want here. [12:59:50] What's the side effect of pushing that patch to remove it outright? i.e. will it make issues worse? [12:59:52] yeah, this also got reported in Phab: https://phabricator.wikimedia.org/T382705 [13:00:03] brett: cumin dmesg indeed [13:00:49] this may be unknowable, but: is it likely a cable needs reseating, or a more involved fix, and do we have any dcops available in codfw today? [13:02:31] If it's likely fixable later today, we could disable it pro tem, but otherwise if it's causing our current woes then dropping it from the ring is likely best. Removing a host big-bang like that causes a chunk of swift I/O right away and isn't ideal in the round (which is why we favour gradual changes), but it may be the least-bad answer here. [13:04:14] The disks themselves appear to be passing checks so a guess would indeed be a cable swap [13:05:17] by "a chunk" do you mean i/o usage spike? If so, is the fear resource exhaustion? 
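The "cumin dmesg" that pinpointed ms-be2075 (12:41-13:00) would look roughly like this, reusing the 'A:swift and A:codfw' selector that appears later in the log; the grep pattern is simply the kernel message quoted at 12:07:

```bash
# Sweep the codfw swift hosts for device resets; prints a per-host match count.
# "|| true" keeps clean hosts (count 0, grep exit 1) from being reported as failed.
sudo cumin 'A:swift and A:codfw' \
    'dmesg -T | grep -c "Power-on or device reset occurred" || true'
```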
[13:05:43] (03CR) 10MVernon: [C:04-1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) (owner: 10BCornwall) [13:06:14] ack [13:06:15] Yeah, reduced availability for client i/o [13:06:27] I checked SEL and no errors are logged there (so the drives seem healthy) and the only place in Linux where this log msg is emitted is for a SDEV_EVT_POWER_ON_RESET_OCCURRED event (which on a healthy system is only done when a device starts up), so I'm fairly sure that drive has a broken power supply over its cable [13:07:29] (03PS3) 10BCornwall: Swift: Remove ms-be2075 from prod hosts [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) [13:07:32] moritzm: it's >1 drive, isn't it? [13:08:37] (03CR) 10BCornwall: Swift: Remove ms-be2075 from prod hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) (owner: 10BCornwall) [13:08:46] that's true, and I have no idea how these drives are connected, like whether they all originate from one power connector etc. [13:09:14] (03PS4) 10BCornwall: Swift: Mark ms-be2075 as failed, remove from prod [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) [13:09:20] so if the originating end on the mainboard is faulty it seems unlikely that it can get fixed even if anyone is in the DC today I suppose [13:09:49] Oh, yes, fair [13:11:50] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1014.eqiad.wmnet with reason: host reimage [13:12:27] good spelunking :) [13:12:48] So then it sounds like pushing the patch is the way forward [13:13:08] Sounds that way to me too [13:13:42] brett: patch still wrong, my fault, I was bamboozled by gerrit UI [13:13:54] Trying to produce a coherent comment this time :( [13:14:30] oh yeah, dang [13:14:40] yeah, I'll fix it up [13:14:43] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: host reimage [13:14:47] +1 on taking the node out for the holiday period [13:14:56] (03CR) 10MVernon: [C:04-1] "mea culpa" [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) (owner: 10BCornwall) [13:15:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1014.eqiad.wmnet with reason: host reimage [13:15:34] (03PS5) 10BCornwall: Swift: Mark ms-be2075 as failed, remove from prod [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) [13:15:56] Should be fixed now [13:16:23] (03CR) 10BCornwall: Swift: Mark ms-be2075 as failed, remove from prod (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) (owner: 10BCornwall) [13:16:55] Arturo wrote a prometheus collector which collects errors from dmesg, it's applied to cloud production nodes (to spot errors on cloudgw and cloudvirt*) and opens Phab tasks on new errors, might be worth exploring for swift nodes as well next year [13:17:09] prometheus::node_kernel_panic [13:18:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1028.eqiad.wmnet with reason: host reimage [13:18:05] (and also for other production nodes, just mentioning it here it could have spotted the error earlier than us) [13:18:16] (03CR) 10MVernon: [C:03+1] "Thanks!"
[puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) (owner: 10BCornwall) [13:18:38] Okay, I'm going to push the patch [13:18:41] +1 [13:18:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:18:51] (03CR) 10BCornwall: [C:03+2] Swift: Mark ms-be2075 as failed, remove from prod [puppet] - 10https://gerrit.wikimedia.org/r/1106303 (https://phabricator.wikimedia.org/T382705) (owner: 10BCornwall) [13:19:01] perfect timing :-) [13:19:04] hahaha [13:19:11] probably worth my poking the ring_manager again, the timer wont fire for a while let. [13:19:20] brett: LMK once merged? [13:19:46] (03PS1) 10Urbanecm: [Growth] Remove social campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106316 (https://phabricator.wikimedia.org/T382499) [13:20:29] merged [13:20:32] Emperor: [13:20:36] ack [13:20:52] I'll open a DC ops task, unless someone is already on it? [13:21:01] please do, I'm going to be poking the rings for a bit [13:21:15] ack [13:21:26] Emperor: Is that to say you're applying the change? I was going to run puppet on swift [13:21:30] I think dcops will not be in the DC today [13:22:01] brett: yes, puppet rolls out the new hosts file, but there's a timer that fires hourly that actually makes the ring manager consider if there's anything to do [13:22:05] at least there are vacation items in the calendar for almost all, not sure about pa.paul [13:22:18] volans: yeah, it's just to have this fixed eventually when they are back [13:23:29] Running puppet on swift [13:23:37] Emperor: for future potential occurrence is an iptables rule or power off a valid temporary solution without removing it from the ring? [13:23:56] temporary as in until you or dcops can have a look [13:24:18] (03CR) 10Urbanecm: "question: a similar enwiki task (T378287#10341850) mention certain database issues. what are those? can we be sure they aren't relevant fo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [13:24:21] volans: yeah, if you shut the node down swift will start working around it; I think the ring change is somewhat less disruptive, but just powering it off is OK [13:24:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421662 (10phaultfinder) [13:24:48] but the power off wouldn't trigger the rebalance right? [13:25:22] volans: in theory swift will notice the failed host and start making its own copies on the backup devices in the ring. [13:25:32] puppet agent run [13:25:33] but you can't really inspect the process [13:25:43] ah ok, so there is no escape from the increased I/O I see [13:26:02] Alas no. [13:26:29] there should be a way to tell swift "hey I'm ok with reduced number of copies for now, let it be" [13:26:54] Mmm, ceph has flags you can set for that sort of thing [13:27:27] (whether we would be happy with a node out all holiday is another question, but ceph has tunables for how much recovery it does at once too) [13:27:32] (03CR) 10Urbanecm: [C:04-1] "-1, pending clarifications to my comments/questions. 
Today is the last deployment day of 2024, and this patch is fairly risky and doesn't " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [13:30:11] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707 (10MoritzMuehlenhoff) 03NEW [13:30:16] 10ops-codfw, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10421700 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:30:27] Thank you, moritzm :) [13:31:06] !incidents [13:31:07] 5559 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [13:31:07] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [13:31:07] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [13:32:58] Hm, that's taking too long because it's doing the usual dispersion check, which is in turn taking ages. [13:33:45] Emperor: I ran run-puppet-agent on 'A:swift and A:codfw'. Was that the correct selector? [13:33:50] I'm going to stop that, and use the hammer ( /usr/local/bin/swift_ring_manager -o /var/cache/swift_rings --doit --skip-dispersion-check --skip-replication-check --immediate-only ) [13:34:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1014.eqiad.wmnet with OS bookworm [13:34:08] brett: that was more than enough, but yes, thanks. [13:34:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1012-1014].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [13:36:27] Bother [13:36:35] "Need to wait 23945 seconds before changing ring /etc/swift/object.builder" [13:36:54] we can't apply the desired change until 20:15:24 this evening [13:36:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1028.eqiad.wmnet with OS bookworm [13:37:49] swift rings have a "minimum time between changes" that swift enforces, and routine load/drain operation made one sufficiently recently that we can't drop ms-be2075 from the rings until then. [13:38:40] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1029.eqiad.wmnet with OS bookworm [13:38:55] uh [13:39:06] 7 hours? [13:40:04] The min change time for the prod rings is 12 hours [13:40:44] 20:15 is 7ish hours from now, I meant [13:40:52] Okay, I'll explain in the ticket [13:40:53] Yes. [13:40:54] what? [13:41:01] omg [13:41:09] seems swift is a bit of a misnomer :-) [13:41:14] Yeah, not ideal [13:41:17] to say the least [13:41:34] You can see this yourself with e.g. "sudo swift-ring-builder /etc/swift/object.builder" on any ms host [13:41:40] I'm guessing that anything other than waiting would be "very bad"™? [13:41:45] 'The minimum number of hours before a partition can be reassigned is 12 (6:33:28 remaining)' [13:41:51] (03PS2) 10Urbanecm: [Growth] Remove Marketing campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106316 (https://phabricator.wikimedia.org/T382499) [13:42:19] ms-be2075 has picked an especially piquant time for its hardware to start misbehaving [13:45:24] still better today than in the coming days :-) [13:45:28] I think because you have to wait for the data moves caused by the previous ring change to have taken effect before you can safely change them again. 
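The dispersion and replication checks that the "hammer" invocation above skips (and that the ring manager normally waits on between ring changes) can also be run by hand; a sketch, assuming the dispersion tooling is configured on the host it is run from, which the ring manager's own dispersion check implies:

```bash
# Statistical sample of whether object copies are where the rings say they
# should be; the ring manager waits for this to get back to 100%.
sudo swift-dispersion-report

# Last-completed replication age per backend, the other thing it waits on:
swift-recon object -r
```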
[13:47:20] https://docs.openstack.org/mitaka/install-guide-rdo/swift-initial-rings.html suggests that it's usually one hour - do we configure it to 12 ourselves? [13:47:38] brett: I presume past-us did when setting up the ms clusters [13:48:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:49:05] hm. Any way to manually check/run things? I may be missing something but that seems like a glorified "sleep" then [13:49:26] brett: how do you mean check/run things? [13:51:22] "wait for the data moves caused by the previous ring change to have taken effect" made me think that there might be a mechanism to verify the propagation of the ring changes rather than a blanket 12 hour wait [13:51:29] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10421728 (10MoritzMuehlenhoff) [13:51:38] hahahahaha oh you sweet summer child [13:51:57] ahem, sorry, I mean, "no there is not an obvious such mechanism, we have some heuristics we use" [13:52:19] "We have a state-of-the-art AI" [13:53:02] the ring manager waits for dispersion to return to 100% (which is effectively a statistical sampling of "is all the data where the rings think it should be") and for the "longest ago host completed replication" to be less than an hour ago [13:54:29] brett: cf https://grafana.wikimedia.org/goto/L6seBrSHg?orgId=1 which is the process happening on thanos-swift ; each dip is another ring change happening, and then dispersion returns to 100% as the cluster shuffles data. [13:55:05] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1029.eqiad.wmnet with reason: host reimage [13:57:44] We _could_ power-off the host now, but I _think_ that's riskier now (and probably risks two disruptive data moves, one now and another when the ring change happens) [13:57:53] I was just about to ask about that [13:58:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1029.eqiad.wmnet with reason: host reimage [13:59:00] I just checked dmesg and the first errors on ms-be2075 date back until the 21st, so let's just not risk any more things and just wait our the remaining six hours [13:59:13] "wait out" [13:59:33] Did they become more frequent since was only recently a task opened, though? [13:59:45] * brett should not be lazy and just check himself [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T1400). [14:00:05] ZhaoFJx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] i can deploy today [14:00:22] Thanks [14:00:37] ZhaoFJx: can you take a look at my comment on the patch, please? 
[14:00:52] Sure [14:01:17] 20:15 is a bit outside my working hours, but I'll come back then to push the ring change and monitor [14:01:34] (I might bunk off a bit early before then to make up :) ) [14:01:41] I'm not the original author, so I'll try to answer [14:02:09] I am not sure about database, but for the scrutineer group, community decide to give it as temporary permissions [14:02:34] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705#10421729 (10BCornwall) ms-be2075 will be effectively removed from the ring (weights set to 0), but a small snag: Swift rings have an enforced minimum time betwe... [14:03:03] Consensus link: https://w.wiki/CVuG [14:03:15] ZhaoFJx: but grantable by who? that's not clear from the conversation on the task. +the concerns regarding the DB impact are problematic by itself [14:03:48] let's gather more info and return to this in January. I'm not comfortable with the change running this late in the year. [14:03:48] PROBLEM - Host wikikube-ctrl1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:04:06] I asked Stang myself too, they said "Steward will give this flag" [14:04:32] urbanecm Sure! Happy new year and Merry Chrismas [14:05:46] Okay, I'm going to head back to bed for a while :) [14:06:00] Thanks so much all :) [14:06:04] RECOVERY - Host wikikube-ctrl1004 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:06:12] FIRING: [2x] ProbeDown: Service wikikube-ctrl1004:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:06:16] brett: thanks for the help (and sorry to drag you out of bed!) [14:06:30] ... [14:06:49] !incidents [14:06:49] 5560 (UNACKED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [14:06:49] 5559 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:06:50] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:06:50] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:06:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:06:54] !ack 5560 [14:06:54] 5560 (ACKED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [14:07:01] related to your reimage I suppose? [14:07:01] the ctrl1004 is me [14:07:04] ok [14:07:09] !incidents [14:07:10] 5560 (ACKED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [14:07:10] 5561 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:07:10] 5559 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:07:10] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:07:11] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:07:13] the swift one though that just fired... 
no [14:07:16] !ack 5561 [14:07:17] 5561 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:07:49] is it worth depooling codfw-swift (in favour of eqiad) until this evening? It'll still get write traffic, but reads would move [14:08:39] If it lessens the load on the faulty node, that sounds reasonable [14:08:44] is that hard to do? [14:09:03] Simple Matter of Confctl, m'lud [14:09:19] That sounds splendid, then [14:09:51] Emperor: I suppose so? IT would lessen for sure errors viewable by users [14:10:02] !log depool codfw swift T382705 [14:10:03] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1031-1033].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [14:10:03] Emperor: you got it or shall I? [14:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:06] !log mvernon@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [14:10:06] T382705: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705 [14:10:11] oh guess you do :) [14:10:17] {{done}} [14:10:25] FIRING: [2x] SystemdUnitFailed: etcd.service on wikikube-ctrl1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:10:28] We should probably put that back before the US-folks finish for today [14:10:36] ack [14:11:36] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705#10421748 (10MatthewVernon) The depool won't entirely help (writes always go to both clusters), but diverting read traffic to eqiad swift should help mitigate us... [14:11:44] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1031.eqiad.wmnet with OS bookworm [14:12:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-ctrl1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl1004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:15:56] [aside: Ceph also alerts on OSDs with slow IOPS, which would have caught this sooner] [14:16:21] I need to be AFK for a smidge, if anything catches fire, send me a signal msg? 
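The depool above goes through conftool's DNS-discovery objects. The selector and action are taken from the SAL entry; the exact confctl flags below are assumed from standard conftool usage rather than from this log:

  # Shift swift read traffic away from codfw (writes still go to both clusters):
  sudo confctl --object-type discovery select 'dnsdisc=swift,name=codfw' set/pooled=false

  # Inspect the current state of both sides:
  sudo confctl --object-type discovery select 'dnsdisc=swift' get

  # Repool once the faulty backend is out of the rings (done at 20:59 later in this log):
  sudo confctl --object-type discovery select 'dnsdisc=swift,name=codfw' set/pooled=true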
[14:17:30] ack [14:17:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1029.eqiad.wmnet with OS bookworm [14:19:24] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1030.eqiad.wmnet with OS bookworm [14:19:37] (03PS1) 10Alexandros Kosiaris: Add wikikube-ctrl1004 to etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1106320 (https://phabricator.wikimedia.org/T379790) [14:21:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqsin&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:25:31] (03CR) 10Alexandros Kosiaris: [C:03+2] Add wikikube-ctrl1004 to etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1106320 (https://phabricator.wikimedia.org/T379790) (owner: 10Alexandros Kosiaris) [14:27:51] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1031.eqiad.wmnet with reason: host reimage [14:32:21] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1031.eqiad.wmnet with reason: host reimage [14:35:49] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1030.eqiad.wmnet with reason: host reimage [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:37] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10421774 (10MoritzMuehlenhoff) [14:39:17] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1030.eqiad.wmnet with reason: host reimage [14:42:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [14:42:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [14:43:12] 10SRE-swift-storage, 06Commons: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694#10421778 (10Pppery) (with the caveat that I'm not super familiar with this) It seems like the easiest next step is to delete the image at `/8/88/Model_4000-First... 
[14:43:26] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift [14:43:58] !incidents [14:43:59] 5560 (ACKED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [14:43:59] 5562 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [14:43:59] 5563 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:44:00] 5561 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:44:00] 5559 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:44:00] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:44:00] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:44:08] !ack 5562 [14:44:09] 5562 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [14:44:10] !ack 5563 [14:44:11] 5563 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:44:17] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:44:24] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Swift [14:44:38] !incidents [14:44:39] 5560 (ACKED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [14:44:39] 5562 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [14:44:39] 5563 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:44:39] 5561 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:44:40] 5559 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:44:40] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:44:40] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:45:25] RESOLVED: [3x] SystemdUnitFailed: etcd.service on wikikube-ctrl1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:12] RESOLVED: [2x] ProbeDown: Service wikikube-ctrl1004:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:47:08] oh perfect [14:47:12] !incidents [14:47:12] 5562 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [14:47:12] 5563 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:47:12] 5564 (UNACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:47:13] 
5560 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [14:47:13] 5561 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:47:13] 5559 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:47:13] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:47:14] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:47:25] !ack 5564 [14:47:26] 5564 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:47:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-ctrl1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl1004 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:47:50] esams is new [14:47:52] 2k 5xx ? [14:48:08] wow this is bad [14:48:18] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [14:48:24] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Swift [14:48:34] Noting that I'm having issues with deleting things on Commons: getting "An unknown error occurred in storage backend "local-swift-eqiad"." and also some error viewing thumbnails. [14:48:54] mdaniels5757: thanks, it appears to be a bigger overall problem, we are investigating [14:49:10] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift [14:49:24] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Swift [14:49:28] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.098 second response time https://wikitech.wikimedia.org/wiki/Swift [14:50:10] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Swift [14:50:18] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Swift [14:50:28] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Swift [14:50:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1031.eqiad.wmnet with OS bookworm [14:51:18] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Swift [14:51:22] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 4.244 second response time https://wikitech.wikimedia.org/wiki/Swift [14:51:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1004.eqiad.wmnet with OS bookworm [14:51:42] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time 
https://wikitech.wikimedia.org/wiki/Swift [14:51:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421803 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host wiki... [14:51:54] this seems to be traffic induced, switching to other channels [14:52:42] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift [14:52:56] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1032.eqiad.wmnet with OS bookworm [14:53:18] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Swift [14:53:30] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift [14:54:12] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [14:54:18] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Swift [14:54:22] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [14:54:28] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Swift [14:55:12] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Swift [14:55:20] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Swift [14:55:28] FIRING: SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:55:33] FIRING: KubernetesCalicoDown: wikikube-ctrl1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-ctrl1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:56:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [14:57:00] !incidents [14:57:01] 5562 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [14:57:01] 5563 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [14:57:01] 5564 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [14:57:01] 5560 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [14:57:01] 5561 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:57:02] 5559 (RESOLVED) 
ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:57:02] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:57:02] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [14:58:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1030.eqiad.wmnet with OS bookworm [14:58:15] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1028-1030].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [14:59:17] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:02:12] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:43] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [15:02:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [15:05:28] RESOLVED: SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:06:04] !incidents [15:06:04] 5564 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [15:06:05] 5563 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [15:06:05] 5562 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [15:06:05] 5560 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [15:06:05] 5561 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [15:06:05] 5559 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [15:06:06] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [15:06:06] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:51] RESOLVED: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:11:26] !log 
jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1032.eqiad.wmnet with reason: host reimage [15:14:46] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1032.eqiad.wmnet with reason: host reimage [15:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421830 (10phaultfinder) [15:25:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:27:35] this isn't like the previous one [15:27:56] instead of 2k rps, we now have errors just above the threshold (3 rps) [15:28:02] what now [15:30:33] FIRING: [2x] KubernetesCalicoDown: wikikube-ctrl1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:30:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:35:55] (03PS1) 10Alexandros Kosiaris: Add wikikube-ctrl1004.eqiad.wmnet to cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/1106322 (https://phabricator.wikimedia.org/T379790) [15:36:15] (03CR) 10CI reject: [V:04-1] Add wikikube-ctrl1004.eqiad.wmnet to cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/1106322 (https://phabricator.wikimedia.org/T379790) (owner: 10Alexandros Kosiaris) [15:37:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1032.eqiad.wmnet with OS bookworm [15:38:22] (03PS2) 10Alexandros Kosiaris: Add wikikube-ctrl1004.eqiad.wmnet to cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/1106322 (https://phabricator.wikimedia.org/T379790) [15:39:06] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1033.eqiad.wmnet with OS bookworm [15:40:01] (03CR) 10Alexandros Kosiaris: [C:03+2] Add wikikube-ctrl1004.eqiad.wmnet to cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/1106322 (https://phabricator.wikimedia.org/T379790) (owner: 10Alexandros Kosiaris) [15:45:33] RESOLVED: KubernetesCalicoDown: wikikube-ctrl1004.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-ctrl1004.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:53:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10421845 (10akosiaris) 05Open→03Resolved a:03akosiaris box reimaged, BGP set up, calico double checked.... 
[15:56:49] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1033.eqiad.wmnet with reason: host reimage [16:01:48] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1033.eqiad.wmnet with reason: host reimage [16:05:32] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714 (10LSobanski) 03NEW [16:06:17] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: Dell PowerEdge RAID Controller (instance an-presto1016) - https://phabricator.wikimedia.org/T382714#10421880 (10LSobanski) Alert is active for an-presto1016, an-presto1017, an-presto1019, an-presto1020. [16:19:36] (03PS1) 10Andrew Bogott: role/profile for a simple vps-hosted dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/1106324 (https://phabricator.wikimedia.org/T374830) [16:19:55] (03CR) 10CI reject: [V:04-1] role/profile for a simple vps-hosted dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/1106324 (https://phabricator.wikimedia.org/T374830) (owner: 10Andrew Bogott) [16:20:46] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1033.eqiad.wmnet with OS bookworm [16:20:49] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1031-1033].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [16:22:22] (03PS2) 10Andrew Bogott: role/profile for a simple vps-hosted dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/1106324 (https://phabricator.wikimedia.org/T374830) [16:22:41] (03CR) 10CI reject: [V:04-1] role/profile for a simple vps-hosted dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/1106324 (https://phabricator.wikimedia.org/T374830) (owner: 10Andrew Bogott) [16:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421892 (10phaultfinder) [16:30:05] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T1630). [16:32:46] (03PS3) 10Andrew Bogott: role/profile for a simple vps-hosted dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/1106324 (https://phabricator.wikimedia.org/T374830) [16:33:06] (03CR) 10CI reject: [V:04-1] role/profile for a simple vps-hosted dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/1106324 (https://phabricator.wikimedia.org/T374830) (owner: 10Andrew Bogott) [16:34:09] (03PS4) 10Andrew Bogott: role/profile for a simple vps-hosted dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/1106324 (https://phabricator.wikimedia.org/T374830) [16:37:07] (03CR) 10Andrew Bogott: [C:03+2] role/profile for a simple vps-hosted dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/1106324 (https://phabricator.wikimedia.org/T374830) (owner: 10Andrew Bogott) [16:39:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421899 (10phaultfinder) [16:51:36] (03PS7) 10Tiziano Fogli: ripeatlas: remove hardcoded measurements [alerts] - 10https://gerrit.wikimedia.org/r/1105747 [16:51:36] (03CR) 10Tiziano Fogli: "This is ready for review." 
[alerts] - 10https://gerrit.wikimedia.org/r/1105747 (owner: 10Tiziano Fogli) [17:09:31] 10SRE-swift-storage, 06Commons: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694#10421916 (10Ladsgroup) if the image exists in deleted container, I agree, just deleting the file from the public container is the right thing to do. In fact, I th... [17:22:29] !log swift delete wikipedia-commons-local-public.88 8/88/Model_4000-First_of_Odakyu_Electric_Railway_2.JPG T382694 [17:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:33] T382694: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694 [17:23:46] 10SRE-swift-storage, 06Commons: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694#10421930 (10MatthewVernon) Done, in both clusters. [17:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421931 (10phaultfinder) [17:26:16] 10SRE-swift-storage, 06Commons: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694#10421937 (10Pppery) @Sreejithk2000 Could you try undeleting the file again now? [17:27:57] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:28:47] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:29:46] (03PS2) 10FNegri: Allow pty allocation for cumin ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) [17:31:16] (03CR) 10FNegri: "> If we want to be on safe side we could just add the -T Disable pseudo-terminal allocation. CLI option to cumin's ssh_config file" [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) (owner: 10FNegri) [17:37:50] (03CR) 10BCornwall: [C:03+1] Add wikikube-ctrl1004 to etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1106320 (https://phabricator.wikimedia.org/T379790) (owner: 10Alexandros Kosiaris) [17:47:50] (03CR) 10Majavah: Allow pty allocation for cumin ssh keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) (owner: 10FNegri) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T1800) [18:00:05] ryankemper: gettimeofday() says it's time for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T1800) [18:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10421984 (10phaultfinder) [18:44:23] 10SRE-swift-storage, 06Commons: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694#10422007 (10MGA73) Ooops. Sorry. I undeleted the file and it worked fine. 
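The "swift delete" SAL entry above uses the standard python-swiftclient CLI. A sketch of the workflow is below, with placeholder credentials (the ST_AUTH/ST_USER/ST_KEY values are illustrative, not taken from this log), run against both ms clusters as noted on the ticket:

  # Placeholder credentials for the media storage account:
  export ST_AUTH=https://swift-frontend.example/auth/v1.0
  export ST_USER=mw:media
  export ST_KEY=not-the-real-key

  # Confirm the leftover object is present in the public container...
  swift stat wikipedia-commons-local-public.88 8/88/Model_4000-First_of_Odakyu_Electric_Railway_2.JPG

  # ...then remove it so the MediaWiki undelete can recreate it cleanly:
  swift delete wikipedia-commons-local-public.88 8/88/Model_4000-First_of_Odakyu_Electric_Railway_2.JPG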
[18:45:56] 10SRE-swift-storage, 06Commons: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694#10422009 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Great, thanks, I'll close this ticket now :) [19:21:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:22:49] Here. [19:22:56] !incidents [19:22:57] 5566 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [19:22:58] 5565 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [19:22:58] 5564 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [19:22:58] 5563 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [19:22:58] 5562 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [19:22:58] 5560 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [19:22:59] 5561 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [19:22:59] 5559 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [19:22:59] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [19:23:00] 5557 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [19:23:07] We got a page brett [19:23:12] thank you, I'm aware [19:23:21] Nothing we can do here [19:23:28] !ack 5566 [19:23:29] 5566 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [19:24:13] denisse: See https://phabricator.wikimedia.org/T382705 [19:24:50] one more hour :) [19:25:32] Thanks, looking. [19:26:30] Dec 23 19:25:37 ms-be1075 swift-container-reconciler: Timeout connecting to memcached: 10.64.0.45:11211 (txn: tx8d3c4688788845e48b001-006769b931) is a bit odd [19:27:24] that node shouldn't be trying to talk to moss-fe1001, which hasn't been in that swift cluster for months. 
[19:30:10] and it's not listed in profile::swift::proxy::memcached_servers for eqiad either [19:33:04] !log restart swift-container-reconciler on ms-be1075 [19:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422086 (10phaultfinder) [19:35:39] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift [19:36:37] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift [19:36:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [19:36:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [19:37:12] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:37:34] whoa [19:37:46] (03PS1) 10Bartosz Dziewoński: Fix Azeri alias lang code [extensions/UrlShortener] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1106340 (https://phabricator.wikimedia.org/T382717) [19:38:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/UrlShortener] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1106340 (https://phabricator.wikimedia.org/T382717) (owner: 10Bartosz Dziewoński) [19:38:03] envoy on that node is using a lot of CPU [19:38:58] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:39:17] cause? 
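A quick way to confirm the stale-memcached suspicion above and apply the same fix. The ss/systemctl usage is generic, the IP and unit name come from the log lines above, and the "old server list held since startup" explanation is an assumption, not something confirmed here:

  # On ms-be1075: is anything still holding connections to the long-retired memcached host?
  sudo ss -tnp | grep '10.64.0.45:11211'

  # The long-running reconciler presumably still had the old memcached list loaded,
  # so a restart makes it pick up the current configuration:
  sudo systemctl restart swift-container-reconciler
  sudo journalctl -u swift-container-reconciler --since '10 minutes ago'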
[19:39:27] FIRING: SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:39:43] jouncebot: next [19:39:43] In 1 hour(s) and 20 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T2100) [19:40:07] !incidents [19:40:08] 5566 (ACKED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [19:40:08] 5567 (UNACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [19:40:08] 5568 (UNACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [19:40:08] 5569 (UNACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [19:40:09] 5565 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [19:40:09] 5564 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams) [19:40:09] 5563 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [19:40:09] 5562 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [19:40:09] 5560 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad) [19:40:10] 5561 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [19:40:10] 5559 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [19:40:11] 5558 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin) [19:40:26] !ack 5567 5568 5569 [19:40:27] Could not ack the alert. Please check the parameters. [19:40:31] are any deployers around today? i just scheduled a last-minute patch for the last-minute window (sorry :) ) [19:40:34] !ack 5567 [19:40:34] 5567 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [19:40:39] !ack 5568 [19:40:40] 5568 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [19:40:47] !ack 5569 [19:40:47] 5569 (ACKED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad) [19:41:07] varnish/haproxy being unavailable is separate, surely [19:42:01] PROBLEM - Swift https backend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.748 second response time https://wikitech.wikimedia.org/wiki/Swift [19:42:12] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:42:22] not sure what is going on today :( [19:42:25] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [19:43:07] RECOVERY - Swift https backend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 8.766 second response time https://wikitech.wikimedia.org/wiki/Swift [19:43:23] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [19:43:58] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:44:00] MatmaRex: I can deploy as long as SRE is fine with me deploying in the middle of this swift mess [19:44:39] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [19:44:42] taavi: How long can you wait? [19:44:53] I figure one mess at a time would be ideal [19:44:59] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Swift [19:45:37] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Swift [19:45:50] whatever is up with eqiad-swift is not AFAICT the problem with codfw-swift we hope to resolve by removing the unhappy ms-be2075 later [19:46:19] (I mean, unless another backend has gone pop) [19:46:25] taavi: brett: not urgently, the usual window is in an hour :) [19:46:26] jouncebot: next [19:46:26] In 1 hour(s) and 13 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T2100) [19:46:32] at least I'm not in a hurry to anywhere [19:46:33] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Swift [19:46:41] thank you [19:46:51] FIRING: [2x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:47:12] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:47:59] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 297 bytes in 1.211 second response time https://wikitech.wikimedia.org/wiki/Swift [19:48:27] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Swift [19:49:25] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Swift [19:49:27] RESOLVED: SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader2002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:49:33] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 1.043 second response time https://wikitech.wikimedia.org/wiki/Swift [19:49:39] PROBLEM - Swift https backend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.048 second response time https://wikitech.wikimedia.org/wiki/Swift [19:50:33] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [19:50:34] Lot of random date-based uri queries to upload [19:50:39] RECOVERY - Swift https backend on ms-fe1013 is OK: HTTP OK: 
HTTP/1.1 200 OK - 503 bytes in 1.582 second response time https://wikitech.wikimedia.org/wiki/Swift [19:51:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [19:52:25] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Swift [19:52:59] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Swift [19:54:01] I'm currently at a loss, I'm afraid :( [19:54:17] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:54:27] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [19:54:57] FIRING: [2x] SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [19:54:59] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.974 second response time https://wikitech.wikimedia.org/wiki/Swift [19:55:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:56:25] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Swift [19:57:05] Emperor: Theory: cachebusting on upload with those date query params causing the cdn to overwhelm swift's envoy? 
[19:57:12] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:57:33] brett: probably should discuss on the other channel [19:57:36] ack [19:58:25] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift [19:59:17] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:59:59] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift [20:00:25] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.930 second response time https://wikitech.wikimedia.org/wiki/Swift [20:00:27] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [20:00:57] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Swift [20:00:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:01:33] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/Swift [20:02:12] FIRING: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:02:14] Emperor: Are you aware of anyone that's dealt with this before? 
[20:02:25] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.080 second response time https://wikitech.wikimedia.org/wiki/Swift [20:02:40] I can't remember something this bad on swift in my time here [20:02:49] I think it's incident time [20:03:33] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Swift [20:03:38] (sorry, heading over to _security [20:03:51] PROBLEM - Swift https backend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [20:04:51] RECOVERY - Swift https backend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.936 second response time https://wikitech.wikimedia.org/wiki/Swift [20:04:57] FIRING: [2x] SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:05:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:07:12] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:10:24] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705#10422114 (10TheDJ) Ehm. it this a problem ? or a side effect of the depool taking effect after that 20:15 window ? 
{F58048216} [20:10:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:11:43] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [20:11:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:11:51] FIRING: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:14:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422120 (10phaultfinder) [20:14:57] RESOLVED: [2x] SystemdUnitCrashLoop: mjolnir-kafka-bulk-daemon.service crashloop on search-loader1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [20:16:51] RESOLVED: [7x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:16:56] !log weighted ms-be2075 to zero T382705 T382707 [20:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:02] T382705: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705 [20:17:02] T382707: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707 [20:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422131 (10phaultfinder) [20:37:42] !log cumin run on swift nodes [20:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:50:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=upload&var-origin=swift.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [20:58:16] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705#10422179 (10BCornwall) @TheDJ That was a result of a separate issue that is now resolved (it's been quite a day for swift!) 
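In upstream terms, "weighted ms-be2075 to zero" is a set_weight plus rebalance on each of the three rings. At WMF this goes through the swift_ring_manager tooling mentioned earlier in the day, so the raw commands below are only an illustrative sketch (d123 is a placeholder device id):

  # List devices and current weights (the bare dump shows every device on ms-be2075):
  sudo swift-ring-builder /etc/swift/object.builder

  # Drain the host: weight 0 means no new partitions land on it and existing ones
  # move away on the next rebalance, which honours the min_part_hours lockout
  # discussed earlier.
  sudo swift-ring-builder /etc/swift/object.builder set_weight d123 0
  sudo swift-ring-builder /etc/swift/object.builder rebalance

  # Repeat for container.builder and account.builder, then distribute the new
  # .ring.gz files to the cluster (handled here by the ring-manager tooling).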
[20:59:18] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=swift,name=codfw [20:59:49] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705#10422185 (10TheDJ) >>! In T382705#10422179, @BCornwall wrote: > (it's been quite a day for swift!) @BCornwall lets just hope then that it had to get this out o... [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T2100). [21:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:25] hi [21:00:34] is it safe to deploy now, or should we wait/cancel? [21:04:47] !log depool/restart/repoo ms-fe1013 [21:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:55] hmm [21:14:00] MatmaRex: I think you can proceed [21:14:27] 10SRE-swift-storage: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913#10422199 (10andrea.denisse) [21:15:04] is anyone around who can do it? [21:17:54] I can deploy [21:18:29] (03CR) 10Zabe: [C:03+2] Fix Azeri alias lang code [extensions/UrlShortener] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1106340 (https://phabricator.wikimedia.org/T382717) (owner: 10Bartosz Dziewoński) [21:21:06] (03Merged) 10jenkins-bot: Fix Azeri alias lang code [extensions/UrlShortener] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1106340 (https://phabricator.wikimedia.org/T382717) (owner: 10Bartosz Dziewoński) [21:24:07] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1106340|Fix Azeri alias lang code (T382717 T381048)]] [21:24:13] T382717: UrlShortener EN url is being replaced with the AZ url title - https://phabricator.wikimedia.org/T382717 [21:24:13] T381048: Add Azerbaijani namespaces to WMF deployed extensions - https://phabricator.wikimedia.org/T381048 [21:25:33] thanks zabe [21:26:05] it keeps surprising me how quickly merges in extensions repos happen now [21:26:29] i hope we get that for core too, soon [21:27:43] this will take a while [21:27:58] the patch causes the localization cache to be rebuilt [21:29:18] well, we can't have everything [21:34:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422217 (10phaultfinder) [21:46:38] is it still building? [21:47:59] its currently syncing to the test hosts [21:48:22] meh [21:48:23] 21:48:00 K8s deployment to stage testservers failed: K8s Deployment had the following errors: [21:48:23] codfw: Deployment of mw-debug-next failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=next', 'apply']' returned non-zero exit status 1. [21:48:24] Deployment of mw-debug-pinkunicorn failed: Command '['helmfile', '-e', 'codfw', '--selector', 'name=pinkunicorn', 'apply']' returned non-zero exit status 1. [21:48:25] 21:48:00 Rolling back to prior state... 
[21:49:59] uh oh
[21:53:04] !log zabe@deploy2002 Started scap sync-world: T382717
[21:53:09] T382717: UrlShortener EN url is being replaced with the AZ url title - https://phabricator.wikimedia.org/T382717
[21:54:05] !log zabe@deploy2002 scap failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.44.0-wmf.8 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discovery.wmnet/restricted/media
[21:54:06] wiki-multiversion-debug --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.134.0) (duration: 01m 01s)
[21:55:14] ehm
[21:55:40] !log zabe@deploy2002 Started scap sync-world: T382717
[21:56:23] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705#10422244 (10BCornwall) 05In progress→03Resolved a:03BCornwall This should be fixed now that ms-be2075 is taken out of the ring. Thanks to @MatthewVern...
[22:00:04] Reedy, sbassett, Maryum, and manfredi: Time to do the Weekly Security deployment window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241223T2200).
[22:02:22] ok now it managed to sync test-k8s
[22:03:14] !log zabe@deploy2002 zabe: T382717 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:03:14] !log zabe@deploy2002 Sync cancelled.
[22:03:18] T382717: UrlShortener EN url is being replaced with the AZ url title - https://phabricator.wikimedia.org/T382717
[22:03:35] huh
[22:03:42] !log zabe@deploy2002 Started scap sync-world: T382717
[22:04:40] did you do the classic thing where you accidentally pretyped enter
[22:05:41] !log zabe@deploy2002 zabe: T382717 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:05:52] yeah I think so
[22:05:54] looks good on test servers btw
[22:05:56] now we can test
[22:06:04] !log zabe@deploy2002 zabe: Continuing with sync
[22:06:07] lets see
[22:18:50] !log zabe@deploy2002 Finished scap sync-world: T382717 (duration: 15m 07s)
[22:18:54] T382717: UrlShortener EN url is being replaced with the AZ url title - https://phabricator.wikimedia.org/T382717
[22:18:57] MatmaRex: done :)
[22:19:04] took a bit longer than expected lol
[22:19:18] thank you for deploying
[22:27:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.143s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:30:02] (03PS1) 10BCornwall: Point various parking domains to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667)
[22:32:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.143s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[23:01:47] (03CR) 10Pppery: [C:03+1] Point various parking domains to ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1106344 (https://phabricator.wikimedia.org/T380667) (owner: 10BCornwall)
[23:18:44] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability