[00:00:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 834.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:10:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.032s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:15:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 980.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:31:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 834.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:31:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:36:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 805.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:38:56] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123811
[00:38:56] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123811 (owner: TrainBranchBot)
[00:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:03:40] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123811 (owner: TrainBranchBot)
[01:08:37] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123813
[01:08:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123813 (owner: TrainBranchBot)
[01:16:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:21:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:21:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:41:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 857.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:51:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 838.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:08:50] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:09:04] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:09:14] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:10:54] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53656 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:11:06] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:11:42] (PS1) Subramanya Sastry: Revert "Turn on Parsoid fragment support everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123815
[02:15:25] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123813 (owner: TrainBranchBot)
[02:16:44] (PS2) Subramanya Sastry: Revert "Turn on Parsoid fragment support everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608)
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:58] PROBLEM - BFD status on cr1-magru is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:45:58] RECOVERY - BFD status on cr1-magru is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:51:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:18:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387692 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[03:18:35] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387692 (ops-monitoring-bot) NEW
[03:21:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:41:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:50:22] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123802 (https://phabricator.wikimedia.org/T386719) (owner: Sbisson)
[03:57:51] (CR) KartikMistry: [C:+1] Enable CX unified dashboard on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1123802 (https://phabricator.wikimedia.org/T386719) (owner: Sbisson)
[04:11:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:31:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:01:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:03:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:03:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:11:22] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:11:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:15:58] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:17:14] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:37:30] FIRING: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:41:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:42:30] RESOLVED: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:03:15] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:08:15] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:09:15] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.286s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:18:30] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.22s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:24:35] (PS1) Kevin Bazira: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1123938 (https://phabricator.wikimedia.org/T387547)
[06:34:45] FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.22s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:38:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 872.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:41:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:55:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 922.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:05:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 847ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:08:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 896.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:10:29] (CR) Brouberol: [C:+2] airflow: inject the AIRFLOW_APPOWNER environment variable in all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1123524 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[07:13:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 802ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3586 MB (3% inode=98%): /tmp 3586 MB (3% inode=98%): /var/tmp 3586 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[07:15:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 882.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:16:20] (PS1) KartikMistry: Update cxserver to 2025-03-03-041049-production [deployment-charts] - https://gerrit.wikimedia.org/r/1123940 (https://phabricator.wikimedia.org/T369815)
[07:18:30] !log installing Linux 6.1.128 on Bookworm hosts
[07:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 821.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:21:01] SRE, DBA: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10594869 (Marostegui) a:Marostegui
[07:22:28] SRE, DBA: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10594871 (Marostegui) Same problem as always: ` ------------------------------------------------------------------------------- Record: 16 Date/Time: 03/02/2025 20:45:55 Source: system Severity: Critical...
[07:26:27] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10594873 (Marostegui) Same issue as: T359940 T361968 T363119 T374215 I am going to write to the Dell thread.
[07:30:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.198s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:30:48] (PS1) Marostegui: db1246: Update notes [puppet] - https://gerrit.wikimedia.org/r/1124031 (https://phabricator.wikimedia.org/T387673)
[07:32:09] (CR) Marostegui: [C:+2] db1246: Update notes [puppet] - https://gerrit.wikimedia.org/r/1124031 (https://phabricator.wikimedia.org/T387673) (owner: Marostegui)
[07:32:23] ops-eqiad, SRE, DBA, DC-Ops, Patch-For-Review: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10594894 (Marostegui) p:Triage→Medium
[07:33:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1233', diff saved to https://phabricator.wikimedia.org/P73923 and previous config saved to /var/cache/conftool/dbconfig/20250303-073358-root.json
[07:35:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.295s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:38:03] (CR) Nikerabbit: [C:-1] metawiki: Enable Chinese variant translation for message bundles (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: Abijeet Patro)
[07:40:15] RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.295s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:41:16] (PS1) Marostegui: db1246: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124036 (https://phabricator.wikimedia.org/T387673)
[07:42:30] (CR) Marostegui: [C:+2] db1246: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124036 (https://phabricator.wikimedia.org/T387673) (owner: Marostegui)
[07:45:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:45:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2164,2186].codfw.wmnet,db1172.eqiad.wmnet with reason: Rebuilding indexes
[07:45:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1172 db2164', diff saved to https://phabricator.wikimedia.org/P73925 and previous config saved to /var/cache/conftool/dbconfig/20250303-074525-marostegui.json
[07:45:35] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1172.eqiad.wmnet
[07:45:43] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2164.codfw.wmnet
[07:46:07] !log T387658 Ran mwscript-k8s --comment="T387658" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bawiki --logwiki=metawiki 'Əkrəm Cəfər' 'Əkrəm'
[07:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:13] T387658: Unblock stuck global rename of Əkrəm - https://phabricator.wikimedia.org/T387658
[07:48:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2210 db1190', diff saved to https://phabricator.wikimedia.org/P73926 and previous config saved to /var/cache/conftool/dbconfig/20250303-074804-marostegui.json
[07:48:39] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1190.eqiad.wmnet
[07:48:44] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2210.codfw.wmnet
[07:49:43] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1123665 (owner: Muehlenhoff)
[07:50:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1027.eqiad.wmnet to cluster eqiad and group C
[07:50:16] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:51:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1027.eqiad.wmnet to cluster eqiad and group C
[07:52:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1172.eqiad.wmnet
[07:52:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2164.codfw.wmnet
[07:52:37] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Index rebuild
[07:52:49] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Index rebuild
[07:53:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2210.codfw.wmnet
[07:53:41] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2210.codfw.wmnet with reason: Index rebuild
[07:55:19] (CR) Jelto: [C:+1] "lgtm" [debs/helmfile] - https://gerrit.wikimedia.org/r/1123593 (https://phabricator.wikimedia.org/T387376) (owner: JMeybohm)
[07:55:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1190.eqiad.wmnet
[08:00:06] Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T0800).
[08:00:06] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:01:14] Only my patch, I'll go ahead in a minute.
[08:03:19] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123802 (https://phabricator.wikimedia.org/T386719) (owner: Sbisson)
[08:03:59] (Merged) jenkins-bot: Enable CX unified dashboard on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1123802 (https://phabricator.wikimedia.org/T386719) (owner: Sbisson)
[08:04:23] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1123802|Enable CX unified dashboard on sqwiki (T386719)]]
[08:04:26] T386719: Deploy unified dashboard - https://phabricator.wikimedia.org/T386719
[08:08:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Index rebuild
[08:16:22] !log kartik@deploy2002 sbisson, kartik: Backport for [[gerrit:1123802|Enable CX unified dashboard on sqwiki (T386719)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:16:25] T386719: Deploy unified dashboard - https://phabricator.wikimedia.org/T386719
[08:18:21] SRE, SRE-Access-Requests: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614#10594978 (MoritzMuehlenhoff) Requests to the wmf LDAP group are handled within Wikimedia IDM: Can you please log into https://idm.wikimedia.org and request the group by following the steps listed at...
[08:19:19] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for chuckonwumelu - https://phabricator.wikimedia.org/T387627#10594981 (MoritzMuehlenhoff) Requests to the wmf LDAP group are handled within Wikimedia IDM: Can you please log into https://idm.wikimedia.org and request the group by following the steps l...
[08:20:49] !log kartik@deploy2002 sbisson, kartik: Continuing with sync
[08:23:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:29:56] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123802|Enable CX unified dashboard on sqwiki (T386719)]] (duration: 25m 32s)
[08:29:59] T386719: Deploy unified dashboard - https://phabricator.wikimedia.org/T386719
[08:30:35] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2166.codfw.wmnet onto db2167.codfw.wmnet
[08:33:35] (CR) Volans: "Some additional thoughts." [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[08:34:51] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3552 MB (3% inode=98%): /tmp 3552 MB (3% inode=98%): /var/tmp 3552 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[08:37:31] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10595023 (Peachey88)
[08:37:32] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387692#10595025 (Peachey88) →Duplicate dup:T382984
[08:40:38] (PS1) Muehlenhoff: Add mszabo to analytics-privatedata-users and record new Kerberos access [puppet] - https://gerrit.wikimedia.org/r/1124038 (https://phabricator.wikimedia.org/T386918)
[08:40:39] (PS1) Vgutierrez: hiera,prometheus: Enable IPIP on jobrunner@codfw [puppet] - https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295)
[08:40:40] (PS1) Vgutierrez: hiera,prometheus: Enable IPIP on jobrunner@eqiad [puppet] - https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295)
[08:41:38] (PS2) Elukey: kserve: allow Prometheus metrics to be fetched [deployment-charts] - https://gerrit.wikimedia.org/r/1123692 (https://phabricator.wikimedia.org/T387580)
[08:42:17] (PS1) Federico Ceratto: clone.py: Fix fqdn variables [cookbooks] - https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023)
[08:42:37] (CR) Marostegui: [C:+1] clone.py: Fix fqdn variables [cookbooks] - https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[08:43:15] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:43:39] (PS1) Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1124042
[08:43:46] (CR) CI reject: [V:-1] airflow: define an hadoop-shell deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1124042 (owner: Brouberol)
[08:44:35] (CR) Muehlenhoff: [C:+2] Add mszabo to analytics-privatedata-users and record new Kerberos access [puppet] - https://gerrit.wikimedia.org/r/1124038 (https://phabricator.wikimedia.org/T386918) (owner: Muehlenhoff)
[08:44:52] (PS2) Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1124042
[08:45:51] (PS3) Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700)
[08:46:46] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) (owner: Vgutierrez)
[08:46:55] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: Vgutierrez)
[08:48:52] (CR) Federico Ceratto: [C:+2] clone.py: Fix fqdn variables [cookbooks] - https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[08:49:07] (CR) Federico Ceratto: [V:+1 C:+2] clone.py: Fix fqdn variables [cookbooks] - https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[08:49:10] (CR) Federico Ceratto: [V:+2 C:+2] clone.py: Fix fqdn variables [cookbooks] - https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[08:49:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:49:25] (CR) Isabelle Hurbain-Palatin: [C:+1] Revert "Turn on Parsoid fragment support everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608) (owner: Subramanya Sastry)
[08:49:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608) (owner: Subramanya Sastry)
[08:50:39] SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918#10595045 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Sorry for the delay! I've just merged a patc...
[08:54:16] RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:55:48] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [08:55:48] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [08:57:15] FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:59:24] (03CR) 10Elukey: [C:03+1] hiera,docker_registry_ha: Enable IPIP on docker-registry@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez) [09:01:12] (03CR) 10JMeybohm: [C:03+2] Remove upgrade checking and notice [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123593 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [09:03:38] (03PS4) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700) [09:05:51] (03CR) 10Vgutierrez: [C:03+2] hiera,docker_registry_ha: Enable IPIP on docker-registry@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez) [09:05:54] (03PS5) 10Brouberol: airflow: define an 
hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700) [09:07:09] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: docker_registry_ha::registry@eqiad [09:09:58] (03PS6) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700) [09:09:59] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1123797 (https://phabricator.wikimedia.org/T385908) (owner: 10Andrew Bogott) [09:10:30] (03PS7) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700) [09:10:30] (03PS1) 10Muehlenhoff: Add Melos to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/1124045 (https://phabricator.wikimedia.org/T386581) [09:10:43] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:11:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [09:11:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: docker_registry_ha::registry@eqiad [09:12:17] (03CR) 10Jelto: [C:03+1] "looks reasonable" [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1123642 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [09:13:30] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Feb-Mar): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10595093 (10Nikerabbit) [09:14:24] (03CR) 10Muehlenhoff: [C:03+2] Add Melos to 
stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/1124045 (https://phabricator.wikimedia.org/T386581) (owner: 10Muehlenhoff) [09:17:37] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Feb-Mar): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10595097 (10Nikerabbit) 05In progress→03Stalled [09:19:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to stewards-users for Melos - https://phabricator.wikimedia.org/T386581#10595101 (10MoritzMuehlenhoff) 05In progress→03Resolved a:03MoritzMuehlenhoff @Melos Sorry for the delay! I've just merged a patch to enable your access. You s... [09:20:44] (03PS2) 10Vgutierrez: hiera,prometheus: Enable IPIP on jobrunner|videoscaler@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) [09:20:46] (03PS2) 10Vgutierrez: hiera,prometheus: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) [09:22:15] RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:22:24] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [09:22:27] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [09:23:09] (03PS1) 10Muehlenhoff: Add ep1c to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/1124047 (https://phabricator.wikimedia.org/T385808) [09:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate 
eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:28:07] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [09:28:54] (03CR) 10Muehlenhoff: [C:03+2] Add ep1c to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/1124047 (https://phabricator.wikimedia.org/T385808) (owner: 10Muehlenhoff) [09:31:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:32:17] (03PS4) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) [09:33:53] (03PS5) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) [09:34:39] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10595124 (10brouberol) [09:35:46] (03CR) 10Elukey: [C:03+1] hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez) [09:37:36] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10595139 (10MoritzMuehlenhoff) 05In progress→03Resolved a:03MoritzMuehlenhoff @EPIC Sorry for the delay! I've just merged a patch to enable your access. You should now be able to SS... 
[09:38:22] (03CR) 10Vgutierrez: [C:03+2] hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez) [09:38:37] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: docker_registry_ha::registry@codfw [09:43:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10595179 (10MoritzMuehlenhoff) [09:43:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1027.eqiad.wmnet to cluster eqiad and group A [09:43:32] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1027.eqiad.wmnet to cluster eqiad and group A [09:43:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1030.eqiad.wmnet to cluster eqiad and group A [09:44:02] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:44:29] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:44:44] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1030.eqiad.wmnet to cluster eqiad and group A [09:45:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:45:20] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707 (10MatthewVernon) 03NEW [09:45:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10595196 (10MatthewVernon) p:05Triage→03High [09:45:28] !log vgutierrez@cumin1002 END (PASS) - Cookbook 
sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [09:45:28] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: docker_registry_ha::registry@codfw [09:46:50] !log mvernon@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ms-be1080.eqiad.wmnet with reason: disk failed [09:46:56] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10595201 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=951f5a09-1cfc-43ed-af34-bcbbe604524f) set by mvernon@cumin1002 for 7 days, 0:00:00 on 1 host(s) and thei... [09:47:06] (03PS1) 10Elukey: knative-serving: add a variable in the templates for the Prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124050 (https://phabricator.wikimedia.org/T387580) [09:47:19] RECOVERY - Disk space on ms-be1080 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1080&var-datasource=eqiad+prometheus/ops [09:47:23] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:48:07] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:49:01] (03PS1) 10Giuseppe Lavagetto: Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051 [09:54:43] (03PS1) 10Elukey: admin_ng: set a different Prometheus port for knative in 
ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124052 (https://phabricator.wikimedia.org/T387580) [09:54:56] (03CR) 10Effie Mouzeli: [C:03+1] mw-api-int: serve 10% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123700 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [09:57:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [09:59:05] (03PS2) 10Giuseppe Lavagetto: Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051 [10:01:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [10:05:27] I am going to cut a new release of `scap` which I need to finish T303828 [10:05:28] T303828: Delete wmf branches from Gerrit repositories - https://phabricator.wikimedia.org/T303828 [10:06:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73930 and previous config saved to /var/cache/conftool/dbconfig/20250303-100603-root.json [10:11:38] (03CR) 10Effie Mouzeli: [C:03+1] Enroll 100% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [10:11:44] RESOLVED: [3x] 
CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [10:16:34] (03PS3) 10Vgutierrez: hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) [10:16:34] (03PS3) 10Vgutierrez: hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) [10:19:10] (03PS1) 10Ladsgroup: labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589) [10:19:12] jouncebot: now [10:19:13] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [10:19:19] I am upgrading scap [10:19:53] (03CR) 10CI reject: [V:04-1] labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:20:35] (03PS2) 10Ladsgroup: labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589) [10:20:45] FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [10:21:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73931 and previous config saved to /var/cache/conftool/dbconfig/20250303-102109-root.json 
[10:21:44] !log hashar@deploy2002 Installing scap version "4.139.0" for 204 host(s) [10:21:45] (03PS3) 10Giuseppe Lavagetto: Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051 [10:21:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:22:24] (03CR) 10Effie Mouzeli: [C:03+1] "woohoo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [10:23:05] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10595416 (10brouberol) a:03BTullis [10:23:21] hashar: regarding your scap deploy, I'm merging and rebasing this beta cluster patch. Not deploying it so it shouldn't affect you, please tell if I need to stop https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124055 [10:23:33] (03CR) 10Effie Mouzeli: [C:03+1] "Makes sense, thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [10:24:01] if it is only for beta, in prod scap is nowadays smart enough to skip the patch :) [10:24:21] should be fine anyway, I am updating scap itself [10:24:39] and I am not sure how it is upgraded on beta [10:24:45] I guessed, just wanted to be sure [10:24:56] for beta, I know, it's not in a rush though [10:25:03] (03CR) 10Ladsgroup: [C:03+2] labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:25:43] (03Merged) 10jenkins-bot: labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:26:03] done, rebased on deploy200x [10:26:15] !log hashar@deploy2002 Installation of scap version "4.139.0" completed for 204 hosts [10:26:42] !log Upgraded scap to 4.139.0 # T303828 [10:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:44] T303828: Delete wmf branches from Gerrit repositories - https://phabricator.wikimedia.org/T303828 [10:27:44] (03PS1) 10Marostegui: check_depooled.sh: Add pc1-pc7 [software] - 10https://gerrit.wikimedia.org/r/1124058 [10:28:11] (03CR) 10Marostegui: "This is a noop" [software] - 10https://gerrit.wikimedia.org/r/1124058 (owner: 10Marostegui) [10:28:13] (03CR) 10Marostegui: [C:03+2] check_depooled.sh: Add pc1-pc7 [software] - 10https://gerrit.wikimedia.org/r/1124058 (owner: 10Marostegui) [10:28:22] (03PS12) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [10:28:32] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db1248.eqiad.wmnet onto db1252.eqiad.wmnet [10:28:40] (03Merged) 10jenkins-bot: check_depooled.sh: Add 
pc1-pc7 [software] - 10https://gerrit.wikimedia.org/r/1124058 (owner: 10Marostegui) [10:34:53] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1248.eqiad.wmnet onto db1252.eqiad.wmnet [10:36:44] (03PS3) 10Lucas Werkmeister (WMDE): Enable fixed Wikibase RDF everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344) [10:36:44] (03PS3) 10Lucas Werkmeister (WMDE): Remove Wikibase fixed RDF feature flag again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344) [10:37:04] (03PS1) 10Vgutierrez: hiera,restbase: Enable IPIP on restbase-(backend|https)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124061 (https://phabricator.wikimedia.org/T387299) [10:37:07] (03PS1) 10Vgutierrez: hiera,restbase: Enable IPIP on restbase-(backend|https)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299) [10:37:10] (03CR) 10Lucas Werkmeister (WMDE): "Okay to deploy later today." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [10:37:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [10:37:44] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124061 (https://phabricator.wikimedia.org/T387299) (owner: 10Vgutierrez) [10:37:53] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299) (owner: 10Vgutierrez) [10:38:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73933 and previous config saved to /var/cache/conftool/dbconfig/20250303-103820-root.json [10:40:10] !log ayounsi@cumin1002 START - Cookbook sre.network.cf [10:40:11] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [10:40:16] (03CR) 10Klausman: [C:03+1] admin_ng: set a different Prometheus port for knative in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124052 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [10:40:54] (03CR) 10Klausman: [C:03+1] knative-serving: add a variable in the templates for the Prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124050 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [10:41:05] (03CR) 10Klausman: [C:03+1] kserve: allow Prometheus metrics to be fetched [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123692 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [10:42:07] (03CR) 10Elukey: [C:03+2] kserve: allow Prometheus metrics to be fetched 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1123692 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [10:42:13] (03CR) 10Elukey: [C:03+2] knative-serving: add a variable in the templates for the Prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124050 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [10:42:18] (03CR) 10Elukey: [C:03+2] admin_ng: set a different Prometheus port for knative in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124052 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [10:43:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [10:44:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73934 and previous config saved to /var/cache/conftool/dbconfig/20250303-104438-root.json [10:46:55] (03CR) 10Effie Mouzeli: [C:03+1] shellbox-media: serve 1/8 of requests on 8.1 with more logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [10:48:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [10:49:10] (03CR) 10Hnowlan: [C:03+1] 
hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [10:49:16] (03CR) 10Hnowlan: [C:03+1] hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [10:50:56] (03CR) 10Ladsgroup: [C:04-2] Add config needed to re-architecture mainstash away from x2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [10:51:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1233.eqiad.wmnet onto db1246.eqiad.wmnet [10:51:44] PROBLEM - MariaDB Replica Lag: s8 #page on db2166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3650.59 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:52:21] !incidents [10:52:21] 5708 (UNACKED) db2166 (paged)/MariaDB Replica Lag: s8 (paged) [10:52:21] 5707 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [10:52:21] 5706 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [10:52:35] !ack 5708 [10:52:36] 5708 (ACKED) db2166 (paged)/MariaDB Replica Lag: s8 (paged) [10:52:38] (03PS1) 10Vgutierrez: hiera,analytics_cluster: Enable IPIP on datahubsearch@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124064 (https://phabricator.wikimedia.org/T387306) [10:52:43] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[10:53:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73935 and previous config saved to /var/cache/conftool/dbconfig/20250303-105325-root.json [10:53:41] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104 [10:54:30] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124064 (https://phabricator.wikimedia.org/T387306) (owner: 10Vgutierrez) [10:54:37] This one doesn't seem to have recent maintenance happening [10:54:48] no one logged in recently [10:54:58] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:55:15] but something happened at 08:30 [10:55:31] see also -data-persistence [10:55:48] lagging for 2h 23m 49s [10:55:51] and yes federico executed a cookbook today at 8:30 utc [10:55:57] "START - Cookbook sre.mysql.clone of db2166.codfw.wmnet onto db2167.codfw.wmnet" [10:56:30] ^ federico3 [10:56:38] ah sorry, that's 2167 not 2166 [10:56:55] or both [10:57:08] Looks like both [10:57:55] they've been doing cloning and 66 is still catching up [10:58:27] was it pooled? 
[10:58:28] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2166 - catching up replication [10:58:33] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2166 - catching up replication [10:58:54] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [10:58:57] the lag is going down https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104&viewPanel=6 on db2166 [10:59:03] (03PS1) 10Elukey: kserve: fix chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124065 (https://phabricator.wikimedia.org/T387580) [10:59:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73936 and previous config saved to /var/cache/conftool/dbconfig/20250303-105943-root.json [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1100) [11:00:40] lag is at ~8 minutes and going down fast [11:00:55] 3 minutes [11:01:14] (03CR) 10Vgutierrez: [C:03+2] hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [11:01:27] db2166 was not pooled (yet) [11:01:35] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: mediawiki::jobrunner@codfw [11:01:37] ok, then no worries [11:01:44] RECOVERY - MariaDB Replica Lag: s8 #page on db2166 is OK: OK slave_sql_lag Replication lag: 0.00 
seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:01:52] it was probably a downtime expiration or something like that [11:01:58] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10595538 (10MoritzMuehlenhoff) [11:02:15] (03CR) 10Hnowlan: [C:03+1] hiera,restbase: Enable IPIP on restbase-(backend|https)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124061 (https://phabricator.wikimedia.org/T387299) (owner: 10Vgutierrez) [11:02:20] (03CR) 10Hnowlan: [C:03+1] hiera,restbase: Enable IPIP on restbase-(backend|https)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299) (owner: 10Vgutierrez) [11:02:38] wherever it was setup, it has to be extended, or acked + removed [11:02:39] :) thanks for the help, I'm a bit surprised by the pag.e when it's not pooled but yes probably expired downtime (after 2 hours) [11:03:27] jelto: needs to be discussed, but the original reason is that it prevents accidental pool [11:03:46] better to alert before than after [11:03:54] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:04:18] (03CR) 10Elukey: [C:03+2] kserve: fix chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124065 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [11:05:22] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:05:51] RECOVERY - PyBal backends 
health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:07:19] (03CR) 10Aklapper: "I don't have permissions to +2 this one." [puppet] - 10https://gerrit.wikimedia.org/r/1101481 (https://phabricator.wikimedia.org/T309222) (owner: 10Aklapper) [11:08:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73937 and previous config saved to /var/cache/conftool/dbconfig/20250303-110830-root.json [11:08:54] FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:09:28] (03CR) 10JMeybohm: [V:03+2 C:03+2] Update to new upstream version 3.10.0 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1123642 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [11:09:35] FIRING: [2x] SystemdUnitFailed: prometheus_ferm_mss.service on mw2278:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:57] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:10:13] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[11:11:47] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:11:51] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:11:51] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: mediawiki::jobrunner@codfw [11:11:55] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [11:12:09] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [11:13:54] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:14:41] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:14:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73938 and previous config saved to /var/cache/conftool/dbconfig/20250303-111448-root.json [11:17:05] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [11:17:22] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[11:18:50] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2166.codfw.wmnet onto db2167.codfw.wmnet [11:18:54] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:20:38] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918#10595558 (10mszabo) Thanks! [11:20:57] (03CR) 10AikoChou: [C:03+1] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123938 (https://phabricator.wikimedia.org/T387547) (owner: 10Kevin Bazira) [11:23:54] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:25:43] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:25:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73939 and previous config saved to /var/cache/conftool/dbconfig/20250303-112548-root.json [11:27:33] (03PS1) 10Vgutierrez: prometheus: Support py3 from buster on 
mss-ferm [puppet] - 10https://gerrit.wikimedia.org/r/1124068 (https://phabricator.wikimedia.org/T387295) [11:28:30] (03CR) 10Hnowlan: [C:03+1] mw-(api-ext|web): scale next to 40% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [11:28:54] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1081-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:29:26] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10595564 (10Marostegui) db1246 has been cloned. I will repool it tomorrow. I am sure it will sooner or later crash again, but we need to see if it is again the same HW error. 
[11:29:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73940 and previous config saved to /var/cache/conftool/dbconfig/20250303-112954-root.json [11:30:06] (03CR) 10Hnowlan: [C:03+1] prometheus: Support py3 from buster on mss-ferm [puppet] - 10https://gerrit.wikimedia.org/r/1124068 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [11:30:18] (03CR) 10Clément Goubert: [C:03+1] Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051 (owner: 10Giuseppe Lavagetto) [11:30:34] (03PS1) 10Muehlenhoff: Add harroyo-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1124069 (https://phabricator.wikimedia.org/T386922) [11:31:05] (03CR) 10Fabfur: [C:03+1] prometheus: Support py3 from buster on mss-ferm [puppet] - 10https://gerrit.wikimedia.org/r/1124068 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [11:31:34] (03CR) 10Vgutierrez: [C:03+2] prometheus: Support py3 from buster on mss-ferm [puppet] - 10https://gerrit.wikimedia.org/r/1124068 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [11:32:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2206 db1249', diff saved to https://phabricator.wikimedia.org/P73941 and previous config saved to /var/cache/conftool/dbconfig/20250303-113225-root.json [11:32:43] (03PS6) 10Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 [11:32:43] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2206.codfw.wmnet [11:32:43] (03PS6) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [11:32:43] (03PS5) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [11:32:44] 
(03PS6) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [11:32:50] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1249.eqiad.wmnet [11:33:03] (03CR) 10CI reject: [V:04-1] Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto) [11:33:10] (03CR) 10CI reject: [V:04-1] Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 (owner: 10Giuseppe Lavagetto) [11:33:25] (03CR) 10CI reject: [V:04-1] mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [11:33:43] (03CR) 10CI reject: [V:04-1] mediawiki: introduce feature flags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [11:35:42] (03CR) 10Vgutierrez: [C:03+2] hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [11:35:48] (03PS4) 10Vgutierrez: hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) [11:36:28] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10595587 (10MatthewVernon) @Jhancock.wm So this system has had new backplane and controller cards fitted? From comments on this ticket it looks like maybe controller cards have b... 
[11:36:41] (03PS13) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [11:37:17] (03CR) 10Vgutierrez: [C:03+2] hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez) [11:37:23] FIRING: [2x] SystemdUnitFailed: prometheus_ferm_mss.service on mw2278:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:37:28] (03CR) 10Muehlenhoff: [C:03+2] Add harroyo-wmf to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1124069 (https://phabricator.wikimedia.org/T386922) (owner: 10Muehlenhoff) [11:37:35] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: mediawiki::jobrunner@eqiad [11:38:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2206.codfw.wmnet [11:38:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1249.eqiad.wmnet [11:38:52] (03PS14) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [11:38:54] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:40:15] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to Dashboards in Superset for 
harroyo-wmf - https://phabricator.wikimedia.org/T386922#10595594 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Sorry for the delay! I've just merged a patch to... [11:40:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73942 and previous config saved to /var/cache/conftool/dbconfig/20250303-114054-root.json [11:41:32] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:41:39] (03PS15) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [11:42:23] RESOLVED: [2x] SystemdUnitFailed: prometheus_ferm_mss.service on mw2278:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:42:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:42:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:42:40] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:42:40] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: mediawiki::jobrunner@eqiad [11:42:45] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2166.codfw.wmnet onto db2167.codfw.wmnet [11:43:07] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2166.codfw.wmnet onto 
db2167.codfw.wmnet [11:44:06] ^^ that BGP alert was the pybal restart on lvs1019-lvs1020, should recover soon [11:44:44] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123938 (https://phabricator.wikimedia.org/T387547) (owner: 10Kevin Bazira) [11:45:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73943 and previous config saved to /var/cache/conftool/dbconfig/20250303-114500-root.json [11:45:45] (03PS16) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [11:45:59] (03Merged) 10jenkins-bot: ml-services: update article-country image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123938 (https://phabricator.wikimedia.org/T387547) (owner: 10Kevin Bazira) [11:48:01] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2166.codfw.wmnet onto db2167.codfw.wmnet [11:48:19] !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply [11:48:35] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1123005 (https://phabricator.wikimedia.org/T387305) (owner: 10Vgutierrez) [11:48:54] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2061-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:48:59] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1249.eqiad.wmnet with reason: Index rebuild [11:49:11] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2206.codfw.wmnet with reason: Index rebuild [11:49:43] !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply [11:50:13] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:50:13] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:52:32] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [11:52:46] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: restbase::production@codfw [11:52:52] (03CR) 10Vgutierrez: [C:03+2] hiera,restbase: Enable IPIP on restbase-(backend|https)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124061 (https://phabricator.wikimedia.org/T387299) (owner: 10Vgutierrez) [11:52:58] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[11:53:54] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2061-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:56:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73944 and previous config saved to /var/cache/conftool/dbconfig/20250303-115559-root.json [11:56:25] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124075 [11:56:40] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124075 (owner: 10PipelineBot) [11:56:50] (03CR) 10Jgiannelos: [V:03+2 C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124075 (owner: 10PipelineBot) [11:56:52] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:57:02] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:57:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10595631 (10MoritzMuehlenhoff) @Ben.buchenau It doesn't appear you have created a Wikimedia Developer Account yet? At least I'm unable to find an account linked to ben.bu... 
[11:57:54] (03PS1) 10Jcrespo: dbbackups: Setup temporary archival job to archive RT database [puppet] - 10https://gerrit.wikimedia.org/r/1124076 (https://phabricator.wikimedia.org/T385901) [11:58:20] (03CR) 10CI reject: [V:04-1] dbbackups: Setup temporary archival job to archive RT database [puppet] - 10https://gerrit.wikimedia.org/r/1124076 (https://phabricator.wikimedia.org/T385901) (owner: 10Jcrespo) [11:58:41] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:58:54] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:59:10] (03PS2) 10Jcrespo: dbbackups: Setup temporary archival job to archive RT database [puppet] - 10https://gerrit.wikimedia.org/r/1124076 (https://phabricator.wikimedia.org/T385901) [11:59:39] !log Imported helmfile 0.171.0-2 and helm-diff 3.10.0-1 to bullseye-wikimedia and bookworm-wikimedia - T341984 T387376 [11:59:40] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [11:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:44] T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984 [11:59:44] T387376: Respect kubeVersion constraints in deployment-charts CI - https://phabricator.wikimedia.org/T387376 [12:00:06] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:00:24] (03CR) 10D3r1ck01: [C:03+1] Set $wgCentralAuthSharedDomainCallback [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza) [12:00:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [12:00:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: restbase::production@codfw [12:00:56] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:01:27] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:01:34] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:02:08] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:02:34] (03CR) 10Vgutierrez: [C:03+2] hiera,wmcs: Enable IPIP on labweb-ssl@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123005 (https://phabricator.wikimedia.org/T387305) (owner: 10Vgutierrez) [12:02:42] (03CR) 10Jcrespo: [C:03+2] dbbackups: Setup temporary archival job to archive RT database [puppet] - 10https://gerrit.wikimedia.org/r/1124076 (https://phabricator.wikimedia.org/T385901) (owner: 10Jcrespo) [12:03:09] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wmcs::openstack::eqiad1::cloudweb@eqiad [12:07:29] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [12:08:25] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:08:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:08:37] !log vgutierrez@cumin1002 END (PASS) - 
Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [12:08:37] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wmcs::openstack::eqiad1::cloudweb@eqiad [12:09:25] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: restbase::production@eqiad [12:09:41] (03PS2) 10Vgutierrez: hiera,restbase: Enable IPIP on restbase-(backend|https)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299) [12:10:55] (03CR) 10Vgutierrez: [C:03+2] hiera,restbase: Enable IPIP on restbase-(backend|https)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299) (owner: 10Vgutierrez) [12:11:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73945 and previous config saved to /var/cache/conftool/dbconfig/20250303-121104-root.json [12:11:28] (03PS1) 10Jcrespo: Revert "dbbackups: Setup temporary archival job to archive RT database" [puppet] - 10https://gerrit.wikimedia.org/r/1124082 [12:13:50] (03CR) 10Marostegui: clone.py, clone_test.py: Automate cloning (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [12:13:54] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:16:21] (03CR) 10Clément Goubert: php: Allow provisioning MediaWiki with PHP 8.1 (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [12:16:23] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [12:16:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1237 with weight 0 T387557', diff saved to https://phabricator.wikimedia.org/P73946 and previous config saved to /var/cache/conftool/dbconfig/20250303-121623-root.json [12:16:27] T387557: Switchover x1 master (db1220 -> db1237) - https://phabricator.wikimedia.org/T387557 [12:16:28] (03CR) 10Jcrespo: [C:03+2] Revert "dbbackups: Setup temporary archival job to archive RT database" [puppet] - 10https://gerrit.wikimedia.org/r/1124082 (owner: 10Jcrespo) [12:16:40] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: Primary switchover x1 T387557 [12:17:22] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1237 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/1123615 (https://phabricator.wikimedia.org/T387557) (owner: 10Gerrit maintenance bot) [12:17:30] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [12:17:30] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: restbase::production@eqiad [12:17:38] (03PS17) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [12:18:54] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:19:09] FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:19:40] stevemunene, gehel ^^ [12:22:20] vgutierrez: re CirrusSearchNodeIndexingNotIncreasing: working on a fix [12:22:47] !log Starting x1 eqiad failover from db1220 to db1237 - T387557 [12:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:49] T387557: Switchover x1 master (db1220 -> db1237) - https://phabricator.wikimedia.org/T387557 [12:23:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1237 to x1 primary T387557', diff saved to https://phabricator.wikimedia.org/P73947 and previous config saved to /var/cache/conftool/dbconfig/20250303-122304-root.json [12:23:54] FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:23:59] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [12:24:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1220 T387557', diff saved to 
https://phabricator.wikimedia.org/P73948 and previous config saved to /var/cache/conftool/dbconfig/20250303-122437-marostegui.json [12:25:52] (03PS4) 10Clément Goubert: Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051 (https://phabricator.wikimedia.org/T387208) (owner: 10Giuseppe Lavagetto) [12:26:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73949 and previous config saved to /var/cache/conftool/dbconfig/20250303-122609-root.json [12:28:54] FIRING: [11x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:29:30] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1220.eqiad.wmnet [12:33:54] FIRING: [13x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:33:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1220.eqiad.wmnet [12:36:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73950 and previous config saved to /var/cache/conftool/dbconfig/20250303-123651-root.json [12:38:54] FIRING: [12x] 
CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1065-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:40:15] (03CR) 10Kamila Součková: [C:04-1] "I need to fix consumer_group" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková) [12:40:31] PROBLEM - Disk space on Hadoop worker on an-worker1098 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 26 GB (1% inode=99%): /var/lib/hadoop/data/d 22 GB (1% inode=99%): /var/lib/hadoop/data/e 26 GB (1% inode=99%): /var/lib/hadoop/data/f 16 GB (0% inode=99%): /var/lib/hadoop/data/g 26 GB (1% inode=99%): /var/lib/hadoop/data/h 21 GB (1% inode=99%): /var/lib/hadoop/data/i 22 GB (1% inode=99%): /var/lib/hadoop/data/j 26 GB (1 [12:40:31] 99%): /var/lib/hadoop/data/l 25 GB (1% inode=99%): /var/lib/hadoop/data/k 27 GB (1% inode=99%): /var/lib/hadoop/data/m 26 GB (1% inode=99%): /var/lib/hadoop/data/n 26 GB (1% inode=99%): /var/lib/hadoop/data/o 26 GB (1% inode=99%): /var/lib/hadoop/data/p 26 GB (1% inode=99%): /var/lib/hadoop/data/r 28 GB (1% inode=99%): /var/lib/hadoop/data/q 23 GB (1% inode=99%): /var/lib/hadoop/data/s 26 GB (1% inode=99%): /var/lib/hadoop/data/t 26 GB (1 [12:40:31] 99%): /var/lib/hadoop/data/u 26 GB (1% inode=99%): /var/lib/hadoop/data/v 25 GB (1% inode=99%): /var/lib/hadoop/data/w 26 GB (1% inode=99%): /var/lib/hadoop/data/x 23 GB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [12:45:01] (03PS2) 10Bartosz Dziewoński: Remove $wmgUseGraphWithJsonNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122709 
(https://phabricator.wikimedia.org/T124748) [12:45:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza) [12:46:53] (03CR) 10Clément Goubert: [C:03+2] mediawiki: CronJob name as Job label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122563 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [12:48:54] FIRING: [11x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1065-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:50:54] (03PS1) 10Muehlenhoff: Add migr to analytics-privatedata-users (plus Kerberos) [puppet] - 10https://gerrit.wikimedia.org/r/1124091 (https://phabricator.wikimedia.org/T387114) [12:51:51] (03Merged) 10jenkins-bot: mediawiki: CronJob name as Job label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122563 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [12:51:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73951 and previous config saved to /var/cache/conftool/dbconfig/20250303-125156-root.json [12:52:10] thanks dcausse ! [12:53:41] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[12:53:54] FIRING: [9x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [12:55:22] (03CR) 10Muehlenhoff: [C:03+2] Add migr to analytics-privatedata-users (plus Kerberos) [puppet] - 10https://gerrit.wikimedia.org/r/1124091 (https://phabricator.wikimedia.org/T387114) (owner: 10Muehlenhoff) [12:55:47] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:55:56] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:56:03] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:58:38] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:58:44] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:58:47] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to Analytics-Cluster + Kerberos for @Michael - https://phabricator.wikimedia.org/T387114#10595785 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Sorry for the delay! I've just merged a patch to enable your access (it... 
[12:59:09] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:00:14] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to Analytics-Cluster + Kerberos for @Michael - https://phabricator.wikimedia.org/T387114#10595788 (10Michael) >>! In T387114#10595786, @MoritzMuehlenhoff wrote: > Sorry for the delay! I've just merged a patch to enable your access (it ta... [13:01:12] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:01:13] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:02:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73952 and previous config saved to /var/cache/conftool/dbconfig/20250303-130247-root.json [13:03:54] RESOLVED: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:04:18] 10ops-eqiad, 06SRE, 10Ceph, 06cloud-services-team, and 2 others: evaluate new drives in cloudcephosd102[123] - https://phabricator.wikimedia.org/T386725#10595793 (10dcaro) I did a first check of the current values for the smartcl reported counters, all look good so far (no more Offline_Uncorrectable_Errors... 
[13:06:07] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [13:06:31] RECOVERY - Disk space on Hadoop worker on an-worker1098 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:07:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73953 and previous config saved to /var/cache/conftool/dbconfig/20250303-130702-root.json [13:07:21] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:07:39] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:10:25] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:10:37] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:10:54] (03PS5) 10Kamila Součková: benthos: add input/output config to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) [13:11:54] (03CR) 10Kamila Součková: benthos: add input/output config to chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková) [13:12:42] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:12:49] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:13:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73954 and previous config saved to /var/cache/conftool/dbconfig/20250303-131329-root.json [13:14:19] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [13:14:40] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:15:16] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[13:15:45] RESOLVED: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [13:16:30] (03PS4) 10Clément Goubert: mediawiki: Change mwcron default concurrency policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116800 [13:17:30] !log undid arbcom_ruwiki block of CirrusSearch_Streaming_Updater via blockUser.php [13:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73955 and previous config saved to /var/cache/conftool/dbconfig/20250303-131752-root.json [13:18:52] (03Abandoned) 10Samtar: IS: Enable wgUseCodexSpecialBlock on prod test.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1114324 (https://phabricator.wikimedia.org/T377121) (owner: 10Samtar) [13:19:15] (03PS1) 10JMeybohm: Build with CGO disabled, remove libc dependency [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1124096 (https://phabricator.wikimedia.org/T341984) [13:22:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73956 and previous config saved to /var/cache/conftool/dbconfig/20250303-132207-root.json [13:22:22] (03CR) 10Jelto: [C:03+1] "lgtm" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1124097 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:22:29] (03CR) 10Jelto: [C:03+1] "lgtm" [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1124096 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:22:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: 
cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:24:22] !log failover Ganeti master in eqiad to ganeti1048 T382507 [13:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:25] T382507: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507 [13:24:31] (03CR) 10JMeybohm: [V:03+2 C:03+2] Build with CGO disabled, remove libc dependency [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1124097 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:24:50] (03CR) 10JMeybohm: [V:03+2 C:03+2] Build with CGO disabled, remove libc dependency [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1124096 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [13:25:42] (03PS7) 10Clément Goubert: mediawiki: Change mwcron default concurrency policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116800 [13:26:44] (03PS1) 10Michael Große: feat(Surfacing): Add Change Tag for surfaced Add a Link [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124098 (https://phabricator.wikimedia.org/T387160) [13:27:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 
10https://gerrit.wikimedia.org/r/1124098 (https://phabricator.wikimedia.org/T387160) (owner: 10Michael Große) [13:27:07] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:27:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:27:13] PROBLEM - ganeti-wconfd running on ganeti1028 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:27:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:27:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:28:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73957 and previous config saved to /var/cache/conftool/dbconfig/20250303-132834-root.json [13:28:44] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Change mwcron default concurrency policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116800 (owner: 10Clément Goubert) [13:31:18] (03Merged) 10jenkins-bot: mediawiki: Change mwcron default concurrency policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116800 (owner: 10Clément Goubert) [13:32:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73958 and previous config saved to /var/cache/conftool/dbconfig/20250303-133258-root.json [13:35:36] !log cgoubert@deploy2002 Started scap sync-world: Deploying 1116800 1122563 [13:37:14] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73959 and previous config saved to /var/cache/conftool/dbconfig/20250303-133713-root.json [13:37:19] 06SRE, 06Infrastructure-Foundations: Provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135#10595950 (10fgiunchedi) >>! In T78135#10591609, @jhathaway wrote: > I found this task while pondering similar functionality, as I have been using SystemRescue to troubleshoot some issues on our S... [13:37:23] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:40] !log cgoubert@deploy2002 Finished scap sync-world: Deploying 1116800 1122563 (duration: 02m 15s) [13:39:35] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10595957 (10Jhancock.wm) the controller card was replaced. the two backplanes were not. correct. I figured it was more likely the controller card since it was system wide and not... 
[13:43:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73960 and previous config saved to /var/cache/conftool/dbconfig/20250303-134340-root.json [13:44:00] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10595959 (10Jhancock.wm) →14Duplicate dup:03T387431 [13:44:02] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10595961 (10Jhancock.wm) [13:44:24] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387612#10595963 (10Jhancock.wm) →14Duplicate dup:03T387431 [13:44:25] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10595965 (10Jhancock.wm) [13:45:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:47:51] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10595969 (10fgiunchedi) @Dzahn for the record, I get why you are renaming tasks with the hostname, though that will create more work since duplicate tasks will be opened again. The related work to fix t... 
[13:48:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73961 and previous config saved to /var/cache/conftool/dbconfig/20250303-134804-root.json [13:50:23] (03PS1) 10Fabfur: puppet: split puppet timer for calendar and startup run options [puppet] - 10https://gerrit.wikimedia.org/r/1124102 (https://phabricator.wikimedia.org/T383976) [13:50:23] 06SRE, 10Observability-Logging, 10Wikimedia-Apache-configuration: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#10595978 (10fgiunchedi) 05Open→03Declined >>! In T188601#10446174, @andrea.denisse wrote: > Is this related to T187434 ? It is not, the work is t... [13:50:45] FIRING: [3x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [13:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:52:04] (03CR) 10CI reject: [V:04-1] puppet: split puppet timer for calendar and startup run options [puppet] - 10https://gerrit.wikimedia.org/r/1124102 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [13:52:31] (03PS1) 10JMeybohm: Depend on helm or helm3 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1124104 (https://phabricator.wikimedia.org/T341984) [13:53:45] (03CR) 10JMeybohm: [V:03+2 C:03+2] Depend on helm or helm3 [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1124104 (https://phabricator.wikimedia.org/T341984) (owner: 
10JMeybohm) [13:54:52] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2166.codfw.wmnet onto db2167.codfw.wmnet [13:55:26] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:56:01] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [13:56:22] 10ops-eqiad, 06SRE, 10Ceph, 06cloud-services-team, and 2 others: evaluate new drives in cloudcephosd102[123] - https://phabricator.wikimedia.org/T386725#10596009 (10dcaro) Just checked the number of operations/s (as a proxy for performance): * For cloudcephosd1021, comparing with 1018, there's a bit of an... [13:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:47] 10ops-eqiad, 06SRE, 10Ceph, 06cloud-services-team, and 2 others: evaluate new drives in cloudcephosd102[123] - https://phabricator.wikimedia.org/T386725#10596011 (10dcaro) 05Open→03Resolved [13:57:21] PROBLEM - MegaRAID on an-worker1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:57:22] ACKNOWLEDGEMENT - MegaRAID on an-worker1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387732 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:57:26] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1068 - https://phabricator.wikimedia.org/T387732 (10ops-monitoring-bot) 03NEW [13:58:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 75%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P73962 and previous config saved to /var/cache/conftool/dbconfig/20250303-135845-root.json [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1400). [14:00:05] SD_hehua, MatmaRex, ihurbain, Lucas_WMDE, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:57] o/ [14:01:16] o/ [14:01:19] I’m here but wouldn’t mind someone else doing most of the deployments tbh ^^ [14:01:51] hello [14:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73963 and previous config saved to /var/cache/conftool/dbconfig/20250303-140249-root.json [14:03:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73964 and previous config saved to /var/cache/conftool/dbconfig/20250303-140309-root.json [14:04:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73965 and previous config saved to /var/cache/conftool/dbconfig/20250303-140414-root.json [14:06:04] ok let’s start with SD_hehua’s change then [14:06:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055) (owner: 10SD hehua) [14:06:18] hi. 
sorry i'm late [14:06:20] ok [14:06:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2162 db1203', diff saved to https://phabricator.wikimedia.org/P73966 and previous config saved to /var/cache/conftool/dbconfig/20250303-140638-marostegui.json [14:06:51] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1203.eqiad.wmnet [14:06:59] (03Merged) 10jenkins-bot: Set Transwiki namespace on zhwikivoyage and zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055) (owner: 10SD hehua) [14:07:03] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2162.codfw.wmnet [14:07:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 855.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:07:16] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1121622|Set Transwiki namespace on zhwikivoyage and zhwikiversity (T387055)]] [14:07:19] T387055: Set Transwiki namespace on zhwikivoyage and zhwikiversity - https://phabricator.wikimedia.org/T387055 [14:07:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Rebuilding indexes [14:08:00] (03CR) 10Filippo Giunchedi: [C:03+1] "For the benthos part, can't meaningfully comment on the k8s (yet!)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková) [14:08:59] Lucas_WMDE: if you finish this one and then maybe yours (so that it's done and you can go do something else :P ) i can take over the rest if you want me to [14:09:57] 
sounds good to me, thanks! [14:10:00] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10596056 (10MoritzMuehlenhoff) Sure thing, just give me a brief headsup on IRC whenever it works for you and I'll depool the server. [14:10:09] * ihurbain logs into STUFF then [14:12:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 855.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:12:26] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, sdhehua: Backport for [[gerrit:1121622|Set Transwiki namespace on zhwikivoyage and zhwikiversity (T387055)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:29] T387055: Set Transwiki namespace on zhwikivoyage and zhwikiversity - https://phabricator.wikimedia.org/T387055 [14:12:35] SD_hehua: please test [14:13:03] ok [14:13:27] no problem found [14:13:37] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, sdhehua: Continuing with sync [14:13:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1203.eqiad.wmnet [14:13:40] cool, thanks! 
[14:13:43] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387733 (10phaultfinder) 03NEW [14:13:46] PROBLEM - MariaDB Replica Lag: s4 #page on db1248 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5908.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:13:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73968 and previous config saved to /var/cache/conftool/dbconfig/20250303-141350-root.json [14:13:58] federico3: db1248 is yours? [14:13:59] !incidents [14:13:59] 5709 (UNACKED) db1248 (paged)/MariaDB Replica Lag: s4 (paged) [14:13:59] 5708 (RESOLVED) db2166 (paged)/MariaDB Replica Lag: s8 (paged) [14:13:59] 5707 (RESOLVED) db1246 (paged)/MariaDB Replica IO: s2 (paged) [14:13:59] 5706 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [14:14:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2162.codfw.wmnet [14:14:06] !ack 5709 [14:14:07] 5709 (ACKED) db1248 (paged)/MariaDB Replica Lag: s4 (paged) [14:14:07] !ack 5709 [14:14:07] 5709 (ACKED) db1248 (paged)/MariaDB Replica Lag: s4 (paged) [14:14:18] Split second finish [14:14:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1203.eqiad.wmnet with reason: Index rebuild [14:14:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2162.codfw.wmnet with reason: Index rebuild [14:15:04] tnx jelto [14:15:04] marostegui: yes, it's the source of cloning, should recover [14:15:22] it's the same glitch from before (fixed in the CR) [14:15:25] federico3: Why did it send a p4ge? 
[14:15:27] Ah ok [14:15:44] great then I'll wait, I can confirm replica lag is going down already [14:15:49] for db1248 [14:15:55] it seems it does not recover replication lag quickly enough when being added back [14:17:22] (03PS1) 10Muehlenhoff: Add mshilova to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1124107 (https://phabricator.wikimedia.org/T386754) [14:17:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73969 and previous config saved to /var/cache/conftool/dbconfig/20250303-141754-root.json [14:18:03] (03CR) 10Kgraessle: [C:03+1] Add MP event stream for MassDelete workflows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman) [14:19:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73970 and previous config saved to /var/cache/conftool/dbconfig/20250303-141919-root.json [14:20:45] RESOLVED: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [14:21:19] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121622|Set Transwiki namespace on zhwikivoyage and zhwikiversity (T387055)]] (duration: 14m 02s) [14:21:22] T387055: Set Transwiki namespace on zhwikivoyage and zhwikiversity - https://phabricator.wikimedia.org/T387055 [14:21:35] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1066* for ban elastic1066 to hopefully stop rejections - bking@cumin2002 - T387176 [14:21:37] T387176: Investigate eqiad Elastic cluster latency - 
https://phabricator.wikimedia.org/T387176 [14:21:38] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1066* for ban elastic1066 to hopefully stop rejections - bking@cumin2002 - T387176 [14:21:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:22:27] (03Merged) 10jenkins-bot: Enable fixed Wikibase RDF everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE)) [14:22:42] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1118486|Enable fixed Wikibase RDF everywhere (T384344)]] [14:22:45] T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344 [14:23:46] (03PS2) 10Bartosz Dziewoński: Remove unused config variable $wgJsonConfigInterwikiPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122683 [14:23:58] (03PS2) 10Bartosz Dziewoński: Fix inconsistent definitions for $wmgLocalServices['chart-renderer'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123766 [14:24:09] (03PS5) 10Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711 [14:25:45] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[14:25:50] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [14:27:11] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1118486|Enable fixed Wikibase RDF everywhere (T384344)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:27:17] testing… [14:27:44] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [14:27:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:28:10] (03PS1) 10Filippo Giunchedi: pontoon: fix git origin at bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1124109 [14:28:55] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:28:57] lgtm [14:29:45] (03PS1) 10Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [14:30:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10596196 (10phaultfinder) [14:31:11] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] Change license for Russian Wikinews to CC-BY-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123495 (https://phabricator.wikimedia.org/T387279) (owner: 10Bartosz Dziewoński) [14:31:48] !log 
elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:33:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73971 and previous config saved to /var/cache/conftool/dbconfig/20250303-143259-root.json [14:34:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73972 and previous config saved to /var/cache/conftool/dbconfig/20250303-143425-root.json [14:35:31] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118486|Enable fixed Wikibase RDF everywhere (T384344)]] (duration: 12m 49s) [14:35:41] * Lucas_WMDE done deploying [14:35:45] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... [14:35:49] over to you ihurbain :) [14:35:50] MatmaRex: do you think we can do a single deploy for some of our patches? [14:35:50] CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [14:35:55] Lucas_WMDE: ack :) [14:35:57] thank you :) [14:36:14] ihurbain: certainly, even all of them if you wanted [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:53] MatmaRex: like "all of them" -> the 6 of them or the 3 of them? 
:P [14:37:00] (03PS1) 10Federico Ceratto: db2167: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124112 (https://phabricator.wikimedia.org/T387660) [14:37:21] (i haven't looked at the last three tbh) [14:37:26] ihurbain: heh, maybe just the three, i'm not that much of a maverick [14:37:31] :D [14:37:36] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic1066.eqiad.wmnet [14:37:41] it would probably be fine though ;) [14:37:49] okay, let's try that. (the three) [14:38:01] (03PS1) 10Vgutierrez: hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) [14:38:04] ihurbain: also, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1122709 will need a cleanupTitle.php maintenance run on mediawikiwiki. should take just a few seconds. [14:38:12] cleanupTitles.php * [14:38:31] *ah*. [14:38:36] oh, I also meant to check cleanupTitles (or whatever the right script was) for the new namespaces on the zh wikis but forgot [14:38:37] that's a new thing and i had missed that. [14:38:55] namespaceDupes is the one I meant I think [14:38:56] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez) [14:39:20] mmmph. 
[14:39:24] (03CR) 10Elukey: "test-cookbook for puppetserver2004:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [14:39:33] ok nothing to do there [14:40:05] MatmaRex: if that works for you i'll do the parsoid one and the ruwikinews one first, and then the other one for the jsonnamespace one, because i don't want to rush that [14:40:12] (https://phabricator.wikimedia.org/T387055#10596231) [14:40:40] (03PS2) 10Federico Ceratto: db2166: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124112 (https://phabricator.wikimedia.org/T387660) [14:40:40] s/parsoid/fragments/ [14:40:54] ihurbain: sure [14:40:59] okay, let's do that then. [14:41:09] (03PS3) 10Federico Ceratto: db2146: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124112 (https://phabricator.wikimedia.org/T387660) [14:41:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123495 (https://phabricator.wikimedia.org/T387279) (owner: 10Bartosz Dziewoński) [14:41:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [14:42:37] (03Merged) 10jenkins-bot: Change license for Russian Wikinews to CC-BY-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123495 (https://phabricator.wikimedia.org/T387279) (owner: 10Bartosz Dziewoński) [14:42:40] (03Merged) 10jenkins-bot: Revert "Turn on Parsoid fragment support everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry) [14:42:44] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [14:42:45] RESOLVED: 
CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [14:42:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:42:56] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1123495|Change license for Russian Wikinews to CC-BY-4.0 (T387279)]], [[gerrit:1123815|Revert "Turn on Parsoid fragment support everywhere" (T387608)]] [14:43:00] T387279: Change of default license for Russian Wikinews to CC-BY-4.0 - https://phabricator.wikimedia.org/T387279 [14:43:01] T387608: Refs inside {{efn}} are now outputting strip markers in Parsoid - https://phabricator.wikimedia.org/T387608 [14:44:05] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm [14:44:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10596271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm [14:44:29] (03PS2) 10Vgutierrez: hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) [14:44:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1066.eqiad.wmnet [14:45:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:45:37] (03Abandoned) 10Federico Ceratto: db2146: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124112 
(https://phabricator.wikimedia.org/T387660) (owner: 10Federico Ceratto) [14:45:39] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez) [14:45:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:46:42] !log ihurbain@deploy2002 matmarex, ssastry, ihurbain: Backport for [[gerrit:1123495|Change license for Russian Wikinews to CC-BY-4.0 (T387279)]], [[gerrit:1123815|Revert "Turn on Parsoid fragment support everywhere" (T387608)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:46:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [14:46:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:46:54] MatmaRex: we can check stuff [14:47:52] ihurbain: ruwikinews looks good (https://ru.wikinews.org/w/api.php?action=query&meta=siteinfo&siprop=rightsinfo) [14:47:59] and mine looks good too [14:48:00] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:48:04] !log ihurbain@deploy2002 matmarex, ssastry, ihurbain: Continuing with sync [14:48:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 75%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P73973 and previous config saved to /var/cache/conftool/dbconfig/20250303-144805-root.json [14:48:15] wheeee! [14:49:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73974 and previous config saved to /var/cache/conftool/dbconfig/20250303-144930-root.json [14:51:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:51:46] RECOVERY - MariaDB Replica Lag: s4 #page on db1248 is OK: OK slave_sql_lag Replication lag: 50.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:52:22] (03CR) 10Ottomata: [C:03+1] [Growth] Add mediawiki.product_metrics.growth_product_interaction stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123607 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [14:52:49] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [14:52:54] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [14:53:08] (03PS1) 10Federico Ceratto: db1252.yaml, db2167.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124114 [14:53:39] (03CR) 10Marostegui: [C:03+1] db1252.yaml, db2167.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124114 (owner: 10Federico Ceratto) [14:54:22] (03CR) 10Federico Ceratto: [C:03+2] db1252.yaml, db2167.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124114 (owner: 10Federico Ceratto) [14:54:36] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123495|Change license for Russian 
Wikinews to CC-BY-4.0 (T387279)]], [[gerrit:1123815|Revert "Turn on Parsoid fragment support everywhere" (T387608)]] (duration: 11m 39s) [14:54:41] T387279: Change of default license for Russian Wikinews to CC-BY-4.0 - https://phabricator.wikimedia.org/T387279 [14:54:41] T387608: Refs inside {{efn}} are now outputting strip markers in Parsoid - https://phabricator.wikimedia.org/T387608 [14:54:49] all right [14:55:13] MatmaRex: i'm running the other one; while re-reading doc about running maintenance scripts :P [14:55:45] thanks [14:56:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz Dziewoński) [14:56:32] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage [14:56:53] (03Merged) 10jenkins-bot: Remove $wmgUseGraphWithJsonNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz Dziewoński) [14:56:53] apparently it's more complicated post-kubernetes https://wikitech.wikimedia.org/wiki/Maintenance_scripts [14:57:12] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1122709|Remove $wmgUseGraphWithJsonNamespace (T124748)]] [14:57:14] T124748: Deprecate Graph and Data namespaces on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748 [14:57:36] (03CR) 10Cathal Mooney: [C:03+1] "I'm not really persuaded this would ever be an issue, or it's worth the pain of reworking dashboards now. But no real objection I guess, " [puppet] - 10https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [14:58:04] but i think you'll need something like… this? 
mwscript-k8s -f --comment="T124748" -- cleanupTitles --wiki=mediawikiwiki [14:58:20] yeah i'm slowly reaching that conclusion :D [14:59:50] (aaaaa!) (it's fine. PROBABLY.) [14:59:51] !log ihurbain@deploy2002 matmarex, ihurbain: Backport for [[gerrit:1122709|Remove $wmgUseGraphWithJsonNamespace (T124748)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:59:58] MatmaRex: test servers on [15:00:51] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed [15:00:56] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed [15:01:58] ihurbain: seems good [15:02:37] !log ihurbain@deploy2002 matmarex, ihurbain: Continuing with sync [15:03:00] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage [15:03:07] vroom. [15:03:27] note: backport window running a bit over, probably 10 minutes or so [15:06:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:22] ihurbain: thank you! i'd like to do a non urgent deploy when it is finished, please let me know [15:08:00] ack - sorry for the delay! 
[15:08:50] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754#10596392 (10Ahoelzl) Approved. [15:09:07] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122709|Remove $wmgUseGraphWithJsonNamespace (T124748)]] (duration: 11m 55s) [15:09:10] T124748: Deprecate Graph and Data namespaces on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748 [15:09:15] okay. now for the maintenance script. [15:09:15] (no worries at all! take your time) [15:09:42] (doing a dry-run first because eh.) [15:11:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Pooling in after cloning to db1252 T385141', diff saved to https://phabricator.wikimedia.org/P73976 and previous config saved to /var/cache/conftool/dbconfig/20250303-151103-fceratto.json [15:11:07] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [15:11:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73977 and previous config saved to /var/cache/conftool/dbconfig/20250303-151107-root.json [15:11:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73978 and previous config saved to /var/cache/conftool/dbconfig/20250303-151113-root.json [15:11:22] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed [15:11:33] MatmaRex: maintenance script is run, it says "Finished page... 2 of 1804456 rows updated" [15:11:51] ihurbain: thanks [15:11:52] hmm, 2? [15:12:10] https://phabricator.wikimedia.org/T124748#10588324 i think that's expected? 
[15:12:49] mmh [15:12:53] oh, it had a talk page [15:12:59] yes, all good [15:13:08] i expected only this one: https://www.mediawiki.org/wiki/Broken/NS486:Json:Wikicon i didn't realize it had a talk too [15:13:09] aha. [15:13:13] amazing. [15:13:26] which is now https://www.mediawiki.org/wiki/Broken/NS487:Json:Wikicon [15:13:33] thanks for deploying! [15:13:43] then: deployment window is over, you still have your three "if we have time" to move to another one, and we're done - ottomata you're free to go! [15:13:57] MatmaRex: thank you for having a patch that taught me something :D [15:14:09] (03PS3) 10Vgutierrez: hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) [15:16:30] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez) [15:16:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:18:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2004.codfw.wmnet with OS bookworm [15:18:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10596416 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm completed: -... 
[15:20:56] should https://wikitech.wikimedia.org/wiki/Maintenance_scripts be linked from https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers somewhere and/or " (At some point, this will be obsoleted by mwscript-k8s, currently under development at T341553.)" be updated on that page? (I'm happy to do this, I just don't want to be "too early" - my understanding is "that This (mwscript-k8s) Is Now The Way", but double [15:20:56] checking before I touch stuff :D (yeah yeah i know the wiki way) [15:20:57] T341553: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553 [15:21:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [15:21:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:21:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:23:26] (bah. i'll update doc, if someone disagrees they can fix it.) [15:23:45] ihurbain: yeah it should probably be updated to use mwscript-k8s by default, at least we would catch more potential bugs and issues [15:23:55] (03PS16) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [15:23:55] (03CR) 10Tiziano Fogli: "It will work for pro4x PDUs out of the box. 
However, some modifications to the netbox-hiera outputs will be needed to include the PDU mode" [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [15:24:03] (03PS7) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [15:24:13] oops i forgot to log the end of the window [15:24:21] !log UTC afternoon deploys done [15:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:24] (hopla.) [15:28:37] https://wikitech.wikimedia.org/w/index.php?title=Backport_windows%2FDeployers&diff=2279007&oldid=2238500 hop. [15:32:37] (03PS17) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [15:33:28] (03CR) 10Eevans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1124109 (owner: 10Filippo Giunchedi) [15:35:16] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2166 gradually with 4 steps - Cloned db2166 to db2167 [15:35:31] ihurbain: thank you! 
[15:36:16] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [15:36:25] (03PS18) 10Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - 10https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: 10Arnaudb) [15:36:34] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [15:36:38] !log deploying eventgate-logging-external to bump to node20 - T383814 [15:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:41] T383814: Upgrade eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814 [15:36:44] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:37:23] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [15:37:44] (03PS4) 10JMeybohm: Respect kubeVersion constraints in charts and admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376) [15:37:50] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix git origin at bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1124109 (owner: 10Filippo Giunchedi) [15:37:54] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [15:38:36] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [15:40:29] (03PS1) 10Filippo Giunchedi: pontoon: improve pontoon-wait-puppet [puppet] - 10https://gerrit.wikimedia.org/r/1124122 [15:40:44] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2167 gradually with 4 steps - Cloned 
db2166 to db2167 [15:41:44] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed [15:41:46] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [15:42:12] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [15:43:20] (03CR) 10Muehlenhoff: [C:03+2] Add mshilova to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1124107 (https://phabricator.wikimedia.org/T386754) (owner: 10Muehlenhoff) [15:43:24] (03CR) 10Scott French: "Thank you both for the reviews!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [15:43:25] (03CR) 10Scott French: [C:03+2] shellbox-media: serve 1/8 of requests on 8.1 with more logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [15:44:30] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: improve pontoon-wait-puppet [puppet] - 10https://gerrit.wikimedia.org/r/1124122 (owner: 10Filippo Giunchedi) [15:45:04] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754#10596562 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzM... 
[15:45:08] (03Merged) 10jenkins-bot: shellbox-media: serve 1/8 of requests on 8.1 with more logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [15:47:08] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10596605 (10MoritzMuehlenhoff) 05Open→03Stalled [15:47:30] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [15:47:58] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [15:48:26] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387748 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [15:48:34] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387748 (10ops-monitoring-bot) 03NEW [15:50:28] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [15:50:33] (03PS1) 10JMeybohm: Set DH_GOLANG_BUILDPKG in debian/rules [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1124136 [15:50:38] (03PS1) 10Vgutierrez: hiera,eventschemas: Enable IPIP on schema@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124137 (https://phabricator.wikimedia.org/T387308) [15:50:39] (03PS1) 10Vgutierrez: hiera,eventschemas: Enable IPIP on schema@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124138 (https://phabricator.wikimedia.org/T387308) [15:50:43] (03PS1) 10Filippo Giunchedi: pontoon: add pontoonctl wait-puppet command [puppet] - 10https://gerrit.wikimedia.org/r/1124123 [15:50:43] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [15:50:56] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add pontoonctl 
wait-puppet command [puppet] - 10https://gerrit.wikimedia.org/r/1124123 (owner: 10Filippo Giunchedi) [15:51:24] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124138 (https://phabricator.wikimedia.org/T387308) (owner: 10Vgutierrez) [15:51:29] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124137 (https://phabricator.wikimedia.org/T387308) (owner: 10Vgutierrez) [15:51:29] !log started shellbox-media PHP 8.1 pilot with increased logging - T377038 [15:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:32] T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038 [15:51:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:52:56] (03PS1) 10Filippo Giunchedi: pontoon: misc ctl improvements [puppet] - 10https://gerrit.wikimedia.org/r/1124124 [15:53:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2155,2187].codfw.wmnet with reason: Rebuilding indexes [15:54:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1247 db2155', diff saved to https://phabricator.wikimedia.org/P73985 and previous config saved to /var/cache/conftool/dbconfig/20250303-155447-marostegui.json [15:54:58] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1247.eqiad.wmnet [15:55:06] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2155.codfw.wmnet [15:55:53] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: misc ctl improvements [puppet] - 10https://gerrit.wikimedia.org/r/1124124 (owner: 10Filippo Giunchedi) [15:56:07] (03PS2) 
10Filippo Giunchedi: pontoon: misc ctl improvements [puppet] - 10https://gerrit.wikimedia.org/r/1124124 [15:56:32] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: misc ctl improvements [puppet] - 10https://gerrit.wikimedia.org/r/1124124 (owner: 10Filippo Giunchedi) [15:58:04] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1252 gradually with 4 steps - Cloned db124 to db1252 [15:58:06] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1252 gradually with 4 steps - Cloned db124 to db1252 [15:58:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1252 gradually with 4 steps - Cloned db124 to db1252 [15:58:44] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1252 gradually with 4 steps - Cloned db124 to db1252 [15:59:35] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset for harroyo-wmf - https://phabricator.wikimedia.org/T386922#10596705 (10hector.arroyo) Thanks! 
[16:00:05] (03CR) 10CI reject: [V:04-1] Set DH_GOLANG_BUILDPKG in debian/rules [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1124136 (owner: 10JMeybohm) [16:00:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1247.eqiad.wmnet [16:01:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2155.codfw.wmnet [16:01:20] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Index rebuild [16:01:37] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Index rebuild [16:03:58] (03PS1) 10Scott French: Revert "shellbox-media: serve 1/8 of requests on 8.1 with more logging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124140 (https://phabricator.wikimedia.org/T377038) [16:06:33] (03CR) 10Scott French: [C:03+2] Revert "shellbox-media: serve 1/8 of requests on 8.1 with more logging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124140 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [16:08:11] (03Merged) 10jenkins-bot: Revert "shellbox-media: serve 1/8 of requests on 8.1 with more logging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124140 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [16:10:00] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [16:10:06] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [16:10:19] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [16:10:23] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [16:10:49] !log finished shellbox-media PHP 8.1 pilot - T377038 [16:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:51] T377038: Migrate production Shellbox variants to 
PHP 8.1 - https://phabricator.wikimedia.org/T377038 [16:13:34] (03PS18) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [16:13:54] (03PS2) 10Filippo Giunchedi: pontoon: add config set/get to base, rework tests [puppet] - 10https://gerrit.wikimedia.org/r/1124125 [16:14:10] (03PS3) 10Filippo Giunchedi: pontoon: prompt the user for host prefix and save it to config [puppet] - 10https://gerrit.wikimedia.org/r/1124126 [16:14:26] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: add config set/get to base, rework tests [puppet] - 10https://gerrit.wikimedia.org/r/1124125 (owner: 10Filippo Giunchedi) [16:17:09] (03PS3) 10Filippo Giunchedi: pontoon: refactor controller class into its own file [puppet] - 10https://gerrit.wikimedia.org/r/1124127 [16:17:11] (03PS3) 10Filippo Giunchedi: pontoon: turn relative imports into absolute [puppet] - 10https://gerrit.wikimedia.org/r/1124128 [16:17:12] (03PS3) 10Filippo Giunchedi: pontoon: improve enroll experience [puppet] - 10https://gerrit.wikimedia.org/r/1124129 [16:17:13] (03PS3) 10Filippo Giunchedi: pontoon: rework bootstrap instructions in README.md [puppet] - 10https://gerrit.wikimedia.org/r/1124130 [16:18:04] !log depool maps2009 T387431 [16:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:07] T387431: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431 [16:18:13] (03PS1) 10Tiziano Fogli: network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) [16:18:13] (03CR) 10Tiziano Fogli: "I tried the updated GraphQL query manually, but I didn't test it with test-cookbook." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [16:18:51] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: prompt the user for host prefix and save it to config [puppet] - 10https://gerrit.wikimedia.org/r/1124126 (owner: 10Filippo Giunchedi) [16:19:23] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] pontoon: refactor controller class into its own file [puppet] - 10https://gerrit.wikimedia.org/r/1124127 (owner: 10Filippo Giunchedi) [16:19:38] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: turn relative imports into absolute [puppet] - 10https://gerrit.wikimedia.org/r/1124128 (owner: 10Filippo Giunchedi) [16:19:47] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: improve enroll experience [puppet] - 10https://gerrit.wikimedia.org/r/1124129 (owner: 10Filippo Giunchedi) [16:19:58] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: rework bootstrap instructions in README.md [puppet] - 10https://gerrit.wikimedia.org/r/1124130 (owner: 10Filippo Giunchedi) [16:20:11] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [16:20:50] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2166 gradually with 4 steps - Cloned db2166 to db2167 [16:21:08] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:08] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:23:08] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.411 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:23:08] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53658 bytes in 8.603 
second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:25:00] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10596835 (10MatthewVernon) I still don't see that Dell can claim we're using the drives incorrectly given they sold us this setup? I think I'd tend to try swapping the backplane... [16:26:20] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2167 gradually with 4 steps - Cloned db2166 to db2167 [16:26:32] (03PS2) 10Ottomata: eventgate-logging-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122159 (https://phabricator.wikimedia.org/T383814) [16:26:32] (03PS1) 10Ottomata: eventgate-analytics-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124144 (https://phabricator.wikimedia.org/T383814) [16:28:13] (03CR) 10Ottomata: [V:03+2 C:03+2] eventgate-logging-external - upgrade to node20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122159 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [16:28:15] (03PS1) 10Muehlenhoff: Record LDAP access for dhardy [puppet] - 10https://gerrit.wikimedia.org/r/1124145 (https://phabricator.wikimedia.org/T387157) [16:29:25] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmf for DHardy-WMF - https://phabricator.wikimedia.org/T387157#10596861 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Access has been granted via Wikimedia IDM. [16:29:49] (03CR) 10Muehlenhoff: [C:03+2] Record LDAP access for dhardy [puppet] - 10https://gerrit.wikimedia.org/r/1124145 (https://phabricator.wikimedia.org/T387157) (owner: 10Muehlenhoff) [16:30:05] jan_drewniak: That opportune time for a Wikimedia Portals Update deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1630). 
[16:31:26] jan_drewniak: I have an unrelated deployment to do, mind if I go now, or should I wait? [16:32:35] (03PS1) 10Muehlenhoff: Track LDAP access for chuckonwumelu [puppet] - 10https://gerrit.wikimedia.org/r/1124146 (https://phabricator.wikimedia.org/T387627) [16:33:18] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387748#10596880 (10Pppery) →14Duplicate dup:03T382984 [16:33:19] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10596882 (10Pppery) [16:33:20] (03CR) 10Muehlenhoff: [C:03+2] Track LDAP access for chuckonwumelu [puppet] - 10https://gerrit.wikimedia.org/r/1124146 (https://phabricator.wikimedia.org/T387627) (owner: 10Muehlenhoff) [16:34:13] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmf for chuckonwumelu - https://phabricator.wikimedia.org/T387627#10596885 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Access has been enabled via Wikimedia IDM. [16:34:23] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1248.eqiad.wmnet onto db1252.eqiad.wmnet [16:34:29] 06SRE: Ops-monitoring-bot creating duplicate tasks for the same RAID failure - https://phabricator.wikimedia.org/T387754 (10Pppery) 03NEW [16:37:23] FIRING: ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:37:32] jan_drewniak: i'm assuming it is okay! proceeding. 
[16:37:36] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [16:38:04] !log deploying eventgate-logging-external to ACTUALLY bump to node20 - T383814 [16:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:07] T383814: Upgrade eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814 [16:38:10] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10596921 (10MoritzMuehlenhoff) @HCoplin-WMF This needs approval by your manager. [16:38:14] (03CR) 10BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [16:38:20] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [16:41:18] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for hswan - https://phabricator.wikimedia.org/T387522#10596943 (10MoritzMuehlenhoff) @HSwan-WMF: Requests to the wmf and logstash-access LDAP groups are handled within Wikimedia IDM: Can you please log into https://idm.wikimedia.org and request the groups by... 
[16:42:03] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [16:42:23] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:42:49] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [16:43:11] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [16:46:53] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:47:06] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:48:05] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [16:49:02] (03CR) 10Elukey: "The code seems not working for UEFI, since "PXE" seems only valid in the "Legacy" domain. 
I am going to check other UEFI nodes to figure o" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey) [16:49:35] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:49:49] (03PS1) 10Vgutierrez: site,hiera: Reimage lvs5006 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477) [16:50:37] (03PS2) 10Vgutierrez: site,hiera: Reimage lvs5006 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477) [16:50:54] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [16:52:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [16:52:23] RESOLVED: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:53:02] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10597002 (10MoritzMuehlenhoff) @Jhancock.wm maps2009 is ready [16:53:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10597003 (10elukey) I was able to run `restart` (the command is not visible in the help, but available) and the output was: ` elukey@ms... 
[16:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:59:21] 06SRE, 10SRE-Access-Requests: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614#10597013 (10bwang) [16:59:51] 06SRE, 10SRE-Access-Requests: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614#10597015 (10bwang) @MoritzMuehlenhoff Sorry I realized i already have wmf access, i need access to 'analytics-privatedata-users' for private data on superset [17:02:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123607 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno) [17:02:43] (03CR) 10JMeybohm: [V:03+2 C:03+2] Set DH_GOLANG_BUILDPKG in debian/rules [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1124136 (owner: 10JMeybohm) [17:03:09] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10597032 (10Jclark-ctr) @fnegri @VRiley-WMF did this need to be reopened. idrac shows The system inlet temperature is greater than the... [17:03:13] (03PS1) 10JMeybohm: Use goccy/go-yaml instead of gopkg.in/yaml.v2 by default [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1124151 (https://phabricator.wikimedia.org/T341984) [17:05:22] 06SRE, 10SRE-Access-Requests: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614#10597057 (10Jdrewniak) > - access request (or expansion) has sign off of WMF sponsor/manager (sponsor for volunteers, manager for wmf staff) Just confirming that @bwang needs access to 'analytics-priv... 
[17:06:07] (03CR) 10Volans: [C:03+1] "The addition LGTM, nothing major and the model can surely be useful to simplify the related prometheus code I see in the other CRs related" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [17:08:31] (03CR) 10Vgutierrez: [C:04-2] "to be merged on 2025-03-04" [puppet] - 10https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:10:52] (03CR) 10Jelto: [C:03+1] "lgtm" [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1124151 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:11:50] !log dancy@deploy2002 Installing scap version "4.139.0" for 204 host(s) [17:13:42] (03CR) 10Ssingh: [C:03+1] site,hiera: Reimage lvs5006 as liberica [puppet] - 10https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477) (owner: 10Vgutierrez) [17:16:43] !log dancy@deploy2002 Installation of scap version "4.139.0" completed for 204 hosts [17:17:57] (03CR) 10Krinkle: Add config needed to re-architecture mainstash away from x2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup) [17:19:09] (03CR) 10JMeybohm: [C:03+2] Use goccy/go-yaml instead of gopkg.in/yaml.v2 by default [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1124151 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [17:21:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:28:54] (03PS1) 10Elukey: knative-serving: add a 
default value for config-observability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124155 (https://phabricator.wikimedia.org/T387580) [17:29:35] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:31:24] (03CR) 10Klausman: [C:03+1] knative-serving: add a default value for config-observability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124155 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [17:34:10] (03CR) 10Elukey: [C:03+2] knative-serving: add a default value for config-observability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124155 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey) [17:40:35] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [17:42:07] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [17:42:23] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:44:35] RESOLVED: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [17:48:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387769 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [17:48:32] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387769 (10ops-monitoring-bot) 03NEW [17:50:44] PROBLEM - Host maps2009 is DOWN: PING CRITICAL - Packet loss = 100% [17:51:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:54:14] (03PS2) 10Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [17:54:24] maps2009 just went down? [17:54:25] FIRING: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:37] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [17:54:53] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [17:54:56] vgutierrez: Jennifer is fixing things with the mgmt [17:55:03] thx [17:56:32] RECOVERY - Host maps2009 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms [17:59:25] RESOLVED: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:04] swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1800). [18:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1800). 
[18:00:07] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10597479 (10Jhancock.wm) rebooted the server and drained power. on power up, confirmed that mgmt and network ip were pingable. [18:00:10] !log repool maps2009 T387431 [18:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:13] T387431: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431 [18:00:33] o/ I'll get started on at least one of my two planned changes shortly [18:01:00] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10597483 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:01:22] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387733#10597499 (10Jhancock.wm) →14Duplicate dup:03T387431 [18:01:24] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10597501 (10Jhancock.wm) [18:01:38] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:01:45] (03CR) 10Scott French: "Thanks for the review!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1123700 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:01:48] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:02:02] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [18:02:16] (03PS3) 10Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) [18:02:35] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:02:46] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:03:05] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [18:03:32] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10597510 (10Jhancock.wm) i agree. I'll get those backplanes replaced and we can try that. (honestly, been trying to figure how to do it since they're behind a lot of other parts)... [18:07:34] (03CR) 10Scott French: "Thank you both!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:07:37] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): scale next to 40% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:09:22] (03Merged) 10jenkins-bot: mw-(api-ext|web): scale next to 40% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:12:14] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:12:30] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:13:19] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:13:31] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:14:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73993 and previous config saved to /var/cache/conftool/dbconfig/20250303-181451-root.json [18:15:17] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:15:33] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:16:30] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:16:42] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:17:23] !log scaled mw-(api-ext|web) next deployments to 40% of main size - T383845 [18:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:25] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:21:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:21:58] (03Merged) 10jenkins-bot: Enroll 100% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:22:18] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1123694|Enroll 100% of client sessions in PHP 8.1 (T383845)]] [18:24:56] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1123694|Enroll 100% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:24:58] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:25:35] (03CR) 10JMeybohm: [C:03+2] Respect kubeVersion constraints in charts and admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [18:26:28] !log swfrench@deploy2002 swfrench: Continuing with sync [18:28:11] (03PS1) 10DCausse: cirrus-streaming-updater: scale up consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124166 (https://phabricator.wikimedia.org/T386935) [18:29:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73994 and previous config saved to /var/cache/conftool/dbconfig/20250303-182956-root.json [18:30:26] jouncebot: nowandnext [18:30:26] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1800) [18:30:27] In 2 hour(s) and 29 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T2100) [18:33:22] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123694|Enroll 100% of client sessions in PHP 8.1 
(T383845)]] (duration: 11m 03s) [18:33:25] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [18:34:43] (03CR) 10Ebernhardson: [C:03+1] cirrus-streaming-updater: scale up consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124166 (https://phabricator.wikimedia.org/T386935) (owner: 10DCausse) [18:37:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73995 and previous config saved to /var/cache/conftool/dbconfig/20250303-183721-root.json [18:38:25] (03PS1) 10CDobbins: geo-maps: update South America DCs [dns] - 10https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) [18:43:09] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: scale up consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124166 (https://phabricator.wikimedia.org/T386935) (owner: 10DCausse) [18:43:36] (03Merged) 10jenkins-bot: Respect kubeVersion constraints in charts and admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm) [18:43:51] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:43:54] (03CR) 10Ssingh: geo-maps: update South America DCs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [18:43:58] (03PS3) 10JMeybohm: Update validating-admission-policies for k8s >=1.30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) [18:44:01] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:44:08] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [18:44:12] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [18:44:13] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:45:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73996 and previous config saved to /var/cache/conftool/dbconfig/20250303-184501-root.json [18:49:32] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for chuckonwumelu - https://phabricator.wikimedia.org/T387627#10597817 (10Aklapper) 05Resolved→03Open Reopening as @Chuckonwumelu did not get added to https://phabricator.wikimedia.org/project/members/61/ per steps on https://wikitech.wikimedia.org/wi... [18:49:32] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for DHardy-WMF - https://phabricator.wikimedia.org/T387157#10597815 (10Aklapper) 05Resolved→03Open Reopening as @Dillon did not get added to https://phabricator.wikimedia.org/project/members/61/ per steps on https://wikitech.wikimedia.org/wiki/SRE/C... 
[18:50:35] (03PS16) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [18:51:01] (03CR) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [18:51:18] (03CR) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [18:51:30] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [18:52:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73997 and previous config saved to /var/cache/conftool/dbconfig/20250303-185227-root.json [18:53:43] (03CR) 10Scott French: [C:03+2] mw-api-int: serve 10% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123700 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:55:23] helm-lint Jenkins job seems stuck and is blocking gate-and-submit for deployment-charts (https://integration.wikimedia.org/ci/job/helm-lint/23283/console) :( [18:57:06] swfrench-wmf: yours ran in ~2min but it's now blocked by mine :/ [18:58:57] dcausse: ah, yeah I see this one is taking a while [18:59:21] first time I've seen this job being so slow... 
[18:59:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10597860 (10phaultfinder) [19:00:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73998 and previous config saved to /var/cache/conftool/dbconfig/20250303-190007-root.json [19:01:19] (03Merged) 10jenkins-bot: cirrus-streaming-updater: scale up consumer-cloudelastic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124166 (https://phabricator.wikimedia.org/T386935) (owner: 10DCausse) [19:01:21] (03Merged) 10jenkins-bot: mw-api-int: serve 10% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123700 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [19:01:56] interesting ... well, those eventually merged :) [19:01:58] dcausse: I see another recent complaint about that in #-releng [19:02:18] dancy: oh, thanks good to know [19:02:52] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:03:04] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:03:20] FYI, going slightly over on the UTC-late infra window today - ETA 5-10m [19:05:02] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [19:05:03] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [19:05:19] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [19:05:35] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [19:05:59] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [19:06:12] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [19:07:02] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [19:07:15] !log 
swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [19:07:30] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [19:07:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73999 and previous config saved to /var/cache/conftool/dbconfig/20250303-190732-root.json [19:07:42] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [19:08:23] !log serving 10% of mw-api-int traffic on PHP 8.1 - T383845 [19:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:25] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [19:08:59] alright, unless anything goes sideways in the interim, I believe I am done for now [19:15:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74000 and previous config saved to /var/cache/conftool/dbconfig/20250303-191513-root.json [19:18:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza) [19:21:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:22:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74001 and previous config saved to /var/cache/conftool/dbconfig/20250303-192237-root.json [19:23:51] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-worker1[187-208] - 
https://phabricator.wikimedia.org/T386390#10597949 (10Jclark-ctr) @BTullis if you get a chance can you update Update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with role... [19:24:14] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10597952 (10Jclark-ctr) a:05Jclark-ctr→03BTullis [19:26:02] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: reforge1005*,relforge1006*,relforge1007* for ban hosts prior to revert - bking@cumin2002 - T387176 [19:26:05] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: reforge1005*,relforge1006*,relforge1007* for ban hosts prior to revert - bking@cumin2002 - T387176 [19:26:05] T387176: Investigate eqiad Elastic cluster latency - https://phabricator.wikimedia.org/T387176 [19:26:55] (03PS1) 10Daimona Eaytoy: Enable $wgCampaignEventsSeparateOngoingEvents by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427) [19:30:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74002 and previous config saved to /var/cache/conftool/dbconfig/20250303-193038-root.json [19:30:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427) (owner: 10Daimona Eaytoy) [19:31:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74003 and previous config saved to /var/cache/conftool/dbconfig/20250303-193136-root.json [19:37:43] !log marostegui@cumin1002 dbctl commit (dc=all): 
'db2155 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74004 and previous config saved to /var/cache/conftool/dbconfig/20250303-193742-root.json [19:42:42] (03CR) 10Scott French: php: Allow provisioning MediaWiki with PHP 8.1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis) [19:45:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74005 and previous config saved to /var/cache/conftool/dbconfig/20250303-194543-root.json [19:46:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74006 and previous config saved to /var/cache/conftool/dbconfig/20250303-194641-root.json [19:47:51] dcausse: I filed T387781 [19:47:52] T387781: Several recent slow (>15 minute) helm-lint job runs - https://phabricator.wikimedia.org/T387781 [19:51:08] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786 (10WRai-WMF) 03NEW [19:51:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:55:53] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786#10598073 (10WRai-WMF) [19:56:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10598078 (10VRiley-WMF) [19:57:17] 06SRE, 10SRE-Access-Requests: Requesting access to releasers-mobile for 
WRai-WMF - https://phabricator.wikimedia.org/T387786#10598079 (10WRai-WMF) [20:00:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74007 and previous config saved to /var/cache/conftool/dbconfig/20250303-200048-root.json [20:01:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74008 and previous config saved to /var/cache/conftool/dbconfig/20250303-200146-root.json [20:02:48] (03PS1) 10Bking: relforge/elastic: repurpose relforge hosts as elastic [puppet] - 10https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) [20:03:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: 10Bking) [20:06:46] (03PS2) 10Bking: relforge/elastic: repurpose relforge hosts as elastic [puppet] - 10https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) [20:09:10] (03PS3) 10Bking: relforge/elastic: repurpose relforge hosts as elastic [puppet] - 10https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) [20:09:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10598104 (10phaultfinder) [20:13:53] (03PS2) 10CDobbins: geo-maps: update South America DCs [dns] - 10https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) [20:15:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74009 and previous config saved to /var/cache/conftool/dbconfig/20250303-201554-root.json [20:16:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 75%: Repooling after rebuild index', diff saved to 
https://phabricator.wikimedia.org/P74010 and previous config saved to /var/cache/conftool/dbconfig/20250303-201652-root.json [20:17:36] (03PS3) 10CDobbins: geo-maps: update South America DCs [dns] - 10https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) [20:18:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387787 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [20:18:30] 06SRE, 10MW-on-K8s, 06serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10598113 (10dancy) >>! In T288629#10582102, @JMeybohm wrote: > I stumbled upon this again recently and I think the current con... [20:18:31] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387787 (10ops-monitoring-bot) 03NEW [20:19:12] (03CR) 10CDobbins: geo-maps: update South America DCs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [20:20:03] 06SRE, 10MW-on-K8s, 06serviceops, 10Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10598121 (10dancy) 05Open→03Resolved a:03dancy ` After the build process creates the restricted mediawiki-multiversi... [20:21:06] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10598124 (10Dzahn) I can stop renaming tickets, no problem. The alternative to duplicate tasks would be a single task for unrelated hosts though. And that seemed worse to me than closing duplicates, ftr. 
[20:22:36] (03CR) 10Ebernhardson: [C:03+1] "seems reasonable, might want to double check how elastic handles shrinking the number of masters, but worst case we can nuke relforge (the" [puppet] - 10https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: 10Bking) [20:26:06] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10598127 (10Dzahn) Or maybe I'm wrong and it doesn't create a single task anymore if it happens for different hosts. No worries either way, I will leave it to dcops how they prefer to handle those. [20:27:19] (03PS1) 10CDobbins: geo-maps: update South America DCs (part 2) [dns] - 10https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774) [20:28:58] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387787#10598133 (10Pppery) →14Duplicate dup:03T382984 [20:29:01] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598135 (10Pppery) [20:29:09] (03PS1) 10Herron: aux-k8s codfw: enable worker ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124179 (https://phabricator.wikimedia.org/T381417) [20:29:14] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387769#10598137 (10Pppery) [20:29:15] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598140 (10Pppery) →14Duplicate dup:03T387769 [20:29:23] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598142 (10Pppery) 05Duplicate→03Open [20:29:33] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387769#10598144 (10Pppery) →14Duplicate dup:03T382984 [20:29:34] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598146 (10Pppery) 
[20:29:45] (03PS7) 10Herron: aux-k8s-worker: deploy role to codfw workers [puppet] - 10https://gerrit.wikimedia.org/r/1123434 (https://phabricator.wikimedia.org/T381417) [20:31:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74011 and previous config saved to /var/cache/conftool/dbconfig/20250303-203100-root.json [20:31:03] (03CR) 10Bking: "Good call, we need to remove the nodes as masters before removing them from the cluster. Since this is relforge I'll go ahead and one-off " [puppet] - 10https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: 10Bking) [20:31:31] (03PS1) 10Herron: aux-k8s codfw: enable worker ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124179 (https://phabricator.wikimedia.org/T381417) [20:31:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74012 and previous config saved to /var/cache/conftool/dbconfig/20250303-203158-root.json [20:32:13] (03PS4) 10Bernard Wang: Update experiment name for Search AB test french wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) [20:32:37] (03PS3) 10Bernard Wang: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849) [20:40:15] (03PS17) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [20:43:27] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [20:46:25] (03PS18) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 
10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [20:46:34] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [20:48:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123807 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [20:48:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387788 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [20:48:38] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387788 (10ops-monitoring-bot) 03NEW [20:51:53] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10598216 (10VRiley-WMF) an-worker1178 A2 U27 CableID 3891 Port 26 an-worker1179 B7 U1 CableID 4884 Port 16 an-workwr1180 C7 U1 CableID 5100 Port 42 an-worker1181 E1 U7 CableID 230304500065... 
[20:52:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [20:57:25] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387788#10598234 (10Pppery) →14Duplicate dup:03T382984 [20:57:28] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598236 (10Pppery) [20:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:59:14] (03PS3) 10Gergő Tisza: Update CentralAuth multi-DC rules for SUL3, attempt 2 [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) [20:59:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10598241 (10VRiley-WMF) a:03VRiley-WMF [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T2100). [21:00:05] bwang, MichaelG_WMF, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] o/ [21:00:21] Hello! [21:00:25] hi [21:02:07] Who is deploying? 
[21:02:54] I'd hope someone of RoanKattouw, cjming, TheresNoTime, or kindrobot [21:08:26] I've asked in slack [21:10:16] (03Abandoned) 10Fabfur: systemd: added option to remain after exit [puppet] - 10https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [21:10:25] (03Abandoned) 10Fabfur: puppet: split puppet timer for calendar and startup run options [puppet] - 10https://gerrit.wikimedia.org/r/1124102 (https://phabricator.wikimedia.org/T383976) (owner: 10Fabfur) [21:10:45] I can deploy [21:10:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm [21:10:53] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598277 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm [21:13:26] Ok great thank you @tgr_ ! [21:13:47] I'll start with the config patches [21:13:56] MatmaRex: they can go in one, right? 
[21:14:17] Ah, my config patch has a dependent patch [21:14:32] yeah, I'll do that afterwards [21:14:44] Ok [21:14:47] tgr_: yeah [21:14:49] scap was smart enough to notice [21:15:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122683 (owner: 10Bartosz Dziewoński) [21:15:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123766 (owner: 10Bartosz Dziewoński) [21:15:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza) [21:16:19] (03Merged) 10jenkins-bot: Remove unused config variable $wgJsonConfigInterwikiPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122683 (owner: 10Bartosz Dziewoński) [21:16:21] (03Merged) 10jenkins-bot: Fix inconsistent definitions for $wmgLocalServices['chart-renderer'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123766 (owner: 10Bartosz Dziewoński) [21:16:23] (03Merged) 10jenkins-bot: Set $wgCentralAuthSharedDomainCallback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza) [21:16:40] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1122683|Remove unused config variable $wgJsonConfigInterwikiPrefix]], [[gerrit:1123766|Fix inconsistent definitions for $wmgLocalServices['chart-renderer']]], [[gerrit:1123776|Set $wgCentralAuthSharedDomainCallback (T387357)]] [21:16:43] T387357: SUL3 signup results in autocreation on all edge login domains - https://phabricator.wikimedia.org/T387357 [21:19:15] !log tgr@deploy2002 matmarex, tgr: Backport for [[gerrit:1122683|Remove unused config variable $wgJsonConfigInterwikiPrefix]], [[gerrit:1123766|Fix inconsistent definitions for 
$wmgLocalServices['chart-renderer']]], [[gerrit:1123776|Set $wgCentralAuthSharedDomainCallback (T387357)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:19:45] !log tgr@deploy2002 matmarex, tgr: Continuing with sync [21:19:54] (03PS4) 10Bking: relforge/elastic: repurpose relforge hosts as elastic [puppet] - 10https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) [21:21:04] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [21:21:18] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [21:21:32] (03CR) 10Gergő Tisza: [C:03+2] Use session storage for session tick events [extensions/WikimediaEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123734 (https://phabricator.wikimedia.org/T387400) (owner: 10Bernard Wang) [21:22:55] (03CR) 10Bking: [C:03+2] relforge/elastic: repurpose relforge hosts as elastic [puppet] - 10https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: 10Bking) [21:23:31] Let me know when I can test! [21:23:39] (03CR) 10Bking: [C:03+2] "I made a small change after the +1 to remove relforge1004 as a master-eligible, since it wasn't set up that way in the first place." [puppet] - 10https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: 10Bking) [21:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:24:09] bwang: do you want to test separately, or the two patches together? 
[21:25:01] Hm I think either works, as long as the config is synced last [21:26:46] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122683|Remove unused config variable $wgJsonConfigInterwikiPrefix]], [[gerrit:1123766|Fix inconsistent definitions for $wmgLocalServices['chart-renderer']]], [[gerrit:1123776|Set $wgCentralAuthSharedDomainCallback (T387357)]] (duration: 10m 06s) [21:26:49] T387357: SUL3 signup results in autocreation on all edge login domains - https://phabricator.wikimedia.org/T387357 [21:26:59] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [21:27:15] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [21:27:22] I'd just sync the two together then, saves some time [21:28:44] Ok [21:28:52] (03CR) 10Gergő Tisza: [C:03+1] docroot: Enable Chrome credential sharing on all open SUL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123810 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [21:29:04] (03CR) 10Kamila Součková: [C:03+1] Update validating-admission-policies for k8s >=1.30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [21:29:07] (03Merged) 10jenkins-bot: Use session storage for session tick events [extensions/WikimediaEvents] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123734 (https://phabricator.wikimedia.org/T387400) (owner: 10Bernard Wang) [21:29:42] what server should I test on [21:29:53] just a sec [21:30:25] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:30:46] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [21:30:57] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [21:32:25] Change '1123448', project 'mediawiki/extensions/WikimediaEvents', branch 'master' not found in any deployed wikiversion. 
Deployed wikiversions: ['1.44.0-wmf.18'] [21:32:51] I guess this is just scap being confused about change IDs being shared across branches? [21:32:57] 1123448 is the master patch [21:33:36] let's see what happens if I override that warning [21:33:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) (owner: 10Bernard Wang) [21:34:14] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1181 - vriley@cumin1002" [21:34:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1181 - vriley@cumin1002" [21:34:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:34:37] (03Merged) 10jenkins-bot: Update experiment name for Search AB test french wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) (owner: 10Bernard Wang) [21:34:55] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1123734|Use session storage for session tick events (T387400)]], [[gerrit:1123449|Update experiment name for Search AB test french wiki (T387400)]] [21:34:58] T387400: SessionTick instrument should use sessionStorage instead of localStorage - https://phabricator.wikimedia.org/T387400 [21:35:26] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1181 [21:35:36] !log vriley@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1181 [21:35:44] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1181 [21:35:51] !log vriley@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=99) for host an-worker1181 [21:36:55] !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1005 to elastic1108 [21:37:19] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:37:34] !log tgr@deploy2002 bwang, tgr: Backport for [[gerrit:1123734|Use session storage for session tick events (T387400)]], [[gerrit:1123449|Update experiment name for Search AB test french wiki (T387400)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:38:54] !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1006 to elastic1109 [21:39:03] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:39:13] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:39:18] bwang: you can use any of the WikimediaDebug server options. I think the standard one to use is k8s-mwdebug. [21:41:40] Ok im testing now [21:41:50] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:41:56] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:42:19] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:42:37] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:42:45] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:43:28] !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:43:37] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from 
relforge1005 to elastic1108 [21:44:31] !log bking@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [21:44:39] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from relforge1006 to elastic1109 [21:44:58] !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1005 to elastic1108 [21:45:11] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:45:59] @tgr_ Thanks for running the deployment window, but I don't think it makes sense to start with my change anymore, unless we basically want to almost completely take over the Weekly Security deployment window [21:46:08] jouncebot: next [21:46:09] In 0 hour(s) and 13 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T2200) [21:46:47] we'll run into it by 10 min or so, should be fine [21:47:58] you sure? That change touches i18n files. In the past they took a long time to sync [21:48:09] but maybe that improved with k8s? [21:48:28] I think scap just syncs everything all the time these days [21:49:01] Alright, I'm here for it. 
Let's try it out when you're ready [21:50:39] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1005 to elastic1108 - bking@cumin2002" [21:51:09] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1005 to elastic1108 - bking@cumin2002" [21:51:10] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:51:10] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1108 [21:51:41] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1108 [21:51:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:52:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from relforge1005 to elastic1108 [21:53:48] Ok things are good! Tested [21:53:56] !log tgr@deploy2002 bwang, tgr: Continuing with sync [21:54:10] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1108.eqiad.wmnet with OS bullseye [21:54:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2013.codfw.wmnet with OS bookworm [21:54:40] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598402 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm executed with errors: - backu... 
[21:55:47] !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1006 to elastic1109 [21:56:00] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:56:16] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:56:24] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:08] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 168 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 247, active_shards: 327, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 164, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.06060606060606 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:08] PROBLEM - OpenSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 168 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 247, active_shards: 327, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 164, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, 
active_shards_percent_as_number: 66.06060606060606 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:59:00] ^^ expected [21:59:50] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1006 to elastic1109 - bking@cumin2002" [22:00:05] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T2200). Please do the needful. [22:00:13] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on relforge[1003-1004,1006-1007].eqiad.wmnet with reason: T387782 [22:00:16] T387782: Repurpose relforge hosts back to Elastic - https://phabricator.wikimedia.org/T387782 [22:00:41] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1006 to elastic1109 - bking@cumin2002" [22:00:41] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:00:42] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1109 [22:00:49] we are running over with the backports a bit [22:00:55] let me know if that's a problem [22:00:58] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1109 [22:01:00] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123734|Use session storage for session tick events (T387400)]], [[gerrit:1123449|Update experiment name for Search AB test french wiki (T387400)]] (duration: 26m 04s) [22:01:03] T387400: SessionTick instrument should use sessionStorage instead of localStorage - https://phabricator.wikimedia.org/T387400 [22:01:36] Reedy, sbassett, Maryum, manfredi: are you using the window? 
if not, we have one more backport to go [22:01:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from relforge1006 to elastic1109 [22:02:56] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1109.eqiad.wmnet with OS bullseye [22:04:18] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [22:04:38] (03PS1) 10Aaron Schulz: Update Docker images of change-prop services to ones using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124191 (https://phabricator.wikimedia.org/T381588) [22:05:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm [22:06:05] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598441 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm [22:06:27] I'll take that as a no [22:06:49] 👍 [22:06:50] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:07:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124098 (https://phabricator.wikimedia.org/T387160) (owner: 10Michael Große) [22:07:10] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1181 [22:07:18] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1181 [22:07:52] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [22:08:00] !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1007 to elastic1110 [22:08:27] (03PS6) 10Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711 [22:10:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:45] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1181 - vriley@cumin1002" [22:12:03] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1181 - vriley@cumin1002" [22:12:04] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:12:49] !log bking@cumin2002 START - Cookbook sre.dns.netbox [22:13:12] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1181 [22:13:20] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1181 [22:15:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:15:39] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:18:15] (03PS2) 10CDobbins: geo-maps: update South America DCs (part 2) [dns] - 10https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774) [22:19:27] (03Merged) 10jenkins-bot: feat(Surfacing): Add Change Tag for surfaced Add a Link [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124098 (https://phabricator.wikimedia.org/T387160) (owner: 10Michael Große) [22:19:43] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124098|feat(Surfacing): 
Add Change Tag for surfaced Add a Link (T387160)]] [22:19:46] T387160: Surfacing "Add a link" Structured Tasks: Edit Tag - https://phabricator.wikimedia.org/T387160 [22:20:01] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1007 to elastic1110 - bking@cumin2002" [22:20:07] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1007 to elastic1110 - bking@cumin2002" [22:20:07] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:20:08] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1110 [22:20:19] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1110 [22:20:59] (03CR) 10Ryan Kemper: wdqs: create new ui for wdqs legacy full (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [22:21:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from relforge1007 to elastic1110 [22:21:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:23:54] (03PS3) 10Ryan Kemper: wdqs: create new ui for wdqs legacy full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) [22:24:09] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1110.eqiad.wmnet with OS bullseye [22:24:25] (03PS4) 10CDobbins: geo-maps: update South America DCs [dns] - 
10https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) [22:26:53] (03CR) 10Ryan Kemper: wdqs: add routing for legacy full graph host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [22:27:28] mutante: just a heads-up, will be merging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1122678 to create a new `wdqs-legacy-full` ui in k8s miscweb [22:28:21] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - free space: /srv 9009 MB (3% inode=69%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [22:30:00] ryankemper: I don't know much about that but Jelto saying "mostly good" before gives me some level of confidence ;) [22:30:41] that being said, the disk space on deployment host thing is a bit concerning [22:31:16] (03CR) 10Aleksandar Mastilovic: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700) (owner: 10Brouberol) [22:31:31] let's see if there is something easy to do about that.. probably not .. 
but checking [22:32:21] yea, probably needs releng to delete old mw versions [22:32:51] !log ryankemper@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [22:32:53] !log ryankemper@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [22:33:24] oh just saw the disk space thing [22:34:28] (03PS4) 10Ryan Kemper: wdqs: create new ui for wdqs legacy full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) [22:35:12] !log tgr@deploy2002 migr, tgr: Backport for [[gerrit:1124098|feat(Surfacing): Add Change Tag for surfaced Add a Link (T387160)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:35:15] T387160: Surfacing "Add a link" Structured Tasks: Edit Tag - https://phabricator.wikimedia.org/T387160 [22:35:16] 169G docker, 29G mediawiki-staging, 57G deployment. I would fix stuff if it was on / but on /srv/ under actual deployment dirs I'd rather leave it alone [22:35:26] MichaelG_WMF: ^ [22:35:43] thanks, I'm testing [22:36:02] (03PS1) 10CDobbins: geo-maps: update South America DCs [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) [22:36:09] 3% means 8.8G. I don't know if it's an issue right this moment. 
[22:36:21] can make a ticket either way [22:36:58] (03Abandoned) 10CDobbins: geo-maps: update South America DCs [dns] - 10https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) (owner: 10CDobbins) [22:39:01] !log ryankemper@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [22:39:12] !log ryankemper@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [22:40:34] tgr_: it worked, I can see the new tag on https://test.wikipedia.org/w/index.php?title=The_Power_of_the_Dog_(film)&action=history [22:40:50] 06SRE, 06Release-Engineering-Team: deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796 (10Dzahn) 03NEW [22:41:10] !log tgr@deploy2002 migr, tgr: Continuing with sync [22:41:50] the time estimate was a bit off, but apparently no harm done :) [22:42:32] (03PS3) 10CDobbins: geo-maps: update South America DCs (part 2) [dns] - 10https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774) [22:42:38] (03PS2) 10Ryan Kemper: wdqs: Create DNS entry for one full graph host [dns] - 10https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422) [22:42:57] (03PS2) 10CDobbins: geo-maps: update South America DCs (part 1/2) [dns] - 10https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) [22:43:04] 06SRE, 06Release-Engineering-Team: deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598664 (10Dzahn) Also the part that if the only notification is an IRC line on the -operations channel it is not easily noticed anymore nowadays. Might it be better if that was an email... 
[22:43:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123810 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [22:43:56] (03CR) 10Ryan Kemper: "Test deployment to codfw looks good; going to merge and deploy to eqiad now" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [22:43:58] (03CR) 10Ryan Kemper: [C:03+2] wdqs: create new ui for wdqs legacy full [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [22:44:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [skins/Vector] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1123713 (https://phabricator.wikimedia.org/T358910) (owner: 10Jdlrobson) [22:45:31] !log ryankemper@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [22:46:10] !log ryankemper@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [22:46:19] !log T384422 k8s deployment of `wikidata-query-legacy-full-gui` release in codfw looks fine, proceeding to eqiad [22:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:22] T384422: Provide a low availability / scalability full graph endpoint to ease the transition to a split graph - https://phabricator.wikimedia.org/T384422 [22:46:53] (03CR) 10Ryan Kemper: [C:03+2] wdqs: Create DNS entry for one full graph host [dns] - 10https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [22:47:31] !log T384422 Merging DNS patch now https://gerrit.wikimedia.org/r/c/operations/dns/+/1122676 [22:47:33] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:40] !log ryankemper@dns1004 START - running authdns-update [22:49:47] !log ryankemper@dns1004 END - running authdns-update [22:51:12] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124098|feat(Surfacing): Add Change Tag for surfaced Add a Link (T387160)]] (duration: 31m 28s) [22:51:15] T387160: Surfacing "Add a link" Structured Tasks: Edit Tag - https://phabricator.wikimedia.org/T387160 [22:51:55] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:52:15] !log late UTC deploys done [22:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:33] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1108.eqiad.wmnet with OS bullseye [22:54:58] (03CR) 10Ryan Kemper: [C:03+2] wdqs: add routing for legacy full graph host [puppet] - 10https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: 10Ryan Kemper) [22:56:38] !log T384422 Deploying backend.yaml routing patch; after it's deployed we should theoretically be able to see a UI at https://query-legacy-full.wikidata.org/ [22:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:40] T384422: Provide a low availability / scalability full graph endpoint to ease the transition to a split graph - https://phabricator.wikimedia.org/T384422 [23:01:03] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1109.eqiad.wmnet with OS bullseye [23:01:05] PROBLEM - Ensure traffic_server is running for instance backend on cp5021 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:02:05] RECOVERY - Ensure 
traffic_server is running for instance backend on cp5021 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:02:22] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:02:34] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:06:56] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10598755 (10VRiley-WMF) [23:08:17] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10598761 (10VRiley-WMF) @BTullis is there a specific RAID that is supposed to be placed onto these servers? [23:13:01] (03PS1) 10Scott French: php8.1: Default display_startup_errors to "stderr" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1124194 (https://phabricator.wikimedia.org/T377038) [23:16:50] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1108.eqiad.wmnet with OS bullseye [23:17:00] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1109.eqiad.wmnet with OS bullseye [23:20:10] (03PS1) 10Ryan Kemper: wdqs: create query-legacy-full.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422) [23:21:25] (03PS2) 10Ryan Kemper: wdqs: create query-legacy-full.wikidata.org [dns] - 10https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422) [23:22:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1110.eqiad.wmnet with OS bullseye [23:23:09] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1110.eqiad.wmnet with OS bullseye [23:23:36] (03PS1) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124198 (https://phabricator.wikimedia.org/T387799) 
[23:24:15] (03Abandoned) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124198 (https://phabricator.wikimedia.org/T387799) (owner: 10Fabfur) [23:24:50] 06SRE, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598803 (10thcipriani) > Somehow needs cleaning up but since it's not OS but actual deployment data the question is what can be deleted. Probably old mw versions.. Old MW versions... [23:25:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2014.codfw.wmnet with OS bookworm [23:25:29] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:25:34] (03PS1) 10Fabfur: systemd: add path unit type [puppet] - 10https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799) [23:25:36] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598805 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm executed with errors: - backu... [23:26:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598810 (10Jhancock.wm) @Papaul both servers got stuck at the puppet certification part again. When you can, can you see if they are talking to the wrong server? 
thanks [23:32:31] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1108.eqiad.wmnet with reason: host reimage [23:32:47] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1109.eqiad.wmnet with reason: host reimage [23:36:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1108.eqiad.wmnet with reason: host reimage [23:38:40] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1110.eqiad.wmnet with reason: host reimage [23:40:16] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1109.eqiad.wmnet with reason: host reimage [23:41:55] 06SRE, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598845 (10Dzahn) Yea, those are mediawiki-multiversion images. Some are 8GB. Example: ` docker-registry.discovery.wmnet/restricted/mediawiki-multiversion 2025-02-19-01... [23:42:29] 06SRE, 06serviceops-radar, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598848 (10Dzahn) [23:44:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1110.eqiad.wmnet with reason: host reimage [23:45:34] 06SRE, 06serviceops-radar, 10Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598850 (10Dzahn) p:05Triage→03High With about 8.8GB space left and those images that can also be about 8.8G and sometimes multiple images per day ...I am... 
[23:48:24] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387802 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [23:48:30] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387802 (10ops-monitoring-bot) 03NEW [23:50:47] !log deleted local user_password from labswiki database (T104500 and T161859) [23:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:51] T104500: Old versions of sensitive user data (email, password hashes) can remain in database indefinitely due to local and global DB not being kept in sync - https://phabricator.wikimedia.org/T104500 [23:50:51] T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859 [23:51:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:53:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1108.eqiad.wmnet with OS bullseye [23:56:37] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1109.eqiad.wmnet with OS bullseye [23:58:31] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387802#10598895 (10Pppery) →14Duplicate dup:03T382984 [23:58:33] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598897 (10Pppery)