[00:00:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 834.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:10:15] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.032s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:15:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 980.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:31:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 834.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:31:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:36:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 805.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:38:56] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123811
[00:38:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123811 (owner: TrainBranchBot)
[00:57:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:03:40] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123811 (owner: TrainBranchBot)
[01:08:37] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123813
[01:08:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123813 (owner: TrainBranchBot)
[01:16:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:21:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.096s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:21:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:23:48] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:41:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 857.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:51:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 838.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:08:50] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:09:04] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:09:14] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:10:54] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53656 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:11:06] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:11:42] <wikibugs>	 (PS1) Subramanya Sastry: Revert "Turn on Parsoid fragment support everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123815
[02:15:25] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123813 (owner: TrainBranchBot)
[02:16:44] <wikibugs>	 (PS2) Subramanya Sastry: Revert "Turn on Parsoid fragment support everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608)
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:44:58] <icinga-wm>	 PROBLEM - BFD status on cr1-magru is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:45:58] <icinga-wm>	 RECOVERY - BFD status on cr1-magru is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:51:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:18:25] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387692 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[03:18:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387692 (ops-monitoring-bot) NEW
[03:21:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:41:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:50:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123802 (https://phabricator.wikimedia.org/T386719) (owner: Sbisson)
[03:57:51] <wikibugs>	 (CR) KartikMistry: [C:+1] Enable CX unified dashboard on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1123802 (https://phabricator.wikimedia.org/T386719) (owner: Sbisson)
[04:11:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:31:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:57:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:01:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:03:22] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:03:24] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:11:22] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:11:28] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:11:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:15:58] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:17:14] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:23:48] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:37:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:41:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:42:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs2013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:03:15] <jinxer-wm>	 FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:08:15] <jinxer-wm>	 RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.375s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:09:15] <jinxer-wm>	 FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.286s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:11:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:18:30] <jinxer-wm>	 RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.22s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:24:35] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1123938 (https://phabricator.wikimedia.org/T387547)
[06:34:45] <jinxer-wm>	 FIRING: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.22s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:38:30] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 872.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:41:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:55:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 922.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:05:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 847ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:08:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 896.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:10:29] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: inject the AIRFLOW_APPOWNER environment variable in all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1123524 (https://phabricator.wikimedia.org/T386282) (owner: Brouberol)
[07:13:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 802ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:14:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3586 MB (3% inode=98%): /tmp 3586 MB (3% inode=98%): /var/tmp 3586 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[07:15:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 882.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:16:20] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2025-03-03-041049-production [deployment-charts] - https://gerrit.wikimedia.org/r/1123940 (https://phabricator.wikimedia.org/T369815)
[07:18:30] <moritzm>	 !log installing Linux 6.1.128 on Bookworm hosts
[07:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 821.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:21:01] <wikibugs>	 SRE, DBA: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10594869 (Marostegui) a:Marostegui
[07:22:28] <wikibugs>	 SRE, DBA: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10594871 (Marostegui) Same problem as always:  ` ------------------------------------------------------------------------------- Record:      16 Date/Time:   03/02/2025 20:45:55 Source:      system Severity:    Critical...
[07:26:27] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10594873 (Marostegui) Same issue as: T359940  T361968 T363119 T374215 I am going to write to the Dell thread.
[07:30:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.198s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:30:48] <wikibugs>	 (PS1) Marostegui: db1246: Update notes [puppet] - https://gerrit.wikimedia.org/r/1124031 (https://phabricator.wikimedia.org/T387673)
[07:32:09] <wikibugs>	 (CR) Marostegui: [C:+2] db1246: Update notes [puppet] - https://gerrit.wikimedia.org/r/1124031 (https://phabricator.wikimedia.org/T387673) (owner: Marostegui)
[07:32:23] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Patch-For-Review: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10594894 (Marostegui) p:Triage→Medium
[07:33:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1233', diff saved to https://phabricator.wikimedia.org/P73923 and previous config saved to /var/cache/conftool/dbconfig/20250303-073358-root.json
[07:35:15] <jinxer-wm>	 FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.295s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:38:03] <wikibugs>	 (CR) Nikerabbit: [C:-1] metawiki: Enable Chinese variant translation for message bundles (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: Abijeet Patro)
[07:40:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.295s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:41:16] <wikibugs>	 (PS1) Marostegui: db1246: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124036 (https://phabricator.wikimedia.org/T387673)
[07:42:30] <wikibugs>	 (CR) Marostegui: [C:+2] db1246: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124036 (https://phabricator.wikimedia.org/T387673) (owner: Marostegui)
[07:45:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:45:19] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2164,2186].codfw.wmnet,db1172.eqiad.wmnet with reason: Rebuilding indexes
[07:45:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1172 db2164', diff saved to https://phabricator.wikimedia.org/P73925 and previous config saved to /var/cache/conftool/dbconfig/20250303-074525-marostegui.json
[07:45:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1172.eqiad.wmnet
[07:45:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2164.codfw.wmnet
[07:46:07] <Ammar>	 !log T387658 Ran mwscript-k8s --comment="T387658" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bawiki --logwiki=metawiki 'Əkrəm Cəfər' 'Əkrəm'
[07:46:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:13] <stashbot>	 T387658: Unblock stuck global rename of Əkrəm - https://phabricator.wikimedia.org/T387658
[07:48:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2210 db1190', diff saved to https://phabricator.wikimedia.org/P73926 and previous config saved to /var/cache/conftool/dbconfig/20250303-074804-marostegui.json
[07:48:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1190.eqiad.wmnet
[07:48:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2210.codfw.wmnet
[07:49:43] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1123665 (owner: Muehlenhoff)
[07:50:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1027.eqiad.wmnet to cluster eqiad and group C
[07:50:16] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:51:06] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1027.eqiad.wmnet to cluster eqiad and group C
[07:52:05] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1172.eqiad.wmnet
[07:52:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2164.codfw.wmnet
[07:52:37] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Index rebuild
[07:52:49] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2164.codfw.wmnet with reason: Index rebuild
[07:53:16] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2210.codfw.wmnet
[07:53:41] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2210.codfw.wmnet with reason: Index rebuild
[07:55:19] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [debs/helmfile] - https://gerrit.wikimedia.org/r/1123593 (https://phabricator.wikimedia.org/T387376) (owner: JMeybohm)
[07:55:35] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1190.eqiad.wmnet
[08:00:06] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T0800).
[08:00:06] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:01:14] <kart_>	 Only my patch, I'll go ahead in a minute.
[08:03:19] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123802 (https://phabricator.wikimedia.org/T386719) (owner: Sbisson)
[08:03:59] <wikibugs>	 (Merged) jenkins-bot: Enable CX unified dashboard on sqwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1123802 (https://phabricator.wikimedia.org/T386719) (owner: Sbisson)
[08:04:23] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1123802|Enable CX unified dashboard on sqwiki (T386719)]]
[08:04:26] <stashbot>	 T386719: Deploy unified dashboard - https://phabricator.wikimedia.org/T386719
[08:08:49] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Index rebuild
[08:16:22] <logmsgbot>	 !log kartik@deploy2002 sbisson, kartik: Backport for [[gerrit:1123802|Enable CX unified dashboard on sqwiki (T386719)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:16:25] <stashbot>	 T386719: Deploy unified dashboard - https://phabricator.wikimedia.org/T386719
[08:18:21] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614#10594978 (MoritzMuehlenhoff) Requests to the wmf LDAP group are handled within Wikimedia IDM:  Can you please log into https://idm.wikimedia.org and request the group by following the steps listed at...
[08:19:19] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for chuckonwumelu - https://phabricator.wikimedia.org/T387627#10594981 (MoritzMuehlenhoff) Requests to the wmf LDAP group are handled within Wikimedia IDM:  Can you please log into https://idm.wikimedia.org and request the group by following the steps l...
[08:20:49] <logmsgbot>	 !log kartik@deploy2002 sbisson, kartik: Continuing with sync
[08:23:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:29:56] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123802|Enable CX unified dashboard on sqwiki (T386719)]] (duration: 25m 32s)
[08:29:59] <stashbot>	 T386719: Deploy unified dashboard - https://phabricator.wikimedia.org/T386719
[08:30:35] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2166.codfw.wmnet onto db2167.codfw.wmnet
[08:33:35] <wikibugs>	 (CR) Volans: "Some additional thoughts." [software/homer] - https://gerrit.wikimedia.org/r/1094291 (owner: Ayounsi)
[08:34:51] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3552 MB (3% inode=98%): /tmp 3552 MB (3% inode=98%): /var/tmp 3552 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[08:37:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10595023 (Peachey88)
[08:37:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387692#10595025 (Peachey88) →Duplicate dup:T382984
[08:40:38] <wikibugs>	 (PS1) Muehlenhoff: Add mszabo to analytics-privatedata-users and record new Kerberos access [puppet] - https://gerrit.wikimedia.org/r/1124038 (https://phabricator.wikimedia.org/T386918)
[08:40:39] <wikibugs>	 (PS1) Vgutierrez: hiera,prometheus: Enable IPIP on jobrunner@codfw [puppet] - https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295)
[08:40:40] <wikibugs>	 (PS1) Vgutierrez: hiera,prometheus: Enable IPIP on jobrunner@eqiad [puppet] - https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295)
[08:41:38] <wikibugs>	 (PS2) Elukey: kserve: allow Prometheus metrics to be fetched [deployment-charts] - https://gerrit.wikimedia.org/r/1123692 (https://phabricator.wikimedia.org/T387580)
[08:42:17] <wikibugs>	 (PS1) Federico Ceratto: clone.py: Fix fqdn variables [cookbooks] - https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023)
[08:42:37] <wikibugs>	 (CR) Marostegui: [C:+1] clone.py: Fix fqdn variables [cookbooks] - https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[08:43:15] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:43:39] <wikibugs>	 (03PS1) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042
[08:43:46] <wikibugs>	 (03CR) 10CI reject: [V:04-1] airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (owner: 10Brouberol)
[08:44:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add mszabo to analytics-privatedata-users and record new Kerberos access [puppet] - 10https://gerrit.wikimedia.org/r/1124038 (https://phabricator.wikimedia.org/T386918) (owner: 10Muehlenhoff)
[08:44:52] <wikibugs>	 (03PS2) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042
[08:45:51] <wikibugs>	 (03PS3) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700)
[08:46:46] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez)
[08:46:55] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez)
[08:48:52] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] clone.py: Fix fqdn variables [cookbooks] - 10https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto)
[08:49:07] <wikibugs>	 (03CR) 10Federico Ceratto: [V:03+1 C:03+2] clone.py: Fix fqdn variables [cookbooks] - 10https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto)
[08:49:10] <wikibugs>	 (03CR) 10Federico Ceratto: [V:03+2 C:03+2] clone.py: Fix fqdn variables [cookbooks] - 10https://gerrit.wikimedia.org/r/1124041 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto)
[08:49:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:49:25] <wikibugs>	 (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] Revert "Turn on Parsoid fragment support everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry)
[08:49:38] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry)
[08:50:39] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918#10595045 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Sorry for the delay! I've just merged a patc...
[08:54:16] <jinxer-wm>	 RESOLVED: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:55:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1246.eqiad.wmnet
[08:55:48] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1233.eqiad.wmnet onto db1246.eqiad.wmnet
[08:57:15] <jinxer-wm>	 FIRING: ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:59:24] <wikibugs>	 (03CR) 10Elukey: [C:03+1] hiera,docker_registry_ha: Enable IPIP on docker-registry@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez)
[09:01:12] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Remove upgrade checking and notice [debs/helmfile] - 10https://gerrit.wikimedia.org/r/1123593 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm)
[09:03:38] <wikibugs>	 (03PS4) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700)
[09:05:51] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,docker_registry_ha: Enable IPIP on docker-registry@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123413 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez)
[09:05:54] <wikibugs>	 (03PS5) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700)
[09:07:09] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: docker_registry_ha::registry@eqiad
[09:09:58] <wikibugs>	 (03PS6) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700)
[09:09:59] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1123797 (https://phabricator.wikimedia.org/T385908) (owner: 10Andrew Bogott)
[09:10:30] <wikibugs>	 (03PS7) 10Brouberol: airflow: define an hadoop-shell deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700)
[09:10:30] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Melos to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/1124045 (https://phabricator.wikimedia.org/T386581)
[09:10:43] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[09:11:53] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[09:11:53] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: docker_registry_ha::registry@eqiad
[09:12:17] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "looks reasonable" [debs/helm-diff] - 10https://gerrit.wikimedia.org/r/1123642 (https://phabricator.wikimedia.org/T387376) (owner: 10JMeybohm)
[09:13:30] <wikibugs>	 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Feb-Mar): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10595093 (10Nikerabbit)
[09:14:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add Melos to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/1124045 (https://phabricator.wikimedia.org/T386581) (owner: 10Muehlenhoff)
[09:17:37] <wikibugs>	 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Feb-Mar): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10595097 (10Nikerabbit) 05In progress→03Stalled
[09:19:21] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to stewards-users for Melos - https://phabricator.wikimedia.org/T386581#10595101 (10MoritzMuehlenhoff) 05In progress→03Resolved a:03MoritzMuehlenhoff @Melos Sorry for the delay! I've just merged a patch to enable your access. You s...
[09:20:44] <wikibugs>	 (03PS2) 10Vgutierrez: hiera,prometheus: Enable IPIP on jobrunner|videoscaler@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295)
[09:20:46] <wikibugs>	 (03PS2) 10Vgutierrez: hiera,prometheus: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295)
[09:22:15] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service idp2004:443 has failed probes (http_idp_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:22:24] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez)
[09:22:27] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez)
[09:23:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Add ep1c to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/1124047 (https://phabricator.wikimedia.org/T385808)
[09:23:48] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:28:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1246.eqiad.wmnet
[09:28:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add ep1c to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/1124047 (https://phabricator.wikimedia.org/T385808) (owner: 10Muehlenhoff)
[09:31:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:32:17] <wikibugs>	 (03PS4) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294)
[09:33:53] <wikibugs>	 (03PS5) 10Vgutierrez: hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294)
[09:34:39] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10595124 (10brouberol)
[09:35:46] <wikibugs>	 (03CR) 10Elukey: [C:03+1] hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez)
[09:37:36] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for EP1C - https://phabricator.wikimedia.org/T385808#10595139 (10MoritzMuehlenhoff) 05In progress→03Resolved a:03MoritzMuehlenhoff @EPIC  Sorry for the delay! I've just merged a patch to enable your access. You should now be able to SS...
[09:38:22] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,docker_registry_ha: Enable IPIP on docker-registry@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123414 (https://phabricator.wikimedia.org/T387294) (owner: 10Vgutierrez)
[09:38:37] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: docker_registry_ha::registry@codfw
[09:43:09] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10595179 (10MoritzMuehlenhoff)
[09:43:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1027.eqiad.wmnet to cluster eqiad and group A
[09:43:32] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1027.eqiad.wmnet to cluster eqiad and group A
[09:43:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1030.eqiad.wmnet to cluster eqiad and group A
[09:44:02] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:44:29] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:44:44] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1030.eqiad.wmnet to cluster eqiad and group A
[09:45:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:45:20] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707 (10MatthewVernon) 03NEW
[09:45:23] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10595196 (10MatthewVernon) p:05Triage→03High
[09:45:28] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[09:45:28] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: docker_registry_ha::registry@codfw
[09:46:50] <logmsgbot>	 !log mvernon@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ms-be1080.eqiad.wmnet with reason: disk failed
[09:46:56] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10595201 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=951f5a09-1cfc-43ed-af34-bcbbe604524f) set by mvernon@cumin1002 for 7 days, 0:00:00 on 1 host(s) and thei...
[09:47:06] <wikibugs>	 (03PS1) 10Elukey: knative-serving: add a variable in the templates for the Prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124050 (https://phabricator.wikimedia.org/T387580)
[09:47:19] <icinga-wm>	 RECOVERY - Disk space on ms-be1080 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1080&var-datasource=eqiad+prometheus/ops
[09:47:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:48:07] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:49:01] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051
[09:54:43] <wikibugs>	 (03PS1) 10Elukey: admin_ng: set a different Prometheus port for knative in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124052 (https://phabricator.wikimedia.org/T387580)
[09:54:56] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] mw-api-int: serve 10% of traffic on PHP 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123700 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[09:57:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[09:59:05] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051
[10:01:45] <jinxer-wm>	 FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[10:05:27] <hashar>	 I am going to cut a new release of `scap` which I need to finish T303828
[10:05:28] <stashbot>	 T303828: Delete wmf branches from Gerrit repositories - https://phabricator.wikimedia.org/T303828
[10:06:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73930 and previous config saved to /var/cache/conftool/dbconfig/20250303-100603-root.json
[10:11:38] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] Enroll 100% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[10:11:44] <jinxer-wm>	 RESOLVED: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[10:16:34] <wikibugs>	 (03PS3) 10Vgutierrez: hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295)
[10:16:34] <wikibugs>	 (03PS3) 10Vgutierrez: hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295)
[10:19:10] <wikibugs>	 (03PS1) 10Ladsgroup: labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589)
[10:19:12] <hashar>	 jouncebot: now
[10:19:13] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 40 minute(s)
[10:19:19] <hashar>	 I am upgrading scap
[10:19:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:20:35] <wikibugs>	 (03PS2) 10Ladsgroup: labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589)
[10:20:45] <jinxer-wm>	 FIRING: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[10:21:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73931 and previous config saved to /var/cache/conftool/dbconfig/20250303-102109-root.json
[10:21:44] <logmsgbot>	 !log hashar@deploy2002 Installing scap version "4.139.0" for 204 host(s)
[10:21:45] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124051
[10:21:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:22:24] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "woohoo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[10:23:05] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10595416 (10brouberol) a:03BTullis
[10:23:21] <Amir1>	 hashar: regarding your scap deploy, I'm merging and rebasing this beta cluster patch. Not deploying it so it shouldn't affect you, please tell if I need to stop https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124055
[10:23:33] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "Makes sense, thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[10:24:01] <hashar>	 if it is only for beta, in prod scap is nowadays smart enough to skip the patch :)
[10:24:21] <hashar>	 should be fine anyway, I am updating scap itself
[10:24:39] <hashar>	 and I am not sure how it is upgraded on beta
[10:24:45] <Amir1>	 I guessed, just wanted to be sure
[10:24:56] <Amir1>	 for beta, I know, it's not in a rush though
[10:25:03] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:25:43] <wikibugs>	 (03Merged) 10jenkins-bot: labs: Set thumbnail steps and ratio [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124055 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:26:03] <Amir1>	 done, rebased on deploy200x
[10:26:15] <logmsgbot>	 !log hashar@deploy2002 Installation of scap version "4.139.0" completed for 204 hosts
[10:26:42] <hashar>	 !log Upgraded scap to 4.139.0 # T303828
[10:26:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:44] <stashbot>	 T303828: Delete wmf branches from Gerrit repositories - https://phabricator.wikimedia.org/T303828
[10:27:44] <wikibugs>	 (03PS1) 10Marostegui: check_depooled.sh: Add pc1-pc7 [software] - 10https://gerrit.wikimedia.org/r/1124058
[10:28:11] <wikibugs>	 (03CR) 10Marostegui: "This is a noop" [software] - 10https://gerrit.wikimedia.org/r/1124058 (owner: 10Marostegui)
[10:28:13] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] check_depooled.sh: Add pc1-pc7 [software] - 10https://gerrit.wikimedia.org/r/1124058 (owner: 10Marostegui)
[10:28:22] <wikibugs>	 (03PS12) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[10:28:32] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db1248.eqiad.wmnet onto db1252.eqiad.wmnet
[10:28:40] <wikibugs>	 (03Merged) 10jenkins-bot: check_depooled.sh: Add pc1-pc7 [software] - 10https://gerrit.wikimedia.org/r/1124058 (owner: 10Marostegui)
[10:34:53] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1248.eqiad.wmnet onto db1252.eqiad.wmnet
[10:36:44] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Enable fixed Wikibase RDF everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344)
[10:36:44] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): Remove Wikibase fixed RDF feature flag again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118487 (https://phabricator.wikimedia.org/T384344)
[10:37:04] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,restbase: Enable IPIP on restbase-(backend|https)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124061 (https://phabricator.wikimedia.org/T387299)
[10:37:07] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,restbase: Enable IPIP on restbase-(backend|https)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299)
[10:37:10] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Okay to deploy later today." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[10:37:21] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[10:37:44] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124061 (https://phabricator.wikimedia.org/T387299) (owner: 10Vgutierrez)
[10:37:53] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299) (owner: 10Vgutierrez)
[10:38:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73933 and previous config saved to /var/cache/conftool/dbconfig/20250303-103820-root.json
[10:40:10] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.cf
[10:40:11] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.cf (exit_code=0)
[10:40:16] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng: set a different Prometheus port for knative in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124052 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey)
[10:40:54] <wikibugs>	 (03CR) 10Klausman: [C:03+1] knative-serving: add a variable in the templates for the Prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124050 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey)
[10:41:05] <wikibugs>	 (03CR) 10Klausman: [C:03+1] kserve: allow Prometheus metrics to be fetched [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123692 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey)
[10:42:07] <wikibugs>	 (03CR) 10Elukey: [C:03+2] kserve: allow Prometheus metrics to be fetched [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123692 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey)
[10:42:13] <wikibugs>	 (03CR) 10Elukey: [C:03+2] knative-serving: add a variable in the templates for the Prometheus port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124050 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey)
[10:42:18] <wikibugs>	 (03CR) 10Elukey: [C:03+2] admin_ng: set a different Prometheus port for knative in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124052 (https://phabricator.wikimedia.org/T387580) (owner: 10Elukey)
[10:43:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[10:44:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73934 and previous config saved to /var/cache/conftool/dbconfig/20250303-104438-root.json
[10:46:55] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] shellbox-media: serve 1/8 of requests on 8.1 with more logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French)
[10:48:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2060-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[10:49:10] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez)
[10:49:16] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: 10Vgutierrez)
[10:50:56] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-2] Add config needed to re-architecture mainstash away from x2 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: 10Ladsgroup)
[10:51:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1233.eqiad.wmnet onto db1246.eqiad.wmnet
[10:51:44] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 #page on db2166 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3650.59 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:52:21] <jelto>	 !incidents
[10:52:21] <sirenbot>	 5708 (UNACKED)  db2166 (paged)/MariaDB Replica Lag: s8 (paged)
[10:52:21] <sirenbot>	 5707 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[10:52:21] <sirenbot>	 5706 (RESOLVED)  Host db1246 (paged) - PING  - Packet loss = 100%
[10:52:35] <jelto>	 !ack 5708
[10:52:36] <sirenbot>	 5708 (ACKED)  db2166 (paged)/MariaDB Replica Lag: s8 (paged)
[10:52:38] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,analytics_cluster: Enable IPIP on datahubsearch@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124064 (https://phabricator.wikimedia.org/T387306)
[10:52:43] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:53:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73935 and previous config saved to /var/cache/conftool/dbconfig/20250303-105325-root.json
[10:53:41] <jynus>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104
[10:54:30] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124064 (https://phabricator.wikimedia.org/T387306) (owner: Vgutierrez)
[10:54:37] <sobanski>	 This one doesn't seem to have recent maintenance happening
[10:54:48] <jynus>	 no one logged in recently
[10:54:58] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:55:15] <jynus>	 but something happened at 08:30
[10:55:31] <jelto>	 see also -data-persistence
[10:55:48] <jynus>	 lagging for 2h 23m 49s
[10:55:51] <jelto>	 and yes federico executed a cookbook today at 8:30 utc
[10:55:57] <jelto>	 "START - Cookbook sre.mysql.clone of db2166.codfw.wmnet onto db2167.codfw.wmnet"
[10:56:30] <jynus>	 ^ federico3
[10:56:38] <jelto>	 ah sorry, that's 2167 not 2166
[10:56:55] <jelto>	 or both
[10:57:08] <sobanski>	 Looks like both
[10:57:55] <federico3>	 they've been doing cloning and 66 is still catching up
[10:58:27] <jynus>	 was it pooled?
[10:58:28] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2166 - catching up replication
[10:58:33] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2166 - catching up replication
[10:58:54] <jinxer-wm>	 FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[10:58:57] <jelto>	 the lag is going down https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2166&var-port=9104&viewPanel=6 on db2166
[10:59:03] <wikibugs>	 (PS1) Elukey: kserve: fix chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1124065 (https://phabricator.wikimedia.org/T387580)
[10:59:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73936 and previous config saved to /var/cache/conftool/dbconfig/20250303-105943-root.json
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1100)
[11:00:40] <jelto>	 lag is at ~8 minutes and going down fast
[11:00:55] <jelto>	 3 minutes
[11:01:14] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@codfw [puppet] - https://gerrit.wikimedia.org/r/1124039 (https://phabricator.wikimedia.org/T387295) (owner: Vgutierrez)
[11:01:27] <federico3>	 db2166 was not pooled (yet)
[11:01:35] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: mediawiki::jobrunner@codfw
[11:01:37] <jynus>	 ok, then no worries
[11:01:44] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 #page on db2166 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:01:52] <jynus>	 it was probably a downtime expiration or something like that
[11:01:58] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10595538 (MoritzMuehlenhoff)
[11:02:15] <wikibugs>	 (CR) Hnowlan: [C:+1] hiera,restbase: Enable IPIP on restbase-(backend|https)@codfw [puppet] - https://gerrit.wikimedia.org/r/1124061 (https://phabricator.wikimedia.org/T387299) (owner: Vgutierrez)
[11:02:20] <wikibugs>	 (CR) Hnowlan: [C:+1] hiera,restbase: Enable IPIP on restbase-(backend|https)@eqiad [puppet] - https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299) (owner: Vgutierrez)
[11:02:38] <jynus>	 wherever it was set up, it has to be extended, or acked + removed
[11:02:39] <jelto>	 :) thanks for the help, I'm a bit surprised by the page when it's not pooled, but yes, probably expired downtime (after 2 hours)
[11:03:27] <jynus>	 jelto: needs to be discussed, but the original reason is that it prevents accidental pool
[11:03:46] <jynus>	 better to alert before than after
[11:03:54] <jinxer-wm>	 FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:04:18] <wikibugs>	 (CR) Elukey: [C:+2] kserve: fix chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1124065 (https://phabricator.wikimedia.org/T387580) (owner: Elukey)
[11:05:22] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[11:05:51] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:07:19] <wikibugs>	 (CR) Aklapper: "I don't have permissions to +2 this one." [puppet] - https://gerrit.wikimedia.org/r/1101481 (https://phabricator.wikimedia.org/T309222) (owner: Aklapper)
[11:08:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2210 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73937 and previous config saved to /var/cache/conftool/dbconfig/20250303-110830-root.json
[11:08:54] <jinxer-wm>	 FIRING: [4x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:09:28] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Update to new upstream version 3.10.0 [debs/helm-diff] - https://gerrit.wikimedia.org/r/1123642 (https://phabricator.wikimedia.org/T387376) (owner: JMeybohm)
[11:09:35] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus_ferm_mss.service on mw2278:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:09:57] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[11:10:13] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[11:11:47] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:11:51] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[11:11:51] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: mediawiki::jobrunner@codfw
[11:11:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[11:12:09] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[11:13:54] <jinxer-wm>	 FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:14:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:14:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73938 and previous config saved to /var/cache/conftool/dbconfig/20250303-111448-root.json
[11:17:05] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[11:17:22] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[11:18:50] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2166.codfw.wmnet onto db2167.codfw.wmnet
[11:18:54] <jinxer-wm>	 FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:20:38] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918#10595558 (mszabo) Thanks!
[11:20:57] <wikibugs>	 (CR) AikoChou: [C:+1] ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1123938 (https://phabricator.wikimedia.org/T387547) (owner: Kevin Bazira)
[11:23:54] <jinxer-wm>	 FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:25:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:25:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73939 and previous config saved to /var/cache/conftool/dbconfig/20250303-112548-root.json
[11:27:33] <wikibugs>	 (PS1) Vgutierrez: prometheus: Support py3 from buster on mss-ferm [puppet] - https://gerrit.wikimedia.org/r/1124068 (https://phabricator.wikimedia.org/T387295)
[11:28:30] <wikibugs>	 (CR) Hnowlan: [C:+1] mw-(api-ext|web): scale next to 40% of main [deployment-charts] - https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[11:28:54] <jinxer-wm>	 FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1081-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:29:26] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10595564 (Marostegui) db1246 has been cloned. I will repool it tomorrow. I am sure it will sooner or later crash again, but we need to see if it is again the same HW error.
[11:29:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73940 and previous config saved to /var/cache/conftool/dbconfig/20250303-112954-root.json
[11:30:06] <wikibugs>	 (CR) Hnowlan: [C:+1] prometheus: Support py3 from buster on mss-ferm [puppet] - https://gerrit.wikimedia.org/r/1124068 (https://phabricator.wikimedia.org/T387295) (owner: Vgutierrez)
[11:30:18] <wikibugs>	 (CR) Clément Goubert: [C:+1] Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124051 (owner: Giuseppe Lavagetto)
[11:30:34] <wikibugs>	 (PS1) Muehlenhoff: Add harroyo-wmf to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1124069 (https://phabricator.wikimedia.org/T386922)
[11:31:05] <wikibugs>	 (CR) Fabfur: [C:+1] prometheus: Support py3 from buster on mss-ferm [puppet] - https://gerrit.wikimedia.org/r/1124068 (https://phabricator.wikimedia.org/T387295) (owner: Vgutierrez)
[11:31:34] <wikibugs>	 (CR) Vgutierrez: [C:+2] prometheus: Support py3 from buster on mss-ferm [puppet] - https://gerrit.wikimedia.org/r/1124068 (https://phabricator.wikimedia.org/T387295) (owner: Vgutierrez)
[11:32:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2206 db1249', diff saved to https://phabricator.wikimedia.org/P73941 and previous config saved to /var/cache/conftool/dbconfig/20250303-113225-root.json
[11:32:43] <wikibugs>	 (PS6) Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639
[11:32:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2206.codfw.wmnet
[11:32:43] <wikibugs>	 (PS6) Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1117225
[11:32:43] <wikibugs>	 (PS5) Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - https://gerrit.wikimedia.org/r/1117547
[11:32:44] <wikibugs>	 (PS6) Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548
[11:32:50] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1249.eqiad.wmnet
[11:33:03] <wikibugs>	 (CR) CI reject: [V:-1] Add the networkpolicy feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1117225 (owner: Giuseppe Lavagetto)
[11:33:10] <wikibugs>	 (CR) CI reject: [V:-1] Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548 (owner: Giuseppe Lavagetto)
[11:33:25] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki-common: introduce chart [deployment-charts] - https://gerrit.wikimedia.org/r/1117547 (owner: Giuseppe Lavagetto)
[11:33:43] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639 (owner: Giuseppe Lavagetto)
[11:35:42] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: Vgutierrez)
[11:35:48] <wikibugs>	 (PS4) Vgutierrez: hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295)
[11:36:28] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10595587 (MatthewVernon) @Jhancock.wm So this system has had new backplane and controller cards fitted? From comments on this ticket it looks like maybe controller cards have b...
[11:36:41] <wikibugs>	 (PS13) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[11:37:17] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera,mediawiki: Enable IPIP on jobrunner|videoscaler@eqiad [puppet] - https://gerrit.wikimedia.org/r/1124040 (https://phabricator.wikimedia.org/T387295) (owner: Vgutierrez)
[11:37:23] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus_ferm_mss.service on mw2278:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:37:28] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add harroyo-wmf to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1124069 (https://phabricator.wikimedia.org/T386922) (owner: Muehlenhoff)
[11:37:35] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: mediawiki::jobrunner@eqiad
[11:38:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2206.codfw.wmnet
[11:38:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1249.eqiad.wmnet
[11:38:52] <wikibugs>	 (PS14) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[11:38:54] <jinxer-wm>	 FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1100-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:40:15] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: Requesting access to Dashboards in Superset for harroyo-wmf - https://phabricator.wikimedia.org/T386922#10595594 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Sorry for the delay! I've just merged a patch to...
[11:40:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73942 and previous config saved to /var/cache/conftool/dbconfig/20250303-114054-root.json
[11:41:32] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[11:41:39] <wikibugs>	 (PS15) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[11:42:23] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: prometheus_ferm_mss.service on mw2278:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:42:27] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:42:27] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:42:40] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[11:42:40] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: mediawiki::jobrunner@eqiad
[11:42:45] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2166.codfw.wmnet onto db2167.codfw.wmnet
[11:43:07] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2166.codfw.wmnet onto db2167.codfw.wmnet
[11:44:06] <vgutierrez>	 ^^ that BGP alert was the pybal restart on lvs1019-lvs1020, should recover soon 
[11:44:44] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1123938 (https://phabricator.wikimedia.org/T387547) (owner: Kevin Bazira)
[11:45:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73943 and previous config saved to /var/cache/conftool/dbconfig/20250303-114500-root.json
[11:45:45] <wikibugs>	 (PS16) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[11:45:59] <wikibugs>	 (Merged) jenkins-bot: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1123938 (https://phabricator.wikimedia.org/T387547) (owner: Kevin Bazira)
[11:48:01] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db2166.codfw.wmnet onto db2167.codfw.wmnet
[11:48:19] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] START helmfile.d/services/ipoid: apply
[11:48:35] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1123005 (https://phabricator.wikimedia.org/T387305) (owner: Vgutierrez)
[11:48:54] <jinxer-wm>	 FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2061-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:48:59] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1249.eqiad.wmnet with reason: Index rebuild
[11:49:11] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2206.codfw.wmnet with reason: Index rebuild
[11:49:43] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[11:50:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:50:13] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:52:32] <wikibugs>	 (CR) CI reject: [V:-1] clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[11:52:46] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: restbase::production@codfw
[11:52:52] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera,restbase: Enable IPIP on restbase-(backend|https)@codfw [puppet] - https://gerrit.wikimedia.org/r/1124061 (https://phabricator.wikimedia.org/T387299) (owner: Vgutierrez)
[11:52:58] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[11:53:54] <jinxer-wm>	 FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2061-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:56:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73944 and previous config saved to /var/cache/conftool/dbconfig/20250303-115559-root.json
[11:56:25] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1124075
[11:56:40] <wikibugs>	 (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1124075 (owner: PipelineBot)
[11:56:50] <wikibugs>	 (CR) Jgiannelos: [V:+2 C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1124075 (owner: PipelineBot)
[11:56:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[11:57:02] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[11:57:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10595631 (MoritzMuehlenhoff) @Ben.buchenau It doesn't appear you have created a Wikimedia Developer Account yet? At least I'm unable to find an account linked to ben.bu...
[11:57:54] <wikibugs>	 (PS1) Jcrespo: dbbackups: Setup temporary archival job to archive RT database [puppet] - https://gerrit.wikimedia.org/r/1124076 (https://phabricator.wikimedia.org/T385901)
[11:58:20] <wikibugs>	 (CR) CI reject: [V:-1] dbbackups: Setup temporary archival job to archive RT database [puppet] - https://gerrit.wikimedia.org/r/1124076 (https://phabricator.wikimedia.org/T385901) (owner: Jcrespo)
[11:58:41] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[11:58:54] <jinxer-wm>	 FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[11:59:10] <wikibugs>	 (PS2) Jcrespo: dbbackups: Setup temporary archival job to archive RT database [puppet] - https://gerrit.wikimedia.org/r/1124076 (https://phabricator.wikimedia.org/T385901)
[11:59:39] <jayme>	 !log Imported helmfile 0.171.0-2 and helm-diff 3.10.0-1 to bullseye-wikimedia and bookworm-wikimedia - T341984 T387376 
[11:59:40] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[11:59:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:59:44] <stashbot>	 T341984: Update Kubernetes clusters to 1.31 - https://phabricator.wikimedia.org/T341984
[11:59:44] <stashbot>	 T387376: Respect kubeVersion constraints in deployment-charts CI - https://phabricator.wikimedia.org/T387376
[12:00:06] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:00:24] <wikibugs>	 (CR) D3r1ck01: [C:+1] Set $wgCentralAuthSharedDomainCallback [mediawiki-config] - https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: Gergő Tisza)
[12:00:36] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[12:00:36] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: restbase::production@codfw
[12:00:56] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[12:01:27] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[12:01:34] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[12:02:08] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[12:02:34] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera,wmcs: Enable IPIP on labweb-ssl@eqiad [puppet] - https://gerrit.wikimedia.org/r/1123005 (https://phabricator.wikimedia.org/T387305) (owner: Vgutierrez)
[12:02:42] <wikibugs>	 (CR) Jcrespo: [C:+2] dbbackups: Setup temporary archival job to archive RT database [puppet] - https://gerrit.wikimedia.org/r/1124076 (https://phabricator.wikimedia.org/T385901) (owner: Jcrespo)
[12:03:09] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wmcs::openstack::eqiad1::cloudweb@eqiad
[12:07:29] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[12:08:25] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:08:27] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:08:37] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[12:08:37] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wmcs::openstack::eqiad1::cloudweb@eqiad
[12:09:25] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: restbase::production@eqiad
[12:09:41] <wikibugs>	 (PS2) Vgutierrez: hiera,restbase: Enable IPIP on restbase-(backend|https)@eqiad [puppet] - https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299)
[12:10:55] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera,restbase: Enable IPIP on restbase-(backend|https)@eqiad [puppet] - https://gerrit.wikimedia.org/r/1124062 (https://phabricator.wikimedia.org/T387299) (owner: Vgutierrez)
[12:11:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73945 and previous config saved to /var/cache/conftool/dbconfig/20250303-121104-root.json
[12:11:28] <wikibugs>	 (PS1) Jcrespo: Revert "dbbackups: Setup temporary archival job to archive RT database" [puppet] - https://gerrit.wikimedia.org/r/1124082
[12:13:50] <wikibugs>	 (CR) Marostegui: clone.py, clone_test.py: Automate cloning (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[12:13:54] <jinxer-wm>	 FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:16:21] <wikibugs>	 (CR) Clément Goubert: php: Allow provisioning MediaWiki with PHP 8.1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: BryanDavis)
[12:16:23] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[12:16:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1237 with weight 0 T387557', diff saved to https://phabricator.wikimedia.org/P73946 and previous config saved to /var/cache/conftool/dbconfig/20250303-121623-root.json
[12:16:27] <stashbot>	 T387557: Switchover x1 master (db1220 -> db1237) - https://phabricator.wikimedia.org/T387557
[12:16:28] <wikibugs>	 (CR) Jcrespo: [C:+2] Revert "dbbackups: Setup temporary archival job to archive RT database" [puppet] - https://gerrit.wikimedia.org/r/1124082 (owner: Jcrespo)
[12:16:40] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: Primary switchover x1 T387557
[12:17:22] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1237 to x1 master [puppet] - https://gerrit.wikimedia.org/r/1123615 (https://phabricator.wikimedia.org/T387557) (owner: Gerrit maintenance bot)
[12:17:30] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[12:17:30] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: restbase::production@eqiad
[12:17:38] <wikibugs>	 (PS17) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[12:18:54] <jinxer-wm>	 FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:19:09] <jinxer-wm>	 FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:19:40] <vgutierrez>	 stevemunene, gehel ^^
[12:22:20] <dcausse>	 vgutierrez: re CirrusSearchNodeIndexingNotIncreasing: working on a fix
[12:22:47] <marostegui>	 !log Starting x1 eqiad failover from db1220 to db1237 - T387557
[12:22:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:49] <stashbot>	 T387557: Switchover x1 master (db1220 -> db1237) - https://phabricator.wikimedia.org/T387557
[12:23:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1237 to x1 primary T387557', diff saved to https://phabricator.wikimedia.org/P73947 and previous config saved to /var/cache/conftool/dbconfig/20250303-122304-root.json
[12:23:54] <jinxer-wm>	 FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:23:59] <wikibugs>	 (CR) CI reject: [V:-1] clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[12:24:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1220 T387557', diff saved to https://phabricator.wikimedia.org/P73948 and previous config saved to /var/cache/conftool/dbconfig/20250303-122437-marostegui.json
[12:25:52] <wikibugs>	 (PS4) Clément Goubert: Revert^2 "When executing cli scripts, wait for the service mesh" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124051 (https://phabricator.wikimedia.org/T387208) (owner: Giuseppe Lavagetto)
[12:26:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1233 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73949 and previous config saved to /var/cache/conftool/dbconfig/20250303-122609-root.json
[12:28:54] <jinxer-wm>	 FIRING: [11x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:29:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1220.eqiad.wmnet
[12:33:54] <jinxer-wm>	 FIRING: [13x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1054-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:33:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1220.eqiad.wmnet
[12:36:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P73950 and previous config saved to /var/cache/conftool/dbconfig/20250303-123651-root.json
[12:38:54] <jinxer-wm>	 FIRING: [12x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1065-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:40:15] <wikibugs>	 (CR) Kamila Součková: [C:-1] "I need to fix consumer_group" [deployment-charts] - https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: Kamila Součková)
[12:40:31] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1098 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 26 GB (1% inode=99%): /var/lib/hadoop/data/d 22 GB (1% inode=99%): /var/lib/hadoop/data/e 26 GB (1% inode=99%): /var/lib/hadoop/data/f 16 GB (0% inode=99%): /var/lib/hadoop/data/g 26 GB (1% inode=99%): /var/lib/hadoop/data/h 21 GB (1% inode=99%): /var/lib/hadoop/data/i 22 GB (1% inode=99%): /var/lib/hadoop/data/j 26 GB (1
[12:40:31] <icinga-wm>	 99%): /var/lib/hadoop/data/l 25 GB (1% inode=99%): /var/lib/hadoop/data/k 27 GB (1% inode=99%): /var/lib/hadoop/data/m 26 GB (1% inode=99%): /var/lib/hadoop/data/n 26 GB (1% inode=99%): /var/lib/hadoop/data/o 26 GB (1% inode=99%): /var/lib/hadoop/data/p 26 GB (1% inode=99%): /var/lib/hadoop/data/r 28 GB (1% inode=99%): /var/lib/hadoop/data/q 23 GB (1% inode=99%): /var/lib/hadoop/data/s 26 GB (1% inode=99%): /var/lib/hadoop/data/t 26 GB (1
[12:40:31] <icinga-wm>	 99%): /var/lib/hadoop/data/u 26 GB (1% inode=99%): /var/lib/hadoop/data/v 25 GB (1% inode=99%): /var/lib/hadoop/data/w 26 GB (1% inode=99%): /var/lib/hadoop/data/x 23 GB (1% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[12:45:01] <wikibugs>	 (PS2) Bartosz Dziewoński: Remove $wmgUseGraphWithJsonNamespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748)
[12:45:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: Gergő Tisza)
[12:46:53] <wikibugs>	 (CR) Clément Goubert: [C:+2] mediawiki: CronJob name as Job label [deployment-charts] - https://gerrit.wikimedia.org/r/1122563 (https://phabricator.wikimedia.org/T385709) (owner: Clément Goubert)
[12:48:54] <jinxer-wm>	 FIRING: [11x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1065-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:50:54] <wikibugs>	 (PS1) Muehlenhoff: Add migr to analytics-privatedata-users (plus Kerberos) [puppet] - https://gerrit.wikimedia.org/r/1124091 (https://phabricator.wikimedia.org/T387114)
[12:51:51] <wikibugs>	 (Merged) jenkins-bot: mediawiki: CronJob name as Job label [deployment-charts] - https://gerrit.wikimedia.org/r/1122563 (https://phabricator.wikimedia.org/T385709) (owner: Clément Goubert)
[12:51:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P73951 and previous config saved to /var/cache/conftool/dbconfig/20250303-125156-root.json
[12:52:10] <stevemunene>	 thanks dcausse !
[12:53:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[12:53:54] <jinxer-wm>	 FIRING: [9x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[12:55:22] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add migr to analytics-privatedata-users (plus Kerberos) [puppet] - https://gerrit.wikimedia.org/r/1124091 (https://phabricator.wikimedia.org/T387114) (owner: Muehlenhoff)
[12:55:47] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[12:55:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:56:03] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:58:38] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:58:44] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:58:47] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytics-Cluster + Kerberos for @Michael - https://phabricator.wikimedia.org/T387114#10595785 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Sorry for the delay! I've just merged a patch to enable your access (it...
[12:59:09] <jinxer-wm>	 FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[13:00:14] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytics-Cluster + Kerberos for @Michael - https://phabricator.wikimedia.org/T387114#10595788 (Michael) >>! In T387114#10595786, @MoritzMuehlenhoff wrote: > Sorry for the delay! I've just merged a patch to enable your access (it ta...
[13:01:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[13:01:13] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[13:02:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73952 and previous config saved to /var/cache/conftool/dbconfig/20250303-130247-root.json
[13:03:54] <jinxer-wm>	 RESOLVED: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1072-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[13:04:18] <wikibugs>	 ops-eqiad, SRE, Ceph, cloud-services-team, and 2 others: evaluate new drives in cloudcephosd102[123] - https://phabricator.wikimedia.org/T386725#10595793 (dcaro) I did a first check of the current values for the smartcl reported counters, all look good so far (no more Offline_Uncorrectable_Errors...
[13:06:07] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:06:31] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1098 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:07:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P73953 and previous config saved to /var/cache/conftool/dbconfig/20250303-130702-root.json
[13:07:21] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:07:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[13:10:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[13:10:37] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[13:10:54] <wikibugs>	 (PS5) Kamila Součková: benthos: add input/output config to chart [deployment-charts] - https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214)
[13:11:54] <wikibugs>	 (CR) Kamila Součková: benthos: add input/output config to chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: Kamila Součková)
[13:12:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[13:12:49] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[13:13:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73954 and previous config saved to /var/cache/conftool/dbconfig/20250303-131329-root.json
[13:14:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[13:14:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[13:15:16] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[13:15:45] <jinxer-wm>	 RESOLVED: [3x] CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[13:16:30] <wikibugs>	 (PS4) Clément Goubert: mediawiki: Change mwcron default concurrency policy [deployment-charts] - https://gerrit.wikimedia.org/r/1116800
[13:17:30] <tgr_>	 !log undid arbcom_ruwiki block of CirrusSearch_Streaming_Updater via blockUser.php
[13:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73955 and previous config saved to /var/cache/conftool/dbconfig/20250303-131752-root.json
[13:18:52] <wikibugs>	 (Abandoned) Samtar: IS: Enable wgUseCodexSpecialBlock on prod test.wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1114324 (https://phabricator.wikimedia.org/T377121) (owner: Samtar)
[13:19:15] <wikibugs>	 (PS1) JMeybohm: Build with CGO disabled, remove libc dependency [debs/helm-diff] - https://gerrit.wikimedia.org/r/1124096 (https://phabricator.wikimedia.org/T341984)
[13:22:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P73956 and previous config saved to /var/cache/conftool/dbconfig/20250303-132207-root.json
[13:22:22] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [debs/helmfile] - https://gerrit.wikimedia.org/r/1124097 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[13:22:29] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [debs/helm-diff] - https://gerrit.wikimedia.org/r/1124096 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[13:22:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[13:23:48] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:24:22] <moritzm>	 !log failover Ganeti master in eqiad to ganeti1048 T382507
[13:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:25] <stashbot>	 T382507: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507
[13:24:31] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Build with CGO disabled, remove libc dependency [debs/helmfile] - https://gerrit.wikimedia.org/r/1124097 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[13:24:50] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Build with CGO disabled, remove libc dependency [debs/helm-diff] - https://gerrit.wikimedia.org/r/1124096 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[13:25:42] <wikibugs>	 (PS7) Clément Goubert: mediawiki: Change mwcron default concurrency policy [deployment-charts] - https://gerrit.wikimedia.org/r/1116800
[13:26:44] <wikibugs>	 (PS1) Michael Große: feat(Surfacing): Add Change Tag for surfaced Add a Link [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1124098 (https://phabricator.wikimedia.org/T387160)
[13:27:01] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1124098 (https://phabricator.wikimedia.org/T387160) (owner: Michael Große)
[13:27:07] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:27:07] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:27:13] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti1028 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[13:27:57] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:27:57] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:28:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73957 and previous config saved to /var/cache/conftool/dbconfig/20250303-132834-root.json
[13:28:44] <wikibugs>	 (CR) Clément Goubert: [C:+2] mediawiki: Change mwcron default concurrency policy [deployment-charts] - https://gerrit.wikimedia.org/r/1116800 (owner: Clément Goubert)
[13:31:18] <wikibugs>	 (Merged) jenkins-bot: mediawiki: Change mwcron default concurrency policy [deployment-charts] - https://gerrit.wikimedia.org/r/1116800 (owner: Clément Goubert)
[13:32:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73958 and previous config saved to /var/cache/conftool/dbconfig/20250303-133258-root.json
[13:35:36] <logmsgbot>	 !log cgoubert@deploy2002 Started scap sync-world: Deploying 1116800 1122563
[13:37:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1220 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P73959 and previous config saved to /var/cache/conftool/dbconfig/20250303-133713-root.json
[13:37:19] <wikibugs>	 SRE, Infrastructure-Foundations: Provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135#10595950 (fgiunchedi) >>! In T78135#10591609, @jhathaway wrote: > I found this task while pondering similar functionality, as I have been using SystemRescue to troubleshoot some issues on our S...
[13:37:23] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:37:40] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap sync-world: Deploying 1116800 1122563 (duration: 02m 15s)
[13:39:35] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:48] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10595957 (Jhancock.wm) the controller card was replaced. the two backplanes were not. correct. I figured it was more likely the controller card since it was system wide and not...
[13:43:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73960 and previous config saved to /var/cache/conftool/dbconfig/20250303-134340-root.json
[13:44:00] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10595959 (Jhancock.wm) →Duplicate dup:T387431
[13:44:02] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10595961 (Jhancock.wm)
[13:44:24] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387612#10595963 (Jhancock.wm) →Duplicate dup:T387431
[13:44:25] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10595965 (Jhancock.wm)
[13:45:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[13:47:51] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10595969 (fgiunchedi) @Dzahn for the record, I get why you are renaming tasks with the hostname, though that will create more work since duplicate tasks will be opened again. The related work to fix t...
[13:48:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73961 and previous config saved to /var/cache/conftool/dbconfig/20250303-134804-root.json
[13:50:23] <wikibugs>	 (PS1) Fabfur: puppet: split puppet timer for calendar and startup run options [puppet] - https://gerrit.wikimedia.org/r/1124102 (https://phabricator.wikimedia.org/T383976)
[13:50:23] <wikibugs>	 SRE, Observability-Logging, Wikimedia-Apache-configuration: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#10595978 (fgiunchedi) Open→Declined >>! In T188601#10446174, @andrea.denisse wrote: > Is this related to T187434 ?  It is not, the work is t...
[13:50:45] <jinxer-wm>	 FIRING: [3x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[13:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:52:04] <wikibugs>	 (CR) CI reject: [V:-1] puppet: split puppet timer for calendar and startup run options [puppet] - https://gerrit.wikimedia.org/r/1124102 (https://phabricator.wikimedia.org/T383976) (owner: Fabfur)
[13:52:31] <wikibugs>	 (PS1) JMeybohm: Depend on helm or helm3 [debs/helm-diff] - https://gerrit.wikimedia.org/r/1124104 (https://phabricator.wikimedia.org/T341984)
[13:53:45] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Depend on helm or helm3 [debs/helm-diff] - https://gerrit.wikimedia.org/r/1124104 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[13:54:52] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2166.codfw.wmnet onto db2167.codfw.wmnet
[13:55:26] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[13:56:01] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[13:56:22] <wikibugs>	 ops-eqiad, SRE, Ceph, cloud-services-team, and 2 others: evaluate new drives in cloudcephosd102[123] - https://phabricator.wikimedia.org/T386725#10596009 (dcaro) Just checked the number of operations/s (as a proxy for performance):  * For cloudcephosd1021, comparing with 1018, there's a bit of an...
[13:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:56:47] <wikibugs>	 ops-eqiad, SRE, Ceph, cloud-services-team, and 2 others: evaluate new drives in cloudcephosd102[123] - https://phabricator.wikimedia.org/T386725#10596011 (dcaro) Open→Resolved
[13:57:21] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:57:22] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1068 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387732 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:57:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1068 - https://phabricator.wikimedia.org/T387732 (ops-monitoring-bot) NEW
[13:58:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73962 and previous config saved to /var/cache/conftool/dbconfig/20250303-135845-root.json
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1400).
[14:00:05] <jouncebot>	 SD_hehua, MatmaRex, ihurbain, Lucas_WMDE, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:57] <Lucas_WMDE>	 o/
[14:01:16] <ihurbain>	 o/
[14:01:19] <Lucas_WMDE>	 I’m here but wouldn’t mind someone else doing most of the deployments tbh  ^^
[14:01:51] <SD_hehua>	 hello
[14:02:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73963 and previous config saved to /var/cache/conftool/dbconfig/20250303-140249-root.json
[14:03:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73964 and previous config saved to /var/cache/conftool/dbconfig/20250303-140309-root.json
[14:04:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73965 and previous config saved to /var/cache/conftool/dbconfig/20250303-140414-root.json
[14:06:04] <Lucas_WMDE>	 ok let’s start with SD_hehua’s change then
[14:06:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055) (owner: 10SD hehua)
[14:06:18] <MatmaRex>	 hi. sorry i'm late
[14:06:20] <SD_hehua>	 ok
[14:06:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2162 db1203', diff saved to https://phabricator.wikimedia.org/P73966 and previous config saved to /var/cache/conftool/dbconfig/20250303-140638-marostegui.json
[14:06:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1203.eqiad.wmnet
[14:06:59] <wikibugs>	 (03Merged) 10jenkins-bot: Set Transwiki namespace on zhwikivoyage and zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055) (owner: 10SD hehua)
[14:07:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2162.codfw.wmnet
[14:07:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 855.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:07:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1121622|Set Transwiki namespace on zhwikivoyage and zhwikiversity (T387055)]]
[14:07:19] <stashbot>	 T387055: Set Transwiki namespace on zhwikivoyage and zhwikiversity - https://phabricator.wikimedia.org/T387055
[14:07:43] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Rebuilding indexes
[14:08:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "For the benthos part, can't meaningfully comment on the k8s (yet!)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková)
[14:08:59] <ihurbain>	 Lucas_WMDE: if you finish this one and then maybe yours (so that it's done and you can go do something else :P ) i can take over the rest if you want me to
[14:09:57] <Lucas_WMDE>	 sounds good to me, thanks!
[14:10:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10596056 (10MoritzMuehlenhoff) Sure thing, just give me a brief headsup on IRC whenever it works for you and I'll depool the server.
[14:10:09] * ihurbain logs into STUFF then
[14:12:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 855.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[14:12:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, sdhehua: Backport for [[gerrit:1121622|Set Transwiki namespace on zhwikivoyage and zhwikiversity (T387055)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:12:29] <stashbot>	 T387055: Set Transwiki namespace on zhwikivoyage and zhwikiversity - https://phabricator.wikimedia.org/T387055
[14:12:35] <Lucas_WMDE>	 SD_hehua: please test
[14:13:03] <SD_hehua>	 ok
[14:13:27] <SD_hehua>	 no problem found
[14:13:37] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, sdhehua: Continuing with sync
[14:13:39] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1203.eqiad.wmnet
[14:13:40] <Lucas_WMDE>	 cool, thanks!
[14:13:43] <wikibugs>	 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387733 (10phaultfinder) 03NEW
[14:13:46] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 #page on db1248 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5908.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:13:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2164 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73968 and previous config saved to /var/cache/conftool/dbconfig/20250303-141350-root.json
[14:13:58] <marostegui>	 federico3: db1248 is yours?
[14:13:59] <jelto>	 !incidents
[14:13:59] <sirenbot>	 5709 (UNACKED)  db1248 (paged)/MariaDB Replica Lag: s4 (paged)
[14:13:59] <sirenbot>	 5708 (RESOLVED)  db2166 (paged)/MariaDB Replica Lag: s8 (paged)
[14:13:59] <sirenbot>	 5707 (RESOLVED)  db1246 (paged)/MariaDB Replica IO: s2 (paged)
[14:13:59] <sirenbot>	 5706 (RESOLVED)  Host db1246 (paged) - PING  - Packet loss = 100%
[14:14:06] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2162.codfw.wmnet
[14:14:06] <marostegui>	 !ack 5709
[14:14:07] <sirenbot>	 5709 (ACKED)  db1248 (paged)/MariaDB Replica Lag: s4 (paged)
[14:14:07] <jelto>	 !ack 5709
[14:14:07] <sirenbot>	 5709 (ACKED)  db1248 (paged)/MariaDB Replica Lag: s4 (paged)
[14:14:18] <sobanski>	 Split second finish
[14:14:36] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1203.eqiad.wmnet with reason: Index rebuild
[14:14:51] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2162.codfw.wmnet with reason: Index rebuild
[14:15:04] <fabfur>	 tnx jelto
[14:15:04] <federico3>	 marostegui: yes, it's the source of cloning, should recover
[14:15:22] <federico3>	 it's the same glitch from before (fixed in the CR)
[14:15:25] <marostegui>	 federico3: Why did it send a p4ge? 
[14:15:27] <marostegui>	 Ah ok
[14:15:44] <jelto>	 great then I'll wait, I can confirm replica lag is going down already
[14:15:49] <jelto>	 for db1248
[14:15:55] <federico3>	 it seems it does not recover replication lag quickly enough when being added back
[14:17:22] <wikibugs>	 (03PS1) 10Muehlenhoff: Add mshilova to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1124107 (https://phabricator.wikimedia.org/T386754)
[14:17:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73969 and previous config saved to /var/cache/conftool/dbconfig/20250303-141754-root.json
[14:18:03] <wikibugs>	 (03CR) 10Kgraessle: [C:03+1] Add MP event stream for MassDelete workflows [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123435 (https://phabricator.wikimedia.org/T382147) (owner: 10Jsn.sherman)
[14:19:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73970 and previous config saved to /var/cache/conftool/dbconfig/20250303-141919-root.json
[14:20:45] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[14:21:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121622|Set Transwiki namespace on zhwikivoyage and zhwikiversity (T387055)]] (duration: 14m 02s)
[14:21:22] <stashbot>	 T387055: Set Transwiki namespace on zhwikivoyage and zhwikiversity - https://phabricator.wikimedia.org/T387055
[14:21:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1066* for ban elastic1066 to hopefully stop rejections - bking@cumin2002 - T387176
[14:21:37] <stashbot>	 T387176: Investigate eqiad Elastic cluster latency - https://phabricator.wikimedia.org/T387176
[14:21:38] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1066* for ban elastic1066 to hopefully stop rejections - bking@cumin2002 - T387176
[14:21:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[14:22:27] <wikibugs>	 (03Merged) 10jenkins-bot: Enable fixed Wikibase RDF everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1118486 (https://phabricator.wikimedia.org/T384344) (owner: 10Lucas Werkmeister (WMDE))
[14:22:42] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1118486|Enable fixed Wikibase RDF everywhere (T384344)]]
[14:22:45] <stashbot>	 T384344: Wikibase/Wikidata and WDQS disagree about statement, reference and value namespace prefixes - https://phabricator.wikimedia.org/T384344
[14:23:46] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Remove unused config variable $wgJsonConfigInterwikiPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122683
[14:23:58] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Fix inconsistent definitions for $wmgLocalServices['chart-renderer'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123766
[14:24:09] <wikibugs>	 (03PS5) 10Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711
[14:25:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[14:25:50] <jinxer-wm>	 CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow
[14:27:11] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1118486|Enable fixed Wikibase RDF everywhere (T384344)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:27:17] <Lucas_WMDE>	 testing…
[14:27:44] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[14:27:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[14:28:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: fix git origin at bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/1124109
[14:28:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
[14:28:57] <Lucas_WMDE>	 lgtm
[14:29:45] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[14:30:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10596196 (10phaultfinder)
[14:31:11] <wikibugs>	 (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] Change license for Russian Wikinews to CC-BY-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123495 (https://phabricator.wikimedia.org/T387279) (owner: 10Bartosz Dziewoński)
[14:31:48] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:33:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73971 and previous config saved to /var/cache/conftool/dbconfig/20250303-143259-root.json
[14:34:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73972 and previous config saved to /var/cache/conftool/dbconfig/20250303-143425-root.json
[14:35:31] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1118486|Enable fixed Wikibase RDF everywhere (T384344)]] (duration: 12m 49s)
[14:35:41] * Lucas_WMDE done deploying
[14:35:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ...
[14:35:49] <Lucas_WMDE>	 over to you ihurbain :)
[14:35:50] <ihurbain>	 MatmaRex: do you think we can do a single deploy for some of our patches? 
[14:35:50] <jinxer-wm>	 CirrusSearch consumer-cloudelastic@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow
[14:35:55] <ihurbain>	 Lucas_WMDE: ack :)
[14:35:57] <ihurbain>	 thank you :)
[14:36:14] <MatmaRex>	 ihurbain: certainly, even all of them if you wanted
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:36:53] <ihurbain>	 MatmaRex: like "all of them" -> the 6 of them or the 3 of them? :P
[14:37:00] <wikibugs>	 (03PS1) 10Federico Ceratto: db2167: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124112 (https://phabricator.wikimedia.org/T387660)
[14:37:21] <ihurbain>	 (i haven't looked at the last three tbh)
[14:37:26] <MatmaRex>	 ihurbain: heh, maybe just the three, i'm not that much of a maverick
[14:37:31] <ihurbain>	 :D
[14:37:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic1066.eqiad.wmnet
[14:37:41] <MatmaRex>	 it would probably be fine though ;)
[14:37:49] <ihurbain>	 okay, let's try that. (the three)
[14:38:01] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307)
[14:38:04] <MatmaRex>	 ihurbain: also, https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1122709 will need a cleanupTitle.php maintenance run on mediawikiwiki. should take just a few seconds.
[14:38:12] <MatmaRex>	 cleanupTitles.php *
[14:38:31] <ihurbain>	 *ah*.
[14:38:36] <Lucas_WMDE>	 oh, I also meant to check cleanupTitles (or whatever the right script was) for the new namespaces on the zh wikis but forgot
[14:38:37] <ihurbain>	 that's a new thing and i had missed that.
[14:38:55] <Lucas_WMDE>	 namespaceDupes is the one I meant I think
[14:38:56] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez)
[14:39:20] <ihurbain>	 mmmph.
[14:39:24] <wikibugs>	 (03CR) 10Elukey: "test-cookbook for puppetserver2004:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: 10Elukey)
[14:39:33] <Lucas_WMDE>	 ok nothing to do there
[14:40:05] <ihurbain>	 MatmaRex: if that works for you i'll do the parsoid one and the ruwikinews one first, and then the other one for the jsonnamespace one, because i don't want to rush that
[14:40:12] <Lucas_WMDE>	 (https://phabricator.wikimedia.org/T387055#10596231)
[14:40:40] <wikibugs>	 (03PS2) 10Federico Ceratto: db2166: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124112 (https://phabricator.wikimedia.org/T387660)
[14:40:40] <ihurbain>	 s/parsoid/fragments/
[14:40:54] <MatmaRex>	 ihurbain: sure
[14:40:59] <ihurbain>	 okay, let's do that then.
[14:41:09] <wikibugs>	 (03PS3) 10Federico Ceratto: db2146: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124112 (https://phabricator.wikimedia.org/T387660)
[14:41:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123495 (https://phabricator.wikimedia.org/T387279) (owner: 10Bartosz Dziewoński)
[14:41:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry)
[14:42:37] <wikibugs>	 (03Merged) 10jenkins-bot: Change license for Russian Wikinews to CC-BY-4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123495 (https://phabricator.wikimedia.org/T387279) (owner: 10Bartosz Dziewoński)
[14:42:40] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Turn on Parsoid fragment support everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123815 (https://phabricator.wikimedia.org/T387608) (owner: 10Subramanya Sastry)
[14:42:44] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[14:42:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[14:42:50] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[14:42:56] <logmsgbot>	 !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1123495|Change license for Russian Wikinews to CC-BY-4.0 (T387279)]], [[gerrit:1123815|Revert "Turn on Parsoid fragment support everywhere" (T387608)]]
[14:43:00] <stashbot>	 T387279: Change of default license for Russian Wikinews to CC-BY-4.0 - https://phabricator.wikimedia.org/T387279
[14:43:01] <stashbot>	 T387608: Refs inside {{efn}} are now outputting strip markers in Parsoid - https://phabricator.wikimedia.org/T387608
[14:44:05] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host puppetserver2004.codfw.wmnet with OS bookworm
[14:44:11] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10596271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm
[14:44:29] <wikibugs>	 (03PS2) 10Vgutierrez: hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307)
[14:44:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1066.eqiad.wmnet
[14:45:07] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:45:37] <wikibugs>	 (03Abandoned) 10Federico Ceratto: db2146: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124112 (https://phabricator.wikimedia.org/T387660) (owner: 10Federico Ceratto)
[14:45:39] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez)
[14:45:57] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:46:42] <logmsgbot>	 !log ihurbain@deploy2002 matmarex, ssastry, ihurbain: Backport for [[gerrit:1123495|Change license for Russian Wikinews to CC-BY-4.0 (T387279)]], [[gerrit:1123815|Revert "Turn on Parsoid fragment support everywhere" (T387608)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:46:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[14:46:50] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[14:46:54] <ihurbain>	 MatmaRex: we can check stuff
[14:47:52] <MatmaRex>	 ihurbain: ruwikinews looks good (https://ru.wikinews.org/w/api.php?action=query&meta=siteinfo&siprop=rightsinfo)
[14:47:59] <ihurbain>	 and mine looks good too
[14:48:00] <jinxer-wm>	 FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[14:48:04] <logmsgbot>	 !log ihurbain@deploy2002 matmarex, ssastry, ihurbain: Continuing with sync
[14:48:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73973 and previous config saved to /var/cache/conftool/dbconfig/20250303-144805-root.json
[14:48:15] <ihurbain>	 wheeee!
[14:49:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73974 and previous config saved to /var/cache/conftool/dbconfig/20250303-144930-root.json
[14:51:45] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[14:51:46] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 #page on db1248 is OK: OK slave_sql_lag Replication lag: 50.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:52:22] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] [Growth] Add mediawiki.product_metrics.growth_product_interaction stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123607 (https://phabricator.wikimedia.org/T387286) (owner: 10Sergio Gimeno)
[14:52:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad
[14:52:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad
[14:53:08] <wikibugs>	 (03PS1) 10Federico Ceratto: db1252.yaml, db2167.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124114
[14:53:39] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] db1252.yaml, db2167.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124114 (owner: 10Federico Ceratto)
[14:54:22] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] db1252.yaml, db2167.yaml: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1124114 (owner: 10Federico Ceratto)
[14:54:36] <logmsgbot>	 !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123495|Change license for Russian Wikinews to CC-BY-4.0 (T387279)]], [[gerrit:1123815|Revert "Turn on Parsoid fragment support everywhere" (T387608)]] (duration: 11m 39s)
[14:54:41] <stashbot>	 T387279: Change of default license for Russian Wikinews to CC-BY-4.0 - https://phabricator.wikimedia.org/T387279
[14:54:41] <stashbot>	 T387608: Refs inside {{efn}} are now outputting strip markers in Parsoid - https://phabricator.wikimedia.org/T387608
[14:54:49] <ihurbain>	 all right
[14:55:13] <ihurbain>	 MatmaRex: i'm running the other one; while re-reading doc about running maintenance scripts :P
[14:55:45] <MatmaRex>	 thanks
[14:56:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz Dziewoński)
[14:56:32] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage
[14:56:53] <wikibugs>	 (03Merged) 10jenkins-bot: Remove $wmgUseGraphWithJsonNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122709 (https://phabricator.wikimedia.org/T124748) (owner: 10Bartosz Dziewoński)
[14:56:53] <MatmaRex>	 apparently it's more complicated post-kubernetes https://wikitech.wikimedia.org/wiki/Maintenance_scripts
[14:57:12] <logmsgbot>	 !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1122709|Remove $wmgUseGraphWithJsonNamespace (T124748)]]
[14:57:14] <stashbot>	 T124748: Deprecate Graph and Data namespaces on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748
[14:57:36] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "I'm not really persuaded this would ever be an issue, or it's worth the pain of reworking dashboards now.  But no real objection I guess, " [puppet] - 10https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[14:58:04] <MatmaRex>	 but i think you'll need something like… this? mwscript-k8s -f --comment="T124748" -- cleanupTitles --wiki=mediawikiwiki
[14:58:20] <ihurbain>	 yeah i'm slowly reaching that conclusion :D
[14:59:50] <ihurbain>	 (aaaaa!) (it's fine. PROBABLY.)
[14:59:51] <logmsgbot>	 !log ihurbain@deploy2002 matmarex, ihurbain: Backport for [[gerrit:1122709|Remove $wmgUseGraphWithJsonNamespace (T124748)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:59:58] <ihurbain>	 MatmaRex: test servers on
[15:00:51] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed
[15:00:56] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed
[15:01:58] <MatmaRex>	 ihurbain: seems good
[15:02:37] <logmsgbot>	 !log ihurbain@deploy2002 matmarex, ihurbain: Continuing with sync
[15:03:00] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2004.codfw.wmnet with reason: host reimage
[15:03:07] <ihurbain>	 vroom. 
[15:03:27] <ihurbain>	 note: backport window running a bit over, probably 10 minutes or so
[15:06:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:22] <ottomata>	 ihurbain: thank you! i'd like to do a non-urgent deploy when it is finished, please let me know
[15:08:00] <ihurbain>	 ack - sorry for the delay!
[15:08:50] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 13Patch-For-Review: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754#10596392 (10Ahoelzl) Approved.
[15:09:07] <logmsgbot>	 !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122709|Remove $wmgUseGraphWithJsonNamespace (T124748)]] (duration: 11m 55s)
[15:09:10] <stashbot>	 T124748: Deprecate Graph and Data namespaces on mediawiki.org and collab.wikimedia.org - https://phabricator.wikimedia.org/T124748
[15:09:15] <ihurbain>	 okay. now for the maintenance script.
[15:09:15] <ottomata>	 (no worries at all! take your time)
[15:09:42] <ihurbain>	 (doing a dry-run first because eh.)
[15:11:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Pooling in after cloning to db1252 T385141', diff saved to https://phabricator.wikimedia.org/P73976 and previous config saved to /var/cache/conftool/dbconfig/20250303-151103-fceratto.json
[15:11:07] <stashbot>	 T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141
[15:11:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73977 and previous config saved to /var/cache/conftool/dbconfig/20250303-151107-root.json
[15:11:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2206 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73978 and previous config saved to /var/cache/conftool/dbconfig/20250303-151113-root.json
[15:11:22] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed
[15:11:33] <ihurbain>	 MatmaRex: maintenance script is run, it says "Finished page... 2 of 1804456 rows updated"
[15:11:51] <MatmaRex>	 ihurbain: thanks
[15:11:52] <MatmaRex>	 hmm, 2?
[15:12:10] <ihurbain>	 https://phabricator.wikimedia.org/T124748#10588324 i think that's expected?
[15:12:49] <ihurbain>	 mmh
[15:12:53] <MatmaRex>	 oh, it had a talk page
[15:12:59] <MatmaRex>	 yes, all good
[15:13:08] <MatmaRex>	 i expected only this one: https://www.mediawiki.org/wiki/Broken/NS486:Json:Wikicon i didn't realize it had a talk too
[15:13:09] <ihurbain>	 aha.
[15:13:13] <ihurbain>	 amazing.
[15:13:26] <MatmaRex>	 which is now https://www.mediawiki.org/wiki/Broken/NS487:Json:Wikicon
[15:13:33] <MatmaRex>	 thanks for deploying!
[15:13:43] <ihurbain>	 then: deployment window is over, you still have your three "if we have time" to move to another one, and we're done - ottomata you're free to go!
[15:13:57] <ihurbain>	 MatmaRex: thank you for having a patch that taught me something :D
[15:14:09] <wikibugs>	 (03PS3) 10Vgutierrez: hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307)
[15:16:30] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez)
[15:16:39] <jinxer-wm>	 RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:18:28] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2004.codfw.wmnet with OS bookworm
[15:18:42] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q2:rack/setup/install puppetserver2004 - https://phabricator.wikimedia.org/T381274#10596416 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1002 for host puppetserver2004.codfw.wmnet with OS bookworm completed: -...
[15:20:56] <ihurbain>	 should https://wikitech.wikimedia.org/wiki/Maintenance_scripts be linked from https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers somewhere and/or " (At some point, this will be obsoleted by mwscript-k8s, currently under development at T341553.)" be updated on that page? (I'm happy to do this, I just don't want to be "too early" - my understanding is "that This (mwscript-k8s) Is Now The Way", but double 
[15:20:56] <ihurbain>	 checking before I touch stuff :D (yeah yeah i know the wiki way)
[15:20:57] <stashbot>	 T341553: Allow running one-off scripts manually - https://phabricator.wikimedia.org/T341553
[15:21:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[15:21:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:21:50] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[15:23:26] <ihurbain>	 (bah. i'll update doc, if someone disagrees they can fix it.)
[15:23:45] <claime>	 ihurbain: yeah it should probably be updated to use mwscript-k8s by default, at least we would catch more potential bugs and issues
[15:23:55] <wikibugs>	 (PS16) Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231)
[15:23:55] <wikibugs>	 (CR) Tiziano Fogli: "It will work for pro4x PDUs out of the box. However, some modifications to the netbox-hiera outputs will be needed to include the PDU mode" [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[15:24:03] <wikibugs>	 (PS7) Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231)
[15:24:13] <ihurbain>	 oops i forgot to log the end of the window
[15:24:21] <ihurbain>	 !log UTC afternoon deploys done
[15:24:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:24] <ihurbain>	 (hopla.)
[15:28:37] <ihurbain>	 https://wikitech.wikimedia.org/w/index.php?title=Backport_windows%2FDeployers&diff=2279007&oldid=2238500 hop.
[15:32:37] <wikibugs>	 (PS17) Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: Arnaudb)
[15:33:28] <wikibugs>	 (CR) Eevans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1124109 (owner: Filippo Giunchedi)
[15:35:16] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2166 gradually with 4 steps - Cloned db2166 to db2167
[15:35:31] <ottomata>	 ihurbain: thank you!
[15:36:16] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[15:36:25] <wikibugs>	 (PS18) Federico Ceratto: sre.mysql.pool: sanity check for depool operations [cookbooks] - https://gerrit.wikimedia.org/r/1084813 (https://phabricator.wikimedia.org/T378572) (owner: Arnaudb)
[15:36:34] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[15:36:38] <ottomata>	 !log deploying eventgate-logging-external to bump to node20 - T383814
[15:36:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:41] <stashbot>	 T383814: Upgrade eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814
[15:36:44] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[15:37:23] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[15:37:44] <wikibugs>	 (PS4) JMeybohm: Respect kubeVersion constraints in charts and admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376)
[15:37:50] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: fix git origin at bootstrap [puppet] - https://gerrit.wikimedia.org/r/1124109 (owner: Filippo Giunchedi)
[15:37:54] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376) (owner: JMeybohm)
[15:38:36] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[15:40:29] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: improve pontoon-wait-puppet [puppet] - https://gerrit.wikimedia.org/r/1124122
[15:40:44] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2167 gradually with 4 steps - Cloned db2166 to db2167
[15:41:44] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1248 gradually with 4 steps - Cloning db1252.eqiad.wmnet completed
[15:41:46] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[15:42:12] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[15:43:20] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add mshilova to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1124107 (https://phabricator.wikimedia.org/T386754) (owner: Muehlenhoff)
[15:43:24] <wikibugs>	 (CR) Scott French: "Thank you both for the reviews!" [deployment-charts] - https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[15:43:25] <wikibugs>	 (CR) Scott French: [C:+2] shellbox-media: serve 1/8 of requests on 8.1 with more logging [deployment-charts] - https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[15:44:30] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: improve pontoon-wait-puppet [puppet] - https://gerrit.wikimedia.org/r/1124122 (owner: Filippo Giunchedi)
[15:45:04] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Patch-For-Review: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754#10596562 (MoritzMuehlenhoff) Open→Resolved a:MoritzM...
[15:45:08] <wikibugs>	 (Merged) jenkins-bot: shellbox-media: serve 1/8 of requests on 8.1 with more logging [deployment-charts] - https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[15:47:08] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics-privatedata-users ssh access and Kerberos identity for YLiou_WMF - https://phabricator.wikimedia.org/T385220#10596605 (MoritzMuehlenhoff) Open→Stalled
[15:47:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[15:47:58] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[15:48:26] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387748 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[15:48:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387748 (ops-monitoring-bot) NEW
[15:50:28] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[15:50:33] <wikibugs>	 (PS1) JMeybohm: Set DH_GOLANG_BUILDPKG in debian/rules [debs/helm-diff] - https://gerrit.wikimedia.org/r/1124136
[15:50:38] <wikibugs>	 (PS1) Vgutierrez: hiera,eventschemas: Enable IPIP on schema@codfw [puppet] - https://gerrit.wikimedia.org/r/1124137 (https://phabricator.wikimedia.org/T387308)
[15:50:39] <wikibugs>	 (PS1) Vgutierrez: hiera,eventschemas: Enable IPIP on schema@eqiad [puppet] - https://gerrit.wikimedia.org/r/1124138 (https://phabricator.wikimedia.org/T387308)
[15:50:43] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: add pontoonctl wait-puppet command [puppet] - https://gerrit.wikimedia.org/r/1124123
[15:50:43] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[15:50:56] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: add pontoonctl wait-puppet command [puppet] - https://gerrit.wikimedia.org/r/1124123 (owner: Filippo Giunchedi)
[15:51:24] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124138 (https://phabricator.wikimedia.org/T387308) (owner: Vgutierrez)
[15:51:29] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124137 (https://phabricator.wikimedia.org/T387308) (owner: Vgutierrez)
[15:51:29] <swfrench-wmf>	 !log started shellbox-media PHP 8.1 pilot with increased logging - T377038
[15:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:32] <stashbot>	 T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038
[15:51:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:52:56] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: misc ctl improvements [puppet] - https://gerrit.wikimedia.org/r/1124124
[15:53:56] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2155,2187].codfw.wmnet with reason: Rebuilding indexes
[15:54:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1247 db2155', diff saved to https://phabricator.wikimedia.org/P73985 and previous config saved to /var/cache/conftool/dbconfig/20250303-155447-marostegui.json
[15:54:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1247.eqiad.wmnet
[15:55:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2155.codfw.wmnet
[15:55:53] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: misc ctl improvements [puppet] - https://gerrit.wikimedia.org/r/1124124 (owner: Filippo Giunchedi)
[15:56:07] <wikibugs>	 (PS2) Filippo Giunchedi: pontoon: misc ctl improvements [puppet] - https://gerrit.wikimedia.org/r/1124124
[15:56:32] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] pontoon: misc ctl improvements [puppet] - https://gerrit.wikimedia.org/r/1124124 (owner: Filippo Giunchedi)
[15:58:04] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1252 gradually with 4 steps - Cloned db124 to db1252
[15:58:06] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1252 gradually with 4 steps - Cloned db124 to db1252
[15:58:42] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1252 gradually with 4 steps - Cloned db124 to db1252
[15:58:44] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1252 gradually with 4 steps - Cloned db124 to db1252
[15:59:35] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Requesting access to Dashboards in Superset for harroyo-wmf - https://phabricator.wikimedia.org/T386922#10596705 (hector.arroyo) Thanks!
[16:00:05] <wikibugs>	 (CR) CI reject: [V:-1] Set DH_GOLANG_BUILDPKG in debian/rules [debs/helm-diff] - https://gerrit.wikimedia.org/r/1124136 (owner: JMeybohm)
[16:00:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1247.eqiad.wmnet
[16:01:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2155.codfw.wmnet
[16:01:20] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1247.eqiad.wmnet with reason: Index rebuild
[16:01:37] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Index rebuild
[16:03:58] <wikibugs>	 (PS1) Scott French: Revert "shellbox-media: serve 1/8 of requests on 8.1 with more logging" [deployment-charts] - https://gerrit.wikimedia.org/r/1124140 (https://phabricator.wikimedia.org/T377038)
[16:06:33] <wikibugs>	 (CR) Scott French: [C:+2] Revert "shellbox-media: serve 1/8 of requests on 8.1 with more logging" [deployment-charts] - https://gerrit.wikimedia.org/r/1124140 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[16:08:11] <wikibugs>	 (Merged) jenkins-bot: Revert "shellbox-media: serve 1/8 of requests on 8.1 with more logging" [deployment-charts] - https://gerrit.wikimedia.org/r/1124140 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[16:10:00] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[16:10:06] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[16:10:19] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[16:10:23] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[16:10:49] <swfrench-wmf>	 !log finished shellbox-media PHP 8.1 pilot - T377038
[16:10:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:51] <stashbot>	 T377038: Migrate production Shellbox variants to PHP 8.1 - https://phabricator.wikimedia.org/T377038
[16:13:34] <wikibugs>	 (PS18) Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[16:13:54] <wikibugs>	 (PS2) Filippo Giunchedi: pontoon: add config set/get to base, rework tests [puppet] - https://gerrit.wikimedia.org/r/1124125
[16:14:10] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: prompt the user for host prefix and save it to config [puppet] - https://gerrit.wikimedia.org/r/1124126
[16:14:26] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: add config set/get to base, rework tests [puppet] - https://gerrit.wikimedia.org/r/1124125 (owner: Filippo Giunchedi)
[16:17:09] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: refactor controller class into its own file [puppet] - https://gerrit.wikimedia.org/r/1124127
[16:17:11] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: turn relative imports into absolute [puppet] - https://gerrit.wikimedia.org/r/1124128
[16:17:12] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: improve enroll experience [puppet] - https://gerrit.wikimedia.org/r/1124129
[16:17:13] <wikibugs>	 (PS3) Filippo Giunchedi: pontoon: rework bootstrap instructions in README.md [puppet] - https://gerrit.wikimedia.org/r/1124130
[16:18:04] <moritzm>	 !log depool maps2009 T387431
[16:18:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:07] <stashbot>	 T387431: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431
[16:18:13] <wikibugs>	 (PS1) Tiziano Fogli: network_devices: adding device model [cookbooks] - https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231)
[16:18:13] <wikibugs>	 (CR) Tiziano Fogli: "I tried the updated GraphQL query manually, but I didn't test it with test-cookbook." [cookbooks] - https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[16:18:51] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] pontoon: prompt the user for host prefix and save it to config [puppet] - https://gerrit.wikimedia.org/r/1124126 (owner: Filippo Giunchedi)
[16:19:23] <wikibugs>	 (CR) Filippo Giunchedi: [V:+2 C:+2] pontoon: refactor controller class into its own file [puppet] - https://gerrit.wikimedia.org/r/1124127 (owner: Filippo Giunchedi)
[16:19:38] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: turn relative imports into absolute [puppet] - https://gerrit.wikimedia.org/r/1124128 (owner: Filippo Giunchedi)
[16:19:47] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: improve enroll experience [puppet] - https://gerrit.wikimedia.org/r/1124129 (owner: Filippo Giunchedi)
[16:19:58] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] pontoon: rework bootstrap instructions in README.md [puppet] - https://gerrit.wikimedia.org/r/1124130 (owner: Filippo Giunchedi)
[16:20:11] <wikibugs>	 (CR) CI reject: [V:-1] clone.py, clone_test.py: Automate cloning [cookbooks] - https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: Federico Ceratto)
[16:20:50] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2166 gradually with 4 steps - Cloned db2166 to db2167
[16:21:08] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:21:08] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:23:08] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.411 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:23:08] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53658 bytes in 8.603 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:25:00] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10596835 (MatthewVernon) I still don't see that Dell can claim we're using the drives incorrectly given they sold us this setup?  I think I'd tend to try swapping the backplane...
[16:26:20] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2167 gradually with 4 steps - Cloned db2166 to db2167
[16:26:32] <wikibugs>	 (PS2) Ottomata: eventgate-logging-external - upgrade to node20 [deployment-charts] - https://gerrit.wikimedia.org/r/1122159 (https://phabricator.wikimedia.org/T383814)
[16:26:32] <wikibugs>	 (PS1) Ottomata: eventgate-analytics-external - upgrade to node20 [deployment-charts] - https://gerrit.wikimedia.org/r/1124144 (https://phabricator.wikimedia.org/T383814)
[16:28:13] <wikibugs>	 (CR) Ottomata: [V:+2 C:+2] eventgate-logging-external - upgrade to node20 [deployment-charts] - https://gerrit.wikimedia.org/r/1122159 (https://phabricator.wikimedia.org/T383814) (owner: Ottomata)
[16:28:15] <wikibugs>	 (PS1) Muehlenhoff: Record LDAP access for dhardy [puppet] - https://gerrit.wikimedia.org/r/1124145 (https://phabricator.wikimedia.org/T387157)
[16:29:25] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to  ldap/wmf  for DHardy-WMF - https://phabricator.wikimedia.org/T387157#10596861 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Access has been granted via Wikimedia IDM.
[16:29:49] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Record LDAP access for dhardy [puppet] - https://gerrit.wikimedia.org/r/1124145 (https://phabricator.wikimedia.org/T387157) (owner: Muehlenhoff)
[16:30:05] <jouncebot>	 jan_drewniak: That opportune time for a Wikimedia Portals Update deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1630).
[16:31:26] <ottomata>	 jan_drewniak: I have an unrelated deployment to do, mind if I go now, or should I wait?  
[16:32:35] <wikibugs>	 (PS1) Muehlenhoff: Track LDAP access for chuckonwumelu [puppet] - https://gerrit.wikimedia.org/r/1124146 (https://phabricator.wikimedia.org/T387627)
[16:33:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387748#10596880 (Pppery) →Duplicate dup:T382984
[16:33:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10596882 (Pppery)
[16:33:20] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Track LDAP access for chuckonwumelu [puppet] - https://gerrit.wikimedia.org/r/1124146 (https://phabricator.wikimedia.org/T387627) (owner: Muehlenhoff)
[16:34:13] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for chuckonwumelu - https://phabricator.wikimedia.org/T387627#10596885 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Access has been enabled via Wikimedia IDM.
[16:34:23] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1248.eqiad.wmnet onto db1252.eqiad.wmnet
[16:34:29] <wikibugs>	 SRE: Ops-monitoring-bot creating duplicate tasks for the same RAID failure - https://phabricator.wikimedia.org/T387754 (Pppery) NEW
[16:37:23] <jinxer-wm>	 FIRING: ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:37:32] <ottomata>	 jan_drewniak: i'm assuming it is okay! proceeding.
[16:37:36] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[16:38:04] <ottomata>	 !log deploying eventgate-logging-external to ACTUALLY bump to node20 - T383814
[16:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:07] <stashbot>	 T383814: Upgrade eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814
[16:38:10] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10596921 (MoritzMuehlenhoff) @HCoplin-WMF This needs approval by your manager.
[16:38:14] <wikibugs>	 (CR) BryanDavis: php: Allow provisioning MediaWiki with PHP 8.1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: BryanDavis)
[16:38:20] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[16:41:18] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for hswan - https://phabricator.wikimedia.org/T387522#10596943 (MoritzMuehlenhoff) @HSwan-WMF: Requests to the wmf and logstash-accress LDAP groups are handled within Wikimedia IDM: Can you please log into https://idm.wikimedia.org and request the groups by...
[16:42:03] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[16:42:23] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:42:49] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[16:43:11] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[16:46:53] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:47:06] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:48:05] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[16:49:02] <wikibugs>	 (CR) Elukey: "The code seems not working for UEFI, since "PXE" seems only valid in the "Legacy" domain. I am going to check other UEFI nodes to figure o" [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577) (owner: Elukey)
[16:49:35] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:49:49] <wikibugs>	 (PS1) Vgutierrez: site,hiera: Reimage lvs5006 as liberica [puppet] - https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477)
[16:50:37] <wikibugs>	 (PS2) Vgutierrez: site,hiera: Reimage lvs5006 as liberica [puppet] - https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477)
[16:50:54] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[16:52:18] <jinxer-wm>	 FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported
[16:52:23] <jinxer-wm>	 RESOLVED: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[16:53:02] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10597002 (MoritzMuehlenhoff) @Jhancock.wm maps2009 is ready
[16:53:24] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10597003 (elukey) I was able to run `restart` (the command is not visible in the help, but available) and the output was:  ` elukey@ms...
[16:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:59:21] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614#10597013 (bwang)
[16:59:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614#10597015 (bwang) @MoritzMuehlenhoff Sorry I realized i already have wmf access, i need access to 'analytics-privatedata-users' for private data on superset
[17:02:21] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123607 (https://phabricator.wikimedia.org/T387286) (owner: Sergio Gimeno)
[17:02:43] <wikibugs>	 (CR) JMeybohm: [V:+2 C:+2] Set DH_GOLANG_BUILDPKG in debian/rules [debs/helm-diff] - https://gerrit.wikimedia.org/r/1124136 (owner: JMeybohm)
[17:03:09] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, and 2 others: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10597032 (Jclark-ctr) @fnegri @VRiley-WMF   did this need to be reopened.   idrac shows  The system inlet temperature is greater than the...
[17:03:13] <wikibugs>	 (PS1) JMeybohm: Use goccy/go-yaml instead of gopkg.in/yaml.v2 by default [debs/helmfile] - https://gerrit.wikimedia.org/r/1124151 (https://phabricator.wikimedia.org/T341984)
[17:05:22] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614#10597057 (Jdrewniak) >  - access request (or expansion) has sign off of WMF sponsor/manager (sponsor for volunteers, manager for wmf staff) Just confirming that @bwang needs access to  'analytics-priv...
[17:06:07] <wikibugs>	 (CR) Volans: [C:+1] "The addition LGTM, nothing major and the model can surely be useful to simplify the related prometheus code I see in the other CRs related" [cookbooks] - https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[17:08:31] <wikibugs>	 (CR) Vgutierrez: [C:-2] "to be merged on 2025-03-04" [puppet] - https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[17:10:52] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [debs/helmfile] - https://gerrit.wikimedia.org/r/1124151 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[17:11:50] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.139.0" for 204 host(s)
[17:13:42] <wikibugs>	 (CR) Ssingh: [C:+1] site,hiera: Reimage lvs5006 as liberica [puppet] - https://gerrit.wikimedia.org/r/1124150 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[17:16:43] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.139.0" completed for 204 hosts
[17:17:57] <wikibugs>	 (CR) Krinkle: Add config needed to re-architecture mainstash away from x2 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1123447 (https://phabricator.wikimedia.org/T383327) (owner: Ladsgroup)
[17:19:09] <wikibugs>	 (CR) JMeybohm: [C:+2] Use goccy/go-yaml instead of gopkg.in/yaml.v2 by default [debs/helmfile] - https://gerrit.wikimedia.org/r/1124151 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[17:21:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:23:48] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:28:54] <wikibugs>	 (PS1) Elukey: knative-serving: add a default value for config-observability [deployment-charts] - https://gerrit.wikimedia.org/r/1124155 (https://phabricator.wikimedia.org/T387580)
[17:29:35] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:31:24] <wikibugs>	 (CR) Klausman: [C:+1] knative-serving: add a default value for config-observability [deployment-charts] - https://gerrit.wikimedia.org/r/1124155 (https://phabricator.wikimedia.org/T387580) (owner: Elukey)
[17:34:10] <wikibugs>	 (CR) Elukey: [C:+2] knative-serving: add a default value for config-observability [deployment-charts] - https://gerrit.wikimedia.org/r/1124155 (https://phabricator.wikimedia.org/T387580) (owner: Elukey)
[17:40:35] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[17:42:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[17:42:23] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:44:35] <jinxer-wm>	 RESOLVED: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[17:48:25] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387769 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[17:48:32] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387769 (ops-monitoring-bot) NEW
[17:50:44] <icinga-wm>	 PROBLEM - Host maps2009 is DOWN: PING CRITICAL - Packet loss = 100%
[17:51:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:54:14] <wikibugs>	 (PS2) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[17:54:24] <vgutierrez>	 maps2009 just went down?
[17:54:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:54:37] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[17:54:53] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[17:54:56] <moritzm>	 vgutierrez: Jennifer is fixing things with the mgmt
[17:55:03] <vgutierrez>	 thx 
[17:56:32] <icinga-wm>	 RECOVERY - Host maps2009 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[17:59:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: prometheus-pg-replication-lag.service on maps2006:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:00:04] <jouncebot>	 swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1800).
[18:00:05] <jouncebot>	 ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1800).
[18:00:07] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10597479 (Jhancock.wm) rebooted the server and drained power. on power up, confirmed that mgmt and network ip were pingable.
[18:00:10] <moritzm>	 !log repool maps2009 T387431
[18:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:13] <stashbot>	 T387431: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431
[18:00:33] <swfrench-wmf>	 o/ I'll get started on at least one of my two planned changes shortly
[18:01:00] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10597483 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[18:01:22] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387733#10597499 (Jhancock.wm) →Duplicate dup:T387431
[18:01:24] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387431#10597501 (Jhancock.wm)
[18:01:38] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[18:01:45] <wikibugs>	 (CR) Scott French: "Thanks for the review!" [deployment-charts] - https://gerrit.wikimedia.org/r/1123700 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:01:48] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[18:02:02] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376) (owner: JMeybohm)
[18:02:16] <wikibugs>	 (PS3) Elukey: sre.hosts.provision: add logic to set PXE for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1124110 (https://phabricator.wikimedia.org/T387577)
[18:02:35] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[18:02:46] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[18:03:05] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[18:03:32] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10597510 (Jhancock.wm) i agree. I'll get those backplanes replaced and we can try that. (honestly, been trying to figure how to do it since they're behind a lot of other parts)...
[18:07:34] <wikibugs>	 (CR) Scott French: "Thank you both!" [deployment-charts] - https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:07:37] <wikibugs>	 (CR) Scott French: [C:+2] mw-(api-ext|web): scale next to 40% of main [deployment-charts] - https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:09:22] <wikibugs>	 (Merged) jenkins-bot: mw-(api-ext|web): scale next to 40% of main [deployment-charts] - https://gerrit.wikimedia.org/r/1123699 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:12:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:12:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:13:19] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:13:31] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:14:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73993 and previous config saved to /var/cache/conftool/dbconfig/20250303-181451-root.json
[18:15:17] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:15:33] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:16:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:16:42] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:17:23] <swfrench-wmf>	 !log scaled mw-(api-ext|web) next deployments to 40% of main size - T383845
[18:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:25] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:21:16] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:21:58] <wikibugs>	 (Merged) jenkins-bot: Enroll 100% of client sessions in PHP 8.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1123694 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:22:18] <logmsgbot>	 !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1123694|Enroll 100% of client sessions in PHP 8.1 (T383845)]]
[18:24:56] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1123694|Enroll 100% of client sessions in PHP 8.1 (T383845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:24:58] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:25:35] <wikibugs>	 (CR) JMeybohm: [C:+2] Respect kubeVersion constraints in charts and admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376) (owner: JMeybohm)
[18:26:28] <logmsgbot>	 !log swfrench@deploy2002 swfrench: Continuing with sync
[18:28:11] <wikibugs>	 (PS1) DCausse: cirrus-streaming-updater: scale up consumer-cloudelastic [deployment-charts] - https://gerrit.wikimedia.org/r/1124166 (https://phabricator.wikimedia.org/T386935)
[18:29:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73994 and previous config saved to /var/cache/conftool/dbconfig/20250303-182956-root.json
[18:30:26] <swfrench-wmf>	 jouncebot: nowandnext
[18:30:26] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T1800)
[18:30:27] <jouncebot>	 In 2 hour(s) and 29 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T2100)
[18:33:22] <logmsgbot>	 !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123694|Enroll 100% of client sessions in PHP 8.1 (T383845)]] (duration: 11m 03s)
[18:33:25] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:34:43] <wikibugs>	 (CR) Ebernhardson: [C:+1] cirrus-streaming-updater: scale up consumer-cloudelastic [deployment-charts] - https://gerrit.wikimedia.org/r/1124166 (https://phabricator.wikimedia.org/T386935) (owner: DCausse)
[18:37:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73995 and previous config saved to /var/cache/conftool/dbconfig/20250303-183721-root.json
[18:38:25] <wikibugs>	 (PS1) CDobbins: geo-maps: update South America DCs [dns] - https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774)
[18:43:09] <wikibugs>	 (CR) DCausse: [C:+2] cirrus-streaming-updater: scale up consumer-cloudelastic [deployment-charts] - https://gerrit.wikimedia.org/r/1124166 (https://phabricator.wikimedia.org/T386935) (owner: DCausse)
[18:43:36] <wikibugs>	 (Merged) jenkins-bot: Respect kubeVersion constraints in charts and admin_ng [deployment-charts] - https://gerrit.wikimedia.org/r/1123006 (https://phabricator.wikimedia.org/T387376) (owner: JMeybohm)
[18:43:51] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:43:54] <wikibugs>	 (CR) Ssingh: geo-maps: update South America DCs (1 comment) [dns] - https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) (owner: CDobbins)
[18:43:58] <wikibugs>	 (PS3) JMeybohm: Update validating-admission-policies for k8s >=1.30 [deployment-charts] - https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984)
[18:44:01] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:44:08] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[18:44:12] <wikibugs>	 (CR) JMeybohm: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[18:44:13] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:45:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73996 and previous config saved to /var/cache/conftool/dbconfig/20250303-184501-root.json
[18:49:32] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for chuckonwumelu - https://phabricator.wikimedia.org/T387627#10597817 (Aklapper) Resolved→Open Reopening as @Chuckonwumelu did not get added to https://phabricator.wikimedia.org/project/members/61/ per steps on https://wikitech.wikimedia.org/wi...
[18:49:32] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to  ldap/wmf  for DHardy-WMF - https://phabricator.wikimedia.org/T387157#10597815 (Aklapper) Resolved→Open Reopening as @Dillon did not get added to https://phabricator.wikimedia.org/project/members/61/ per steps on https://wikitech.wikimedia.org/wiki/SRE/C...
[18:50:35] <wikibugs>	 (PS16) Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945)
[18:51:01] <wikibugs>	 (CR) Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[18:51:18] <wikibugs>	 (CR) Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[18:51:30] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[18:52:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73997 and previous config saved to /var/cache/conftool/dbconfig/20250303-185227-root.json
[18:53:43] <wikibugs>	 (CR) Scott French: [C:+2] mw-api-int: serve 10% of traffic on PHP 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1123700 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:55:23] <dcausse>	 helm-lint jenkins job seems stuck and is blocking gate-and-submit for deployment-charts (https://integration.wikimedia.org/ci/job/helm-lint/23283/console) :(
[18:57:06] <dcausse>	 swfrench-wmf: yours ran in ~2min but it's now blocked by mine :/
[18:58:57] <swfrench-wmf>	 dcausse: ah, yeah I see this one is taking a while
[18:59:21] <dcausse>	 first time I've seen this job be so slow...
[18:59:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10597860 (phaultfinder)
[19:00:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73998 and previous config saved to /var/cache/conftool/dbconfig/20250303-190007-root.json
[19:01:19] <wikibugs>	 (Merged) jenkins-bot: cirrus-streaming-updater: scale up consumer-cloudelastic [deployment-charts] - https://gerrit.wikimedia.org/r/1124166 (https://phabricator.wikimedia.org/T386935) (owner: DCausse)
[19:01:21] <wikibugs>	 (Merged) jenkins-bot: mw-api-int: serve 10% of traffic on PHP 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1123700 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[19:01:56] <swfrench-wmf>	 interesting ... well, those eventually merged :)
[19:01:58] <dancy>	 dcausse: I see another recent complaint about that in #-releng
[19:02:18] <dcausse>	 dancy: oh, thanks, good to know
[19:02:52] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:03:04] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:03:20] <swfrench-wmf>	 FYI, going slightly over on the UTC-late infra window today - ETA 5-10m
[19:05:02] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[19:05:03] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[19:05:19] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[19:05:35] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[19:05:59] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[19:06:12] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[19:07:02] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[19:07:15] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[19:07:30] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[19:07:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P73999 and previous config saved to /var/cache/conftool/dbconfig/20250303-190732-root.json
[19:07:42] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[19:08:23] <swfrench-wmf>	 !log serving 10% of mw-api-int traffic on PHP 8.1 - T383845
[19:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:25] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[19:08:59] <swfrench-wmf>	 alright, unless anything goes sideways in the interim, I believe I am done for now
[19:15:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1247 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74000 and previous config saved to /var/cache/conftool/dbconfig/20250303-191513-root.json
[19:18:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: Gergő Tisza)
[19:21:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:22:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74001 and previous config saved to /var/cache/conftool/dbconfig/20250303-192237-root.json
[19:23:51] <wikibugs>	 ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10597949 (Jclark-ctr) @BTullis if you get a chance can you update the operations/puppet repo - this should include updates to preseed.yaml, and site.pp with role...
[19:24:14] <wikibugs>	 ops-eqiad, SRE, Data-Platform-SRE, DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10597952 (Jclark-ctr) a:Jclark-ctr→BTullis
[19:26:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: reforge1005*,relforge1006*,relforge1007* for ban hosts prior to revert  - bking@cumin2002 - T387176
[19:26:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: reforge1005*,relforge1006*,relforge1007* for ban hosts prior to revert  - bking@cumin2002 - T387176
[19:26:05] <stashbot>	 T387176: Investigate eqiad Elastic cluster latency - https://phabricator.wikimedia.org/T387176
[19:26:55] <wikibugs>	 (PS1) Daimona Eaytoy: Enable $wgCampaignEventsSeparateOngoingEvents by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427)
[19:30:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74002 and previous config saved to /var/cache/conftool/dbconfig/20250303-193038-root.json
[19:30:52] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124172 (https://phabricator.wikimedia.org/T386427) (owner: Daimona Eaytoy)
[19:31:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74003 and previous config saved to /var/cache/conftool/dbconfig/20250303-193136-root.json
[19:37:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2155 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74004 and previous config saved to /var/cache/conftool/dbconfig/20250303-193742-root.json
[19:42:42] <wikibugs>	 (CR) Scott French: php: Allow provisioning MediaWiki with PHP 8.1 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: BryanDavis)
[19:45:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74005 and previous config saved to /var/cache/conftool/dbconfig/20250303-194543-root.json
[19:46:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74006 and previous config saved to /var/cache/conftool/dbconfig/20250303-194641-root.json
[19:47:51] <dancy>	 dcausse: I filed T387781
[19:47:52] <stashbot>	 T387781: Several recent slow (>15 minute) helm-lint job runs - https://phabricator.wikimedia.org/T387781
[19:51:08] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786 (WRai-WMF) NEW
[19:51:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:55:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786#10598073 (WRai-WMF)
[19:56:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10598078 (VRiley-WMF)
[19:57:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786#10598079 (WRai-WMF)
[20:00:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74007 and previous config saved to /var/cache/conftool/dbconfig/20250303-200048-root.json
[20:01:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74008 and previous config saved to /var/cache/conftool/dbconfig/20250303-200146-root.json
[20:02:48] <wikibugs>	 (PS1) Bking: relforge/elastic: repurpose relforge hosts as elastic [puppet] - https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782)
[20:03:07] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: Bking)
[20:06:46] <wikibugs>	 (PS2) Bking: relforge/elastic: repurpose relforge hosts as elastic [puppet] - https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782)
[20:09:10] <wikibugs>	 (PS3) Bking: relforge/elastic: repurpose relforge hosts as elastic [puppet] - https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782)
[20:09:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10598104 (phaultfinder)
[20:13:53] <wikibugs>	 (PS2) CDobbins: geo-maps: update South America DCs [dns] - https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774)
[20:15:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74009 and previous config saved to /var/cache/conftool/dbconfig/20250303-201554-root.json
[20:16:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74010 and previous config saved to /var/cache/conftool/dbconfig/20250303-201652-root.json
[20:17:36] <wikibugs>	 (PS3) CDobbins: geo-maps: update South America DCs [dns] - https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774)
[20:18:25] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387787 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[20:18:30] <wikibugs>	 SRE, MW-on-K8s, serviceops, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10598113 (dancy) >>! In T288629#10582102, @JMeybohm wrote: > I stumbled upon this again recently and I think the current con...
[20:18:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387787 (ops-monitoring-bot) NEW
[20:19:12] <wikibugs>	 (CR) CDobbins: geo-maps: update South America DCs (1 comment) [dns] - https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) (owner: CDobbins)
[20:20:03] <wikibugs>	 SRE, MW-on-K8s, serviceops, Release-Engineering-Team (Priority Backlog 📥): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629#10598121 (dancy) Open→Resolved a:dancy ` After the build process creates the restricted mediawiki-multiversi...
[20:21:06] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10598124 (Dzahn) I can stop renaming tickets, no problem. The alternative to duplicate tasks would be a single task for unrelated hosts though. And that seemed worse to me than closing duplicates, ftr.
[20:22:36] <wikibugs>	 (CR) Ebernhardson: [C:+1] "seems reasonable, might want to double check how elastic handles shrinking the number of masters, but worst case we can nuke relforge (the" [puppet] - https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: Bking)
[20:26:06] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - maps2009 - https://phabricator.wikimedia.org/T387597#10598127 (Dzahn) Or maybe I'm wrong and it doesn't create a single task anymore if it happens for different hosts.  No worries either way, I will leave it to dcops how they prefer to handle those.
[20:27:19] <wikibugs>	 (PS1) CDobbins: geo-maps: update South America DCs (part 2) [dns] - https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774)
[20:28:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387787#10598133 (Pppery) →Duplicate dup:T382984
[20:29:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598135 (Pppery)
[20:29:09] <wikibugs>	 (PS1) Herron: aux-k8s codfw: enable worker ingress [puppet] - https://gerrit.wikimedia.org/r/1124179 (https://phabricator.wikimedia.org/T381417)
[20:29:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387769#10598137 (Pppery)
[20:29:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598140 (Pppery) →Duplicate dup:T387769
[20:29:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598142 (Pppery) Duplicate→Open
[20:29:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387769#10598144 (Pppery) →Duplicate dup:T382984
[20:29:34] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598146 (Pppery)
[20:29:45] <wikibugs>	 (PS7) Herron: aux-k8s-worker: deploy role to codfw workers [puppet] - https://gerrit.wikimedia.org/r/1123434 (https://phabricator.wikimedia.org/T381417)
[20:31:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74011 and previous config saved to /var/cache/conftool/dbconfig/20250303-203100-root.json
[20:31:03] <wikibugs>	 (CR) Bking: "Good call, we need to remove the nodes as masters before removing them from the cluster. Since this is relforge I'll go ahead and one-off " [puppet] - https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: Bking)
[20:31:31] <wikibugs>	 (PS1) Herron: aux-k8s codfw: enable worker ingress [puppet] - https://gerrit.wikimedia.org/r/1124179 (https://phabricator.wikimedia.org/T381417)
[20:31:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1203 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74012 and previous config saved to /var/cache/conftool/dbconfig/20250303-203158-root.json
[20:32:13] <wikibugs>	 (PS4) Bernard Wang: Update experiment name for Search AB test french wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400)
[20:32:37] <wikibugs>	 (PS3) Bernard Wang: Deploy Search AB test to everywhere but English wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1122684 (https://phabricator.wikimedia.org/T386849)
[20:40:15] <wikibugs>	 (PS17) Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945)
[20:43:27] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[20:46:25] <wikibugs>	 (PS18) Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945)
[20:46:34] <wikibugs>	 (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[20:48:25] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123807 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[20:48:25] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387788 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[20:48:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387788 (ops-monitoring-bot) NEW
[20:51:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10598216 (VRiley-WMF) an-worker1178  A2 U27 CableID 3891 Port 26  an-worker1179 B7 U1 CableID 4884 Port 16  an-workwr1180  C7 U1 CableID  5100 Port 42  an-worker1181 E1 U7 CableID 230304500065...
[20:52:18] <jinxer-wm>	 FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported
[20:57:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387788#10598234 (Pppery) →Duplicate dup:T382984
[20:57:28] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598236 (Pppery)
[20:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:59:14] <wikibugs>	 (PS3) Gergő Tisza: Update CentralAuth multi-DC rules for SUL3, attempt 2 [puppet] - https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695)
[20:59:57] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10598241 (VRiley-WMF) a:VRiley-WMF
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T2100).
[21:00:05] <jouncebot>	 bwang, MichaelG_WMF, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:12] <MichaelG_WMF>	 o/
[21:00:21] <bwang>	 Hello!
[21:00:25] <MatmaRex>	 hi
[21:02:07] <bwang>	 Who is deploying?
[21:02:54] <MichaelG_WMF>	 I'd hope one of RoanKattouw, cjming, TheresNoTime, or kindrobot
[21:08:26] <MichaelG_WMF>	 I've asked in slack
[21:10:16] <wikibugs>	 (Abandoned) Fabfur: systemd: added option to remain after exit [puppet] - https://gerrit.wikimedia.org/r/1112193 (https://phabricator.wikimedia.org/T383976) (owner: Fabfur)
[21:10:25] <wikibugs>	 (Abandoned) Fabfur: puppet: split puppet timer for calendar and startup run options [puppet] - https://gerrit.wikimedia.org/r/1124102 (https://phabricator.wikimedia.org/T383976) (owner: Fabfur)
[21:10:45] <tgr_>	 I can deploy
[21:10:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm
[21:10:53] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598277 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm
[21:13:26] <bwang>	 Ok great thank you @tgr_ !
[21:13:47] <tgr_>	 I'll start with the config patches
[21:13:56] <tgr_>	 MatmaRex: they can go in one, right?
[21:14:17] <bwang>	 Ah, my config patch has a dependent patch
[21:14:32] <tgr_>	 yeah, I'll do that afterwards
[21:14:44] <bwang>	 Ok
[21:14:47] <MatmaRex>	 tgr_: yeah
[21:14:49] <tgr_>	 scap was smart enough to notice
[21:15:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122683 (owner: Bartosz Dziewoński)
[21:15:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123766 (owner: Bartosz Dziewoński)
[21:15:31] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: Gergő Tisza)
[21:16:19] <wikibugs>	 (Merged) jenkins-bot: Remove unused config variable $wgJsonConfigInterwikiPrefix [mediawiki-config] - https://gerrit.wikimedia.org/r/1122683 (owner: Bartosz Dziewoński)
[21:16:21] <wikibugs>	 (Merged) jenkins-bot: Fix inconsistent definitions for $wmgLocalServices['chart-renderer'] [mediawiki-config] - https://gerrit.wikimedia.org/r/1123766 (owner: Bartosz Dziewoński)
[21:16:23] <wikibugs>	 (Merged) jenkins-bot: Set $wgCentralAuthSharedDomainCallback [mediawiki-config] - https://gerrit.wikimedia.org/r/1123776 (https://phabricator.wikimedia.org/T387357) (owner: Gergő Tisza)
[21:16:40] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1122683|Remove unused config variable $wgJsonConfigInterwikiPrefix]], [[gerrit:1123766|Fix inconsistent definitions for $wmgLocalServices['chart-renderer']]], [[gerrit:1123776|Set $wgCentralAuthSharedDomainCallback (T387357)]]
[21:16:43] <stashbot>	 T387357: SUL3 signup results in autocreation on all edge login domains - https://phabricator.wikimedia.org/T387357
[21:19:15] <logmsgbot>	 !log tgr@deploy2002 matmarex, tgr: Backport for [[gerrit:1122683|Remove unused config variable $wgJsonConfigInterwikiPrefix]], [[gerrit:1123766|Fix inconsistent definitions for $wmgLocalServices['chart-renderer']]], [[gerrit:1123776|Set $wgCentralAuthSharedDomainCallback (T387357)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:19:45] <logmsgbot>	 !log tgr@deploy2002 matmarex, tgr: Continuing with sync
[21:19:54] <wikibugs>	 (PS4) Bking: relforge/elastic: repurpose relforge hosts as elastic [puppet] - https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782)
[21:21:04] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[21:21:18] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[21:21:32] <wikibugs>	 (CR) Gergő Tisza: [C:+2] Use session storage for session tick events [extensions/WikimediaEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123734 (https://phabricator.wikimedia.org/T387400) (owner: Bernard Wang)
[21:22:55] <wikibugs>	 (CR) Bking: [C:+2] relforge/elastic: repurpose relforge hosts as elastic [puppet] - https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: Bking)
[21:23:31] <bwang>	 Let me know when I can test!
[21:23:39] <wikibugs>	 (CR) Bking: [C:+2] "I made a small change after the +1 to remove relforge1004 as a master-eligible, since it wasn't set up that way in the first place." [puppet] - https://gerrit.wikimedia.org/r/1124176 (https://phabricator.wikimedia.org/T387782) (owner: Bking)
[21:23:48] <jinxer-wm>	 FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:24:09] <tgr_>	 bwang: do you want to test separately, or the two patches together?
[21:25:01] <bwang>	 Hm I think either works, as long as the config is synced last
[21:26:46] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122683|Remove unused config variable $wgJsonConfigInterwikiPrefix]], [[gerrit:1123766|Fix inconsistent definitions for $wmgLocalServices['chart-renderer']]], [[gerrit:1123776|Set $wgCentralAuthSharedDomainCallback (T387357)]] (duration: 10m 06s)
[21:26:49] <stashbot>	 T387357: SUL3 signup results in autocreation on all edge login domains - https://phabricator.wikimedia.org/T387357
[21:26:59] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[21:27:15] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[21:27:22] <tgr_>	 I'd just sync the two together then, saves some time
[21:28:44] <bwang>	 Ok
[21:28:52] <wikibugs>	 (CR) Gergő Tisza: [C:+1] docroot: Enable Chrome credential sharing on all open SUL wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1123810 (https://phabricator.wikimedia.org/T385520) (owner: Krinkle)
[21:29:04] <wikibugs>	 (CR) Kamila Součková: [C:+1] Update validating-admission-policies for k8s >=1.30 [deployment-charts] - https://gerrit.wikimedia.org/r/1120628 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[21:29:07] <wikibugs>	 (Merged) jenkins-bot: Use session storage for session tick events [extensions/WikimediaEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123734 (https://phabricator.wikimedia.org/T387400) (owner: Bernard Wang)
[21:29:42] <bwang>	 what server should I test on
[21:29:53] <tgr_>	 just a sec
[21:30:25] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[21:30:46] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[21:30:57] <logmsgbot>	 !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[21:32:25] <tgr_>	 Change '1123448', project 'mediawiki/extensions/WikimediaEvents', branch 'master' not found in any deployed wikiversion. Deployed wikiversions: ['1.44.0-wmf.18']
[21:32:51] <tgr_>	 I guess this is just scap being confused about change IDs being shared across branches?
[21:32:57] <tgr_>	 1123448 is the master patch
[21:33:36] <tgr_>	 let's see what happens if I override that warning
[21:33:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) (owner: Bernard Wang)
[21:34:14] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker1181 - vriley@cumin1002"
[21:34:31] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker1181 - vriley@cumin1002"
[21:34:31] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:34:37] <wikibugs>	 (Merged) jenkins-bot: Update experiment name for Search AB test french wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) (owner: Bernard Wang)
[21:34:55] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1123734|Use session storage for session tick events (T387400)]], [[gerrit:1123449|Update experiment name for Search AB test french wiki (T387400)]]
[21:34:58] <stashbot>	 T387400: SessionTick instrument should use sessionStorage instead of localStorage - https://phabricator.wikimedia.org/T387400
[21:35:26] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1181
[21:35:36] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1181
[21:35:44] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1181
[21:35:51] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1181
[21:36:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1005 to elastic1108
[21:37:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:37:34] <logmsgbot>	 !log tgr@deploy2002 bwang, tgr: Backport for [[gerrit:1123734|Use session storage for session tick events (T387400)]], [[gerrit:1123449|Update experiment name for Search AB test french wiki (T387400)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:38:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1006 to elastic1109
[21:39:03] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:39:13] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:39:18] <tgr_>	 bwang: you can use any of the WikimediaDebug server options. I think the standard one to use is k8s-mwdebug.
[21:41:40] <bwang>	 Ok, I'm testing now
[21:41:50] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:41:56] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:42:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:42:37] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:42:45] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:43:28] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:43:37] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from relforge1005 to elastic1108
[21:44:31] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[21:44:39] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from relforge1006 to elastic1109
[21:44:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1005 to elastic1108
[21:45:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:45:59] <MichaelG_WMF>	 @tgr_ Thanks for running the deployment window, but I don't think it makes sense to start with my change anymore, unless we basically want to almost completely take over the Weekly Security deployment window
[21:46:08] <MichaelG_WMF>	 jouncebot: next
[21:46:09] <jouncebot>	 In 0 hour(s) and 13 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T2200)
[21:46:47] <tgr_>	 we'll run into it by 10 min or so, should be fine
[21:47:58] <MichaelG_WMF>	 you sure? That change touches i18n files. In the past they took a long time to sync 
[21:48:09] <MichaelG_WMF>	 but maybe that improved with k8s?
[21:48:28] <tgr_>	 I think scap just syncs everything all the time these days
[21:49:01] <MichaelG_WMF>	 Alright, I'm here for it. Let's try it out when you're ready
[21:50:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1005 to elastic1108 - bking@cumin2002"
[21:51:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1005 to elastic1108 - bking@cumin2002"
[21:51:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:51:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1108
[21:51:41] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1108
[21:51:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:52:21] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from relforge1005 to elastic1108
[21:53:48] <bwang>	 Ok things are good! Tested
[21:53:56] <logmsgbot>	 !log tgr@deploy2002 bwang, tgr: Continuing with sync
[21:54:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1108.eqiad.wmnet with OS bullseye
[21:54:34] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2013.codfw.wmnet with OS bookworm
[21:54:40] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598402 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm executed with errors: - backu...
[21:55:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1006 to elastic1109
[21:56:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:56:16] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:56:24] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:57:08] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 168 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 247, active_shards: 327, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 164, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.06060606060606 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:57:08] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on relforge1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 168 threshold =0.15 breach: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 247, active_shards: 327, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 164, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.06060606060606 https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:59:00] <inflatador>	 ^^ expected
[21:59:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1006 to elastic1109 - bking@cumin2002"
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250303T2200). Please do the needful.
[22:00:13] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on relforge[1003-1004,1006-1007].eqiad.wmnet with reason: T387782
[22:00:16] <stashbot>	 T387782: Repurpose relforge hosts back to Elastic  - https://phabricator.wikimedia.org/T387782
[22:00:41] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1006 to elastic1109 - bking@cumin2002"
[22:00:41] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:00:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1109
[22:00:49] <tgr_>	 we are running over with the backports a bit
[22:00:55] <tgr_>	 let me know if that's a problem
[22:00:58] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1109
[22:01:00] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123734|Use session storage for session tick events (T387400)]], [[gerrit:1123449|Update experiment name for Search AB test french wiki (T387400)]] (duration: 26m 04s)
[22:01:03] <stashbot>	 T387400: SessionTick instrument should use sessionStorage instead of localStorage - https://phabricator.wikimedia.org/T387400
[22:01:36] <tgr_>	 Reedy, sbassett, Maryum, manfredi: are you using the window? if not, we have one more backport to go
[22:01:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from relforge1006 to elastic1109
[22:02:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1109.eqiad.wmnet with OS bullseye
[22:04:18] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[22:04:38] <wikibugs>	 (PS1) Aaron Schulz: Update Docker images of change-prop services to ones using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1124191 (https://phabricator.wikimedia.org/T381588)
[22:05:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm
[22:06:05] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598441 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm
[22:06:27] <tgr_>	 I'll take that as a no
[22:06:49] <MichaelG_WMF>	 👍
[22:06:50] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:07:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1124098 (https://phabricator.wikimedia.org/T387160) (owner: Michael Große)
[22:07:10] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1181
[22:07:18] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1181
[22:07:52] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.dns.netbox
[22:08:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from relforge1007 to elastic1110
[22:08:27] <wikibugs>	 (PS6) Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711
[22:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:11:45] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker1181 - vriley@cumin1002"
[22:12:03] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt  an-worker1181 - vriley@cumin1002"
[22:12:04] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:12:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[22:13:12] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1181
[22:13:20] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1181
[22:15:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:15:39] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:18:15] <wikibugs>	 (PS2) CDobbins: geo-maps: update South America DCs (part 2) [dns] - https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774)
[22:19:27] <wikibugs>	 (Merged) jenkins-bot: feat(Surfacing): Add Change Tag for surfaced Add a Link [extensions/GrowthExperiments] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1124098 (https://phabricator.wikimedia.org/T387160) (owner: Michael Große)
[22:19:43] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124098|feat(Surfacing): Add Change Tag for surfaced Add a Link (T387160)]]
[22:19:46] <stashbot>	 T387160: Surfacing "Add a link" Structured Tasks: Edit Tag - https://phabricator.wikimedia.org/T387160
[22:20:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1007 to elastic1110 - bking@cumin2002"
[22:20:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming relforge1007 to elastic1110 - bking@cumin2002"
[22:20:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:20:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host elastic1110
[22:20:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host elastic1110
[22:20:59] <wikibugs>	 (CR) Ryan Kemper: wdqs: create new ui for wdqs legacy full (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:21:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from relforge1007 to elastic1110
[22:21:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:23:54] <wikibugs>	 (PS3) Ryan Kemper: wdqs: create new ui for wdqs legacy full [deployment-charts] - https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422)
[22:24:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1110.eqiad.wmnet with OS bullseye
[22:24:25] <wikibugs>	 (PS4) CDobbins: geo-maps: update South America DCs [dns] - https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774)
[22:26:53] <wikibugs>	 (CR) Ryan Kemper: wdqs: add routing for legacy full graph host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:27:28] <ryankemper>	 mutante: just a heads-up, will be merging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1122678 to create a new `wdqs-legacy-full` ui in k8s miscweb
[22:28:21] <icinga-wm>	 PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - free space: /srv 9009 MB (3% inode=69%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops
[22:30:00] <mutante>	 ryankemper: I don't know much about that but Jelto saying "mostly good" before gives me some level of confidence ;)
[22:30:41] <mutante>	 that being said, the disk space on deployment host thing is a bit concerning
[22:31:16] <wikibugs>	 (CR) Aleksandar Mastilovic: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1124042 (https://phabricator.wikimedia.org/T387700) (owner: Brouberol)
[22:31:31] <mutante>	 let's see if there is something easy to do about that.. probably not .. but checking
[22:32:21] <mutante>	 yea, probably needs releng to delete old mw versions 
[22:32:51] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[22:32:53] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[22:33:24] <ryankemper>	 oh just saw the disk space thing
[22:34:28] <wikibugs>	 (PS4) Ryan Kemper: wdqs: create new ui for wdqs legacy full [deployment-charts] - https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422)
[22:35:12] <logmsgbot>	 !log tgr@deploy2002 migr, tgr: Backport for [[gerrit:1124098|feat(Surfacing): Add Change Tag for surfaced Add a Link (T387160)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:35:15] <stashbot>	 T387160: Surfacing "Add a link" Structured Tasks: Edit Tag - https://phabricator.wikimedia.org/T387160
[22:35:16] <mutante>	 169G docker, 29G mediawiki-staging, 57G deployment. I would fix stuff if it was on /, but on /srv/ under actual deployment dirs I'd rather leave it alone
[22:35:26] <tgr_>	 MichaelG_WMF: ^
[22:35:43] <MichaelG_WMF>	 thanks, I'm testing
[22:36:02] <wikibugs>	 (PS1) CDobbins: geo-maps: update South America DCs [dns] - https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774)
[22:36:09] <mutante>	 3% means 8.8G. I don't know if it's an issue right this moment.
[22:36:21] <mutante>	 can make a ticket either way
[22:36:58] <wikibugs>	 (Abandoned) CDobbins: geo-maps: update South America DCs [dns] - https://gerrit.wikimedia.org/r/1124169 (https://phabricator.wikimedia.org/T387774) (owner: CDobbins)
[22:39:01] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
[22:39:12] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
[22:40:34] <MichaelG_WMF>	 tgr_: it worked, I can see the new tag on https://test.wikipedia.org/w/index.php?title=The_Power_of_the_Dog_(film)&action=history
[22:40:50] <wikibugs>	 SRE, Release-Engineering-Team: deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796 (Dzahn) NEW
[22:41:10] <logmsgbot>	 !log tgr@deploy2002 migr, tgr: Continuing with sync
[22:41:50] <tgr_>	 the time estimate was a bit off, but apparently no harm done :)
[22:42:32] <wikibugs>	 (PS3) CDobbins: geo-maps: update South America DCs (part 2) [dns] - https://gerrit.wikimedia.org/r/1124178 (https://phabricator.wikimedia.org/T387774)
[22:42:38] <wikibugs>	 (PS2) Ryan Kemper: wdqs: Create DNS entry for one full graph host [dns] - https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422)
[22:42:57] <wikibugs>	 (PS2) CDobbins: geo-maps: update South America DCs (part 1/2) [dns] - https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774)
[22:43:04] <wikibugs>	 SRE, Release-Engineering-Team: deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598664 (Dzahn) Also the part that if the only notification is an IRC line on the -operations channel it is not easily noticed anymore nowadays.  Might it be better if that was an email...
[22:43:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123810 (https://phabricator.wikimedia.org/T385520) (owner: Krinkle)
[22:43:56] <wikibugs>	 (CR) Ryan Kemper: "Test deployment to codfw looks good; going to merge and deploy to eqiad now" [deployment-charts] - https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:43:58] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: create new ui for wdqs legacy full [deployment-charts] - https://gerrit.wikimedia.org/r/1122678 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:44:37] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, March 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [skins/Vector] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123713 (https://phabricator.wikimedia.org/T358910) (owner: Jdlrobson)
[22:45:31] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
[22:46:10] <logmsgbot>	 !log ryankemper@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
[22:46:19] <ryankemper>	 !log T384422 k8s deployment of `wikidata-query-legacy-full-gui` release in codfw looks fine, proceeding to eqiad
[22:46:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:22] <stashbot>	 T384422: Provide a low availability / scalability full graph endpoint to ease the transition to a split graph - https://phabricator.wikimedia.org/T384422
[22:46:53] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: Create DNS entry for one full graph host [dns] - https://gerrit.wikimedia.org/r/1122676 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:47:31] <ryankemper>	 !log T384422 Merging DNS patch now https://gerrit.wikimedia.org/r/c/operations/dns/+/1122676
[22:47:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:40] <logmsgbot>	 !log ryankemper@dns1004 START - running authdns-update
[22:49:47] <logmsgbot>	 !log ryankemper@dns1004 END - running authdns-update
[22:51:12] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124098|feat(Surfacing): Add Change Tag for surfaced Add a Link (T387160)]] (duration: 31m 28s)
[22:51:15] <stashbot>	 T387160: Surfacing "Add a link" Structured Tasks: Edit Tag - https://phabricator.wikimedia.org/T387160
[22:51:55] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[22:52:15] <tgr_>	 !log late UTC deploys done
[22:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:33] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1108.eqiad.wmnet with OS bullseye
[22:54:58] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: add routing for legacy full graph host [puppet] - https://gerrit.wikimedia.org/r/1121726 (https://phabricator.wikimedia.org/T384422) (owner: Ryan Kemper)
[22:56:38] <ryankemper>	 !log T384422 Deploying backend.yaml routing patch; after it's deployed we should theoretically be able to see a UI at https://query-legacy-full.wikidata.org/
[22:56:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:40] <stashbot>	 T384422: Provide a low availability / scalability full graph endpoint to ease the transition to a split graph - https://phabricator.wikimedia.org/T384422
[23:01:03] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1109.eqiad.wmnet with OS bullseye
[23:01:05] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp5021 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:02:05] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance backend on cp5021 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:02:22] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:02:34] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:06:56] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10598755 (VRiley-WMF)
[23:08:17] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10598761 (VRiley-WMF) @BTullis is there a specific RAID that is supposed to be placed onto these servers?
[23:13:01] <wikibugs>	 (PS1) Scott French: php8.1: Default display_startup_errors to "stderr" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1124194 (https://phabricator.wikimedia.org/T377038)
[23:16:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1108.eqiad.wmnet with OS bullseye
[23:17:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1109.eqiad.wmnet with OS bullseye
[23:20:10] <wikibugs>	 (PS1) Ryan Kemper: wdqs: create query-legacy-full.wikidata.org [dns] - https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422)
[23:21:25] <wikibugs>	 (PS2) Ryan Kemper: wdqs: create query-legacy-full.wikidata.org [dns] - https://gerrit.wikimedia.org/r/1124197 (https://phabricator.wikimedia.org/T384422)
[23:22:30] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1110.eqiad.wmnet with OS bullseye
[23:23:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1110.eqiad.wmnet with OS bullseye
[23:23:36] <wikibugs>	 (PS1) Fabfur: systemd: add path unit type [puppet] - https://gerrit.wikimedia.org/r/1124198 (https://phabricator.wikimedia.org/T387799)
[23:24:15] <wikibugs>	 (Abandoned) Fabfur: systemd: add path unit type [puppet] - https://gerrit.wikimedia.org/r/1124198 (https://phabricator.wikimedia.org/T387799) (owner: Fabfur)
[23:24:50] <wikibugs>	 SRE, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598803 (thcipriani) > Somehow needs cleaning up but since it's not OS but actual deployment data the question is what can be deleted. Probably old mw versions..  Old MW versions...
[23:25:28] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2014.codfw.wmnet with OS bookworm
[23:25:29] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[23:25:34] <wikibugs>	 (PS1) Fabfur: systemd: add path unit type [puppet] - https://gerrit.wikimedia.org/r/1124200 (https://phabricator.wikimedia.org/T387799)
[23:25:36] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598805 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm executed with errors: - backu...
[23:26:47] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10598810 (Jhancock.wm) @Papaul both servers got stuck at the puppet certification part again. When you can, can you see if they are talking to the wrong server? thanks
[23:32:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1108.eqiad.wmnet with reason: host reimage
[23:32:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1109.eqiad.wmnet with reason: host reimage
[23:36:21] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1108.eqiad.wmnet with reason: host reimage
[23:38:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1110.eqiad.wmnet with reason: host reimage
[23:40:16] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1109.eqiad.wmnet with reason: host reimage
[23:41:55] <wikibugs>	 SRE, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598845 (Dzahn) Yea, those are mediawiki-multiversion images.  Some are 8GB.  Example:   ` docker-registry.discovery.wmnet/restricted/mediawiki-multiversion         2025-02-19-01...
[23:42:29] <wikibugs>	 SRE, serviceops-radar, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598848 (Dzahn)
[23:44:11] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1110.eqiad.wmnet with reason: host reimage
[23:45:34] <wikibugs>	 SRE, serviceops-radar, Release-Engineering-Team (Radar): deployment server - low disk space on /srv - https://phabricator.wikimedia.org/T387796#10598850 (Dzahn) p:Triage→High With about 8.8GB space left and those images that can also be about 8.8G and sometimes multiple images per day ...I am...
[23:48:24] <icinga-wm>	 ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387802 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[23:48:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387802 (ops-monitoring-bot) NEW
[23:50:47] <Amir1>	 !log deleted local user_password from labswiki database (T104500 and T161859)
[23:50:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:51] <stashbot>	 T104500: Old versions of sensitive user data (email, password hashes) can remain in database indefinitely due to local and global DB not being kept in sync - https://phabricator.wikimedia.org/T104500
[23:50:51] <stashbot>	 T161859: Make Wikitech an SUL wiki - https://phabricator.wikimedia.org/T161859
[23:51:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:53:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1108.eqiad.wmnet with OS bullseye
[23:56:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1109.eqiad.wmnet with OS bullseye
[23:58:31] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387802#10598895 (Pppery) →Duplicate dup:T382984
[23:58:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10598897 (Pppery)