[00:02:21] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2014.codfw.wmnet with OS bookworm
[00:02:28] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603562 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm executed with errors: - backu...
[00:03:54] <wikibugs>	 (PS1) Daimona Eaytoy: Use namespaced Title and Html classes [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010)
[00:03:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:06:46] <wikibugs>	 (PS2) Daimona Eaytoy: Use namespaced Title and Html classes [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010)
[00:10:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:10:54] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 647.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2013']
[00:11:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:11:50] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup2013']
[00:14:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm
[00:14:10] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603626 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm
[00:19:15] <wikibugs>	 (PS1) Daimona Eaytoy: officewiki: Disable the event-organizer user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1124549 (https://phabricator.wikimedia.org/T387943)
[00:19:42] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124549 (https://phabricator.wikimedia.org/T387943) (owner: Daimona Eaytoy)
[00:20:11] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: Daimona Eaytoy)
[00:24:10] <wikibugs>	 (CR) Zabe: [C:+1] Use namespaced Title and Html classes [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: Daimona Eaytoy)
[00:32:38] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2013.codfw.wmnet with reason: host reimage
[00:34:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3528 MB (3% inode=98%): /tmp 3528 MB (3% inode=98%): /var/tmp 3528 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[00:36:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2013.codfw.wmnet with reason: host reimage
[00:38:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124550
[00:38:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124550 (owner: TrainBranchBot)
[00:42:37] <wikibugs>	 (PS1) Arlolra: Turn on Parsoid Read Views for 42 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505)
[00:43:01] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10603665 (Ladsgroup) I forgot to mention: This will be done as part of {T360589} First, we start serving 250px thumbnails gradually but sized to 220px,...
[00:50:30] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124550 (owner: TrainBranchBot)
[00:55:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:57:21] <wikibugs>	 (CR) Arlolra: "Good point but, yeah, we can do that as part of the exercise of figuring what's left to do" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: Arlolra)
[00:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:59:09] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[01:00:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:00:18] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2013.codfw.wmnet with OS bookworm
[01:00:25] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm completed: - backup2013 (**PA...
[01:01:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:08:34] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124553
[01:08:34] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124553 (owner: TrainBranchBot)
[01:12:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:18:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm
[01:18:33] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm
[01:28:24] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124553 (owner: TrainBranchBot)
[01:36:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2014.codfw.wmnet with reason: host reimage
[01:40:24] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2014.codfw.wmnet with reason: host reimage
[01:51:28] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/94060a3722501301746a3e179221819b7849ebe36f7ec016b239e19d7bf89883/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:51:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:54:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3523 MB (3% inode=98%): /tmp 3523 MB (3% inode=98%): /var/tmp 3523 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[02:00:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:05:27] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10603755 (Jhancock.wm) @Papaul i found a weird little thing. I racked ganeti2049 in B5, U40. There are three other servers in the same set of 4 on the switch. two...
[02:05:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:05:44] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2014.codfw.wmnet with OS bookworm
[02:05:57] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603756 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm completed: - backup2014 (...
[02:05:59] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603757 (Jhancock.wm) Open→Resolved
[02:06:36] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603760 (Jhancock.wm) @jcrespo this is complete
[02:11:28] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:16:26] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 29.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:34:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3594 MB (3% inode=98%): /tmp 3594 MB (3% inode=98%): /var/tmp 3594 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:27:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:29:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10603809 (phaultfinder)
[03:34:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3547 MB (3% inode=98%): /tmp 3547 MB (3% inode=98%): /var/tmp 3547 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[03:59:54] <wikibugs>	 (CR) VolkerE: [C:+1] Use namespaced Title and Html classes [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: Daimona Eaytoy)
[04:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:08] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10603867 (Papaul) @Jhancock.wm thanks for checking. I see in netbox that ganetti2049 is rack in B4 and U41 and not U40 like you mentioned so i am guessing that you...
[04:54:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3489 MB (3% inode=98%): /tmp 3489 MB (3% inode=98%): /var/tmp 3489 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[04:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:59:09] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[05:41:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:11:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:24:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2166 db1226', diff saved to https://phabricator.wikimedia.org/P74066 and previous config saved to /var/cache/conftool/dbconfig/20250305-062402-marostegui.json
[06:24:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2166.codfw.wmnet
[06:24:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1226.eqiad.wmnet
[06:25:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1244 with weight 0 T387816', diff saved to https://phabricator.wikimedia.org/P74067 and previous config saved to /var/cache/conftool/dbconfig/20250305-062554-marostegui.json
[06:25:58] <stashbot>	 T387816: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T387816
[06:26:12] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T387816
[06:26:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1244 from API/vslow/dump T387816', diff saved to https://phabricator.wikimedia.org/P74068 and previous config saved to /var/cache/conftool/dbconfig/20250305-062629-marostegui.json
[06:26:48] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1244 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1124335 (https://phabricator.wikimedia.org/T387816) (owner: Gerrit maintenance bot)
[06:29:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1226.eqiad.wmnet
[06:30:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2166.codfw.wmnet
[06:30:24] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1226.eqiad.wmnet with reason: Index rebuild
[06:30:51] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Index rebuild
[06:30:58] <marostegui>	 !log Starting s4 eqiad failover from db1160 to db1244 - T387816
[06:31:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:01] <stashbot>	 T387816: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T387816
[06:31:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1244 to s4 primary T387816', diff saved to https://phabricator.wikimedia.org/P74069 and previous config saved to /var/cache/conftool/dbconfig/20250305-063124-marostegui.json
[06:32:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1160 T387816', diff saved to https://phabricator.wikimedia.org/P74070 and previous config saved to /var/cache/conftool/dbconfig/20250305-063216-marostegui.json
[06:35:36] <wikibugs>	 (PS1) Marostegui: db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124576
[06:35:58] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1160.eqiad.wmnet
[06:36:02] <wikibugs>	 (CR) Marostegui: [C:+2] db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124576 (owner: Marostegui)
[06:39:43] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1250 [puppet] - https://gerrit.wikimedia.org/r/1124641 (https://phabricator.wikimedia.org/T385141)
[06:40:24] <wikibugs>	 (CR) Marostegui: "Starting to clone this host, will eventually become a master, but not yet. That's why the master lines are critical are commented out." [puppet] - https://gerrit.wikimedia.org/r/1124641 (https://phabricator.wikimedia.org/T385141) (owner: Marostegui)
[06:42:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1160.eqiad.wmnet
[06:45:14] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Rebuilding index
[06:59:43] <wikibugs>	 (PS2) Anzx: sewikimedia: update wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1124547 (https://phabricator.wikimedia.org/T377921)
[06:59:47] <wikibugs>	 (PS2) Anzx: Lift IP cap for edit-a-thon (Illinois Tech) on 2024-03-27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568)
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T0700)
[07:00:11] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568) (owner: Anzx)
[07:00:21] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124547 (https://phabricator.wikimedia.org/T377921) (owner: Anzx)
[07:00:54] <wikibugs>	 SRE, DBA, Datacenter-Switchover: Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default) - https://phabricator.wikimedia.org/T207385#10604058 (Marostegui) Open→Declined No longer...
[07:02:16] <wikibugs>	 (PS1) Marostegui: db1246: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1124650 (https://phabricator.wikimedia.org/T387673)
[07:03:08] <wikibugs>	 (CR) Marostegui: [C:+2] db1246: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1124650 (https://phabricator.wikimedia.org/T387673) (owner: Marostegui)
[07:03:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74071 and previous config saved to /var/cache/conftool/dbconfig/20250305-070321-root.json
[07:03:50] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops, Patch-For-Review: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10604064 (Marostegui) I am repooling this host.
[07:18:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74072 and previous config saved to /var/cache/conftool/dbconfig/20250305-071827-root.json
[07:23:18] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "Instance looks good, thank you Cole" [puppet] - https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: Cwhite)
[07:24:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10604071 (phaultfinder)
[07:27:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:33:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74073 and previous config saved to /var/cache/conftool/dbconfig/20250305-073333-root.json
[07:38:28] <wikibugs>	 (PS1) Muehlenhoff: Add hcoplin to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1124708 (https://phabricator.wikimedia.org/T387459)
[07:40:42] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add hcoplin to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1124708 (https://phabricator.wikimedia.org/T387459) (owner: Muehlenhoff)
[07:41:37] <wikibugs>	 (CR) Fabfur: [C:+2] cache,haproxy: create tmpfile configuration for tls [puppet] - https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) (owner: Fabfur)
[07:43:02] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), No backups: 1 (backup1013), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:46:46] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Data-Platform-SRE (2025.03.01 - 2025.03.21), Patch-For-Review: Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10604083 (MoritzMuehlenhoff) Open→Resolved @HCoplin-WMF I've...
[07:46:57] <jynus>	 checking backups
[07:48:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74074 and previous config saved to /var/cache/conftool/dbconfig/20250305-074838-root.json
[07:49:55] <wikibugs>	 (PS1) Muehlenhoff: Add wrai to releasers-mobile [puppet] - https://gerrit.wikimedia.org/r/1124709 (https://phabricator.wikimedia.org/T387786)
[07:55:29] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add wrai to releasers-mobile [puppet] - https://gerrit.wikimedia.org/r/1124709 (https://phabricator.wikimedia.org/T387786) (owner: Muehlenhoff)
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T0800).
[08:00:05] <jouncebot>	 anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:09] <anzx>	 o/
[08:00:16] <hashar>	 o/
[08:00:17] <hashar>	 good morning
[08:00:26] <anzx>	 good morning 
[08:00:35] <hashar>	 let me check the server logs before we start :)
[08:01:54] <hashar>	 looks like those servers are not doing much overnight
[08:02:33] <hashar>	 anzx: can we really specify an IP range as `'192.42.83.144 - 192.42.83.159'`?
[08:03:38] <anzx>	 hashar: it was done on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124541
[08:03:42] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786#10604094 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @WRai-WMF I've just enabled your access, you should now be able to log into release...
[08:03:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74075 and previous config saved to /var/cache/conftool/dbconfig/20250305-080343-root.json
[08:04:00] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604097 (MoritzMuehlenhoff) Open→Stalled
[08:04:05] <hashar>	 and I swear I reviewed/wrote the code handling IP addresses :)
[08:04:46] <hashar>	 I'll deploy both at the same time
[08:05:09] <anzx>	 ok, one minute, I will update the commit message
[08:05:16] <hashar>	 sure
[08:05:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:46] <wikibugs>	 (PS3) Anzx: Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568)
[08:06:00] <anzx>	 hashar: done
[08:06:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568) (owner: Anzx)
[08:06:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124547 (https://phabricator.wikimedia.org/T377921) (owner: Anzx)
[08:07:31] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10604098 (Marostegui) Repooled @wiki_willy I emailed Dell about this host (in the existing thread we have with them) but so far there's been no reply. Do you want to keep this ticket open and...
[08:07:35] <wikibugs>	 (Merged) jenkins-bot: Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568) (owner: Anzx)
[08:07:40] <wikibugs>	 (Merged) jenkins-bot: sewikimedia: update wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1124547 (https://phabricator.wikimedia.org/T377921) (owner: Anzx)
[08:07:48] <hashar>	 the sewikimedia logo would need some URL, wouldn't it?
[08:08:35] <anzx>	 I don't think so
[08:08:40] <logmsgbot>	 !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1124541|Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 (T387568)]], [[gerrit:1124547|sewikimedia: update wordmark and tagline (T377921)]]
[08:08:42] <hashar>	 :)
[08:08:44] <stashbot>	 T387568: Request list off IP cap Illinois Institute of Technology March 12, 2025 - https://phabricator.wikimedia.org/T387568
[08:08:44] <stashbot>	 T377921: Wikimedia Sverige logo distorted by new skin - https://phabricator.wikimedia.org/T377921
[08:11:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:13:43] <hashar>	  08:13:05 Started check-testservers
[08:13:57] <logmsgbot>	 !log hashar@deploy2002 hashar, anzx: Backport for [[gerrit:1124541|Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 (T387568)]], [[gerrit:1124547|sewikimedia: update wordmark and tagline (T377921)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:14:01] <stashbot>	 T387568: Request list off IP cap Illinois Institute of Technology March 12, 2025 - https://phabricator.wikimedia.org/T387568
[08:14:01] <stashbot>	 T377921: Wikimedia Sverige logo distorted by new skin - https://phabricator.wikimedia.org/T377921
[08:14:02] <anzx>	 hashar: logo looks good 
[08:14:05] <logmsgbot>	 !log hashar@deploy2002 hashar, anzx: Continuing with sync
[08:14:09] <hashar>	 you are fast :)
[08:15:46] <anzx>	 I opened the link and refreshed the page in Firefox, since WikimediaDebug was not working on Chrome; the logo was updated
[08:17:05] <hashar>	 what is broken with WikimediaDebug?  We made some changes recently :)
[08:17:11] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10604115 (wiki_willy) Hi @Marostegui - thanks for checking.  When I look back at previous email from Dell Support sent in November, MarcoAntonio says //"we can temporarily archive the case, an...
[08:19:20] <anzx>	 hashar: it's still not available in Chrome, but in Firefox it's available
[08:19:25] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10604117 (Marostegui) Thanks @wiki_willy - I thought the email was just a thread and not handled via some internal ticketing system.  Let's leave this open for now so we don't forget. If there...
[08:20:43] <logmsgbot>	 !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124541|Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 (T387568)]], [[gerrit:1124547|sewikimedia: update wordmark and tagline (T377921)]] (duration: 12m 02s)
[08:20:47] <stashbot>	 T387568: Request list off IP cap Illinois Institute of Technology March 12, 2025 - https://phabricator.wikimedia.org/T387568
[08:20:47] <stashbot>	 T377921: Wikimedia Sverige logo distorted by new skin - https://phabricator.wikimedia.org/T377921
[08:22:05] <anzx>	 hashar: logo change looks good with wmdebug turned off, thanks for deploying
[08:22:22] <hashar>	 thank you for taking care of those
[08:22:39] <hashar>	 anzx: for WikimediaDebug that is because we have done a major migration of its code base (manifest v2 to v3)
[08:22:48] <wikibugs>	 (03CR) 10DCausse: "lgtm but the chart version might need to be updated" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 (owner: 10Ebernhardson)
[08:23:04] <anzx>	 ok
[08:23:07] <wikibugs>	 (03PS1) 10Muehlenhoff: Bitu: Also point to idm-help@w.o for password resets [software/bitu] - 10https://gerrit.wikimedia.org/r/1124718
[08:23:08] <hashar>	 the new version is under review and somehow the old one got flagged for removal cause it is "obsolete"
[08:23:26] <hashar>	 https://phabricator.wikimedia.org/T387822#10603735
[08:23:41] <wikibugs>	 (03PS1) 10Volans: reports/librenms: fix f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124719
[08:23:47] <hashar>	 and the large task is "upgrade to manifest v3" https://phabricator.wikimedia.org/T312694
[08:24:07] <hashar>	 I don't have a workaround for Chrome short of loading the extension from source
[08:24:10] <hashar>	 else use Firefox :-]
[08:27:21] <anzx>	 I didn't try loading from source; I saw the task and tried it in Firefox instead
[08:27:39] <hashar>	 sounds good :)
[08:27:54] <hashar>	 hopefully the extension will be published in the Chrome store soonish
[08:30:52] <wikibugs>	 06SRE, 07LDAP, 13Patch-For-Review: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10604136 (10MoritzMuehlenhoff) >>! In T386472#10573065, @Urbanecm wrote: > Noting @jrbs was added to the group in T220860, in order to be able to run ch...
[08:33:22] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:34:13] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Prepare backup2013 to take over codfw backups of es* dbs [puppet] - 10https://gerrit.wikimedia.org/r/1124720 (https://phabricator.wikimedia.org/T387892)
[08:34:53] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604144 (10Ben.buchenau) Thanks @MoritzMuehlenhoff , confused my Phabricator with the developer account. Just created a developer account, named Ben.buchenau (ssh access...
[08:34:55] <wikibugs>	 (03CR) 10DCausse: [C:03+1] "thanks, good catch!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122151 (https://phabricator.wikimedia.org/T375520) (owner: 10Bking)
[08:40:16] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Prepare backup2013 to take over codfw backups of es* dbs [puppet] - 10https://gerrit.wikimedia.org/r/1124720 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[08:41:14] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] reports/librenms: fix f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124719 (owner: 10Volans)
[08:41:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:42:42] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[08:44:51] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604156 (10MoritzMuehlenhoff)
[08:44:53] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[08:44:58] <wikibugs>	 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10604157 (10Volans) AFAICS we are still missing the AAAA record on all of the hosts listed in the task description.
[08:45:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet
[08:45:46] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604160 (10ops-monitoring-bot) Draining ganeti1032.eqiad.wmnet of running VMs
[08:46:14] <wikibugs>	 (03PS1) 10Volans: reports/network: update no AAAA records list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124722
[08:46:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet
[08:46:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] reports/network: update no AAAA records list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124722 (owner: 10Volans)
[08:46:59] <wikibugs>	 (03CR) 10Volans: reports/network: update no AAAA records list (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124722 (owner: 10Volans)
[08:47:03] <wikibugs>	 (03CR) 10Volans: [C:03+2] reports/librenms: fix f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124719 (owner: 10Volans)
[08:47:48] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM. We probably need to clean up that page a bit, it's getting a little messy." [software/bitu] - 10https://gerrit.wikimedia.org/r/1124718 (owner: 10Muehlenhoff)
[08:48:24] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:49:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to drbd
[08:50:08] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[08:50:15] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[08:50:46] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604164 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet switching disk type to drbd
[08:50:49] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:51:09] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:51:30] <wikibugs>	 (03Merged) 10jenkins-bot: network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[08:51:30] <wikibugs>	 (03Merged) 10jenkins-bot: reports/librenms: fix f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124719 (owner: 10Volans)
[08:52:54] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:53:05] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:54:46] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "network_devices: adding device model - tappof@cumin1002 - T387231"
[08:54:51] <stashbot>	 T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231
[08:55:09] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1124641 (https://phabricator.wikimedia.org/T385141) (owner: 10Marostegui)
[08:55:52] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "network_devices: adding device model - tappof@cumin1002 - T387231"
[08:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:58:39] <wikibugs>	 (03CR) 10Ayounsi: pdu_config_netbox: add new module to grab PDUs from netbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[08:59:09] <jinxer-wm>	 FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[08:59:31] <wikibugs>	 (03CR) 10Ayounsi: pdu_config_netbox: add new module to grab PDUs from netbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[09:00:05] <jouncebot>	 hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T0900)
[09:00:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Bitu: Also point to idm-help@w.o for password resets [software/bitu] - 10https://gerrit.wikimedia.org/r/1124718 (owner: 10Muehlenhoff)
[09:00:45] <wikibugs>	 10SRE-swift-storage: IPv6 records inconsistent on the ms-be hosts - https://phabricator.wikimedia.org/T320947#10604194 (10Volans) As of today `ms-be2057` is the only host left without AAAA record, all the others have it. It would be great if it could be fixed.
[09:01:24] <wikibugs>	 (03Abandoned) 10Volans: reports/network: update no AAAA records list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124722 (owner: 10Volans)
[09:02:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:03:39] <hashar>	 ok
[09:03:45] * hashar flexes finger muscles
[09:04:07] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[09:04:14] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db1250 [puppet] - 10https://gerrit.wikimedia.org/r/1124641 (https://phabricator.wikimedia.org/T385141) (owner: 10Marostegui)
[09:04:55] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[09:05:05] <wikibugs>	 (03CR) 10Ayounsi: "Make sure to update https://github.com/wikimedia/operations-puppet/blob/08eefaa046b24853b51919047bf7515c315af28c/modules/netbox/types/devi" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[09:05:16] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124726 (https://phabricator.wikimedia.org/T386214)
[09:05:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124726 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot)
[09:05:23] <hashar>	 tchou tchou
[09:05:43] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: cloning
[09:06:03] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124726 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot)
[09:07:04] <marostegui>	 !log Stop db1217:3321 to clone db1250 T385141
[09:07:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:07] <stashbot>	 T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141
[09:07:24] <wikibugs>	 (03PS1) 10Federico Ceratto: sre.mysql.pool: fix hostname check logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572)
[09:07:41] <wikibugs>	 (03CR) 10Volans: [C:03+1] "My bad, I hadn't noticed we already had the slug available." [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[09:08:58] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1202 gradually with 4 steps - Cloned db1202 to db1253
[09:09:03] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1202 gradually with 4 steps - Cloned db1202 to db1253
[09:09:41] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] sre.mysql.pool: fix hostname check logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) (owner: 10Federico Ceratto)
[09:09:48] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:10:14] <marostegui>	 ^expected
[09:10:34] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:12:47] <wikibugs>	 (03PS1) 10Volans: CI: future-proof prospector config [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124729
[09:13:56] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:14:01] <wikibugs>	 10SRE-swift-storage: IPv6 records inconsistent on the ms-be hosts - https://phabricator.wikimedia.org/T320947#10604227 (10MatthewVernon) I expect it to be refreshed in Q1 or maybe Q2 (purchase date was 2020-08-11).
[09:14:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to drbd
[09:14:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet
[09:14:46] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: 10Jgiannelos)
[09:14:49] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604229 (10ops-monitoring-bot) Draining ganeti1032.eqiad.wmnet of running VMs
[09:15:02] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[09:15:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet
[09:15:12] <godog>	 !log upgrade to karma 0.120 - T353457
[09:15:15] <stashbot>	 godog: Failed to log message to wiki. Somebody should check the error logs.
[09:15:16] <stashbot>	 T353457: Karma UI shows duplicate alerts - https://phabricator.wikimedia.org/T353457
[09:15:33] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.19  refs T386214
[09:15:36] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti1032 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1124730
[09:15:36] <stashbot>	 T386214: 1.44.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T386214
[09:15:47] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add exporter port to gNMI metrics instance label [puppet] - 10https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[09:15:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to plain
[09:16:10] <wikibugs>	 (03Merged) 10jenkins-bot: pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: 10Jgiannelos)
[09:16:14] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604235 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet switching disk type to plain
[09:16:24] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[09:16:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to plain
[09:17:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd
[09:17:45] <jinxer-wm>	 FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:17:55] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604237 (10ops-monitoring-bot) VM dse-k8s-etcd1001.eqiad.wmnet switching disk type to drbd
[09:18:53] <jynus>	 !log deploy new backup grants for es1036,es1040 T387892
[09:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:56] <stashbot>	 T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892
[09:19:22] <wikibugs>	 (03CR) 10DCausse: [C:03+1] cloudelastic: begin transition to opensearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking)
[09:19:39] <wikibugs>	 (03PS1) 10Tiziano Fogli: Revert "network_devices: adding device model" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124731
[09:19:56] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:20:42] <icinga-wm>	 PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:20:59] <hashar>	 logs are happy at least
[09:22:35] <marostegui>	 federico3: your changes to db1202 aren't committed
[09:22:38] <marostegui>	 federico3: can you check?
[09:22:43] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access to logstash for cn=wmf [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790)
[09:22:44] <marostegui>	 (the alert above)
[09:22:45] <jinxer-wm>	 FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:23:05] <federico3>	 looking
[09:23:08] <jynus>	 !log deploy new backup grants for es2036,es2040 T387892
[09:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:10] <federico3>	 yes, the pooling-in cookbook just tripped on the comma again
[09:25:12] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: remove 'default' receiver when duplicated [puppet] - 10https://gerrit.wikimedia.org/r/1124733 (https://phabricator.wikimedia.org/T353457)
[09:26:27] <marostegui>	 federico3: you can commit them manually to clear the alert for now if you like
[09:26:58] <federico3>	 give me 1 minute
[09:27:28] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] CI: future-proof prospector config [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124729 (owner: 10Volans)
[09:27:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd
[09:27:34] <icinga-wm>	 PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[09:28:07] <wikibugs>	 (03CR) 10Volans: [C:03+2] CI: future-proof prospector config [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124729 (owner: 10Volans)
[09:28:19] <icinga-wm>	 RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[09:28:38] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "nice!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124396 (owner: 10Volans)
[09:29:17] <wikibugs>	 (03PS1) 10Tiziano Fogli: fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734
[09:29:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff)
[09:30:05] <wikibugs>	 (03Merged) 10jenkins-bot: CI: future-proof prospector config [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124729 (owner: 10Volans)
[09:30:22] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "nice!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans)
[09:30:39] <wikibugs>	 (03PS3) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858)
[09:30:44] <wikibugs>	 (03PS2) 10Federico Ceratto: sre.mysql.pool: allow merging unexpected changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572)
[09:30:53] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1202 gradually with 4 steps - Cloned db1202 to db1253
[09:31:13] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] sre.mysql.pool: allow merging unexpected changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) (owner: 10Federico Ceratto)
[09:31:31] <federico3>	 marostegui: committing manually
[09:31:35] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1202 gradually with 4 steps - Cloned db1202 to db1253
[09:32:09] <marostegui>	 federico3: thanks
[09:32:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Cloned db1202 to db1253', diff saved to https://phabricator.wikimedia.org/P74077 and previous config saved to /var/cache/conftool/dbconfig/20250305-093249-fceratto.json
[09:32:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74078 and previous config saved to /var/cache/conftool/dbconfig/20250305-093254-root.json
[09:32:58] <wikibugs>	 (03CR) 10Muehlenhoff: Remove access to logstash for cn=wmf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff)
[09:33:02] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1202 gradually with 4 steps - Cloned db1202 to db1253
[09:34:19] <moritzm>	 ^ the puppet alert from above is being worked on 
[09:34:36] <wikibugs>	 (03PS1) 10Volans: sre.puppet.sync-netbox-hiera: add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/1124735
[09:34:57] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:35:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet
[09:35:33] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604341 (10ops-monitoring-bot) Draining ganeti1032.eqiad.wmnet of running VMs
[09:35:43] <icinga-wm>	 RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[09:35:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Remove access to logstash for cn=wmf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff)
[09:35:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet
[09:36:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain
[09:36:44] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604354 (10ops-monitoring-bot) VM dse-k8s-etcd1001.eqiad.wmnet switching disk type to plain
[09:36:45] <wikibugs>	 (03PS2) 10Tiziano Fogli: fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734
[09:36:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain
[09:37:31] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] sre.puppet.sync-netbox-hiera: add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/1124735 (owner: 10Volans)
[09:37:35] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove access to logstash for cn=wmf [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790)
[09:37:52] <wikibugs>	 (03CR) 10Muehlenhoff: Remove access to logstash for cn=wmf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff)
[09:38:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM! Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff)
[09:38:40] <wikibugs>	 (03PS1) 10Slyngshede: Show existing approvals on permission approval pages [software/bitu] - 10https://gerrit.wikimedia.org/r/1124736
[09:38:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet
[09:38:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet
[09:39:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet
[09:39:15] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604369 (10ops-monitoring-bot) Draining ganeti1032.eqiad.wmnet of running VMs
[09:39:18] <wikibugs>	 (03PS3) 10Tiziano Fogli: fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734
[09:39:43] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] miscweb: add support for external-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123738 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto)
[09:39:54] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] validating-admission-policies: Be more explicit in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124415 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm)
[09:41:01] <wikibugs>	 (03Merged) 10jenkins-bot: validating-admission-policies: Be more explicit in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124415 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm)
[09:42:52] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] sre.mysql.pool: allow merging unexpected changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) (owner: 10Federico Ceratto)
[09:42:58] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.pool: allow merging unexpected changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) (owner: 10Federico Ceratto)
[09:46:54] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734 (owner: 10Tiziano Fogli)
[09:47:07] <wikibugs>	 (03CR) 10Volans: [C:03+2] sre.puppet.sync-netbox-hiera: add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/1124735 (owner: 10Volans)
[09:47:17] <icinga-wm>	 PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[09:47:59] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734 (owner: 10Tiziano Fogli)
[09:48:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74081 and previous config saved to /var/cache/conftool/dbconfig/20250305-094759-root.json
[09:53:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604396 (10Aklapper) @Ben.buchenau Feel free to [connect](https://phabricator.wikimedia.org/settings/panel/external/) your LDAP/developer account to [your Phab account](...
[09:54:22] <wikibugs>	 (03Merged) 10jenkins-bot: sre.puppet.sync-netbox-hiera: add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/1124735 (owner: 10Volans)
[09:55:13] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Migrate es backups from backup[12]02 to backup[12]13 [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892)
[09:55:44] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1202 gradually with 4 steps - Cloned db1202 to db1253
[09:56:14] <wikibugs>	 (03Abandoned) 10Tiziano Fogli: Revert "network_devices: adding device model" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124731 (owner: 10Tiziano Fogli)
[09:56:49] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:56:55] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:57:51] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[09:58:05] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[09:58:09] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[09:58:37] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[09:58:46] <wikibugs>	 (03PS24) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[09:59:11] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604417 (10MoritzMuehlenhoff) @Ben.buchenau You don't seem to have an NDA on record yet. I'm adding @KFrancis from the Wikimedia Legal department to set this up.
[09:59:21] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604418 (10MoritzMuehlenhoff) p:05Triage→03Medium
[10:00:17] <icinga-wm>	 RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[10:02:27] <wikibugs>	 (03CR) 10Jcrespo: "heads up of this migration, will test it before the end of today ^" [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[10:02:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:03:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74083 and previous config saved to /var/cache/conftool/dbconfig/20250305-100304-root.json
[10:03:28] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Migrate es backups from backup[12]02 to backup[12]13 [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892)
[10:05:11] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 14912MiB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[10:05:22] <wikibugs>	 (03PS25) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023)
[10:05:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto)
[10:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:06:32] <wikibugs>	 (03CR) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto)
[10:06:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: "The alert is firing atm for ctrl and worker for k8s-aux (https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=" [alerts] - 10https://gerrit.wikimedia.org/r/1124453 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron)
[10:12:45] <jinxer-wm>	 FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:13:46] <wikibugs>	 (03PS1) 10Federico Ceratto: instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254 [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141)
[10:17:45] <jinxer-wm>	 RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:18:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74084 and previous config saved to /var/cache/conftool/dbconfig/20250305-101810-root.json
[10:20:26] <wikibugs>	 (03PS2) 10JMeybohm: mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478
[10:23:03] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: split envoy rules into separate groups [puppet] - 10https://gerrit.wikimedia.org/r/1124743 (https://phabricator.wikimedia.org/T387965)
[10:23:05] <wikibugs>	 (03PS12) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231)
[10:23:08] <wikibugs>	 (03PS25) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231)
[10:25:39] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: haproxy: don't log normal connections [puppet] - 10https://gerrit.wikimedia.org/r/1124744
[10:26:22] <effie>	 hashar: may I use the rest of your window for a shallbox change?
[10:26:27] <effie>	 shellbox, lol 
[10:26:37] <wikibugs>	 (03CR) 10David Caro: [C:03+2] toolforge: haproxy: don't log normal connections [puppet] - 10https://gerrit.wikimedia.org/r/1124744 (owner: 10Arturo Borrero Gonzalez)
[10:27:11] <Lucas_WMDE>	 “you shallbox pass”? ^^
[10:27:29] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] CommonSettings.php: Remove $wgSecurePollGPGCommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124514 (owner: 10Reedy)
[10:28:25] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: haproxy: don't log normal connections [puppet] - 10https://gerrit.wikimedia.org/r/1124744 (owner: 10Arturo Borrero Gonzalez)
[10:29:35] <wikibugs>	 (03PS1) 10Máté Szabó: Remove unused $wgSecurePollGPGCommand setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124745 (https://phabricator.wikimedia.org/T380441)
[10:30:57] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
[10:31:15] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet
[10:32:51] <elukey>	 !log restart kube-apiserver on ml-staging-ctrl200[12] after the move to containerd (some issues registered)
[10:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74085 and previous config saved to /var/cache/conftool/dbconfig/20250305-103316-root.json
[10:37:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:18] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: trial moving k8s-mlstaging to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232)
[10:38:24] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet
[10:38:26] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
[10:38:35] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:44:12] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "The different math approaches have been discussed already, I have no strong opinions towards one or the other approach, so I think that ov" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[10:45:30] <wikibugs>	 (03PS3) 10JMeybohm: mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478
[10:45:39] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[10:45:56] <wikibugs>	 (03CR) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[10:49:09] <jinxer-wm>	 RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[10:51:17] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[10:51:31] <wikibugs>	 (03CR) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[10:51:47] <wikibugs>	 (03PS4) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858)
[10:53:09] <wikibugs>	 (03PS5) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858)
[10:54:03] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[10:55:28] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[10:55:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74086 and previous config saved to /var/cache/conftool/dbconfig/20250305-105534-root.json
[10:56:01] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[10:56:11] <jinxer-wm>	 FIRING: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[10:56:13] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[10:57:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[10:57:19] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[10:57:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: add sync-data script [puppet] - 10https://gerrit.wikimedia.org/r/1124749 (https://phabricator.wikimedia.org/T383232)
[10:57:52] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[10:57:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[10:57:59] <wikibugs>	 (03PS4) 10JMeybohm: mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1100)
[11:01:11] <jinxer-wm>	 RESOLVED: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[11:07:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: replace prometheus::migration with prometheus-sync-data [puppet] - 10https://gerrit.wikimedia.org/r/1124751 (https://phabricator.wikimedia.org/T383232)
[11:07:49] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: no reason specified, no task ID specified]
[11:07:59] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: no reason specified, no task ID specified]
[11:09:00] <wikibugs>	 (03PS2) 10Elukey: profile::dns::auth::discovery-map: prefer codfw over eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122627 (https://phabricator.wikimedia.org/T380858)
[11:09:30] <wikibugs>	 (03PS1) 10Effie Mouzeli: shellbox-video: disable debug logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124752
[11:10:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I'll be merging this early next week" [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[11:10:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74087 and previous config saved to /var/cache/conftool/dbconfig/20250305-111040-root.json
[11:11:35] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::dns::auth::discovery-map: prefer codfw over eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122627 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey)
[11:13:31] <wikibugs>	 (03PS2) 10Tiziano Fogli: network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124741 (https://phabricator.wikimedia.org/T387231)
[11:16:07] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124741 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[11:20:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:23:26] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124741 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[11:23:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:25:11] <wikibugs>	 (03PS5) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836)
[11:25:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74088 and previous config saved to /var/cache/conftool/dbconfig/20250305-112545-root.json
[11:27:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:29:13] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[11:29:15] <wikibugs>	 (03CR) 10Btullis: "I agree with elukey here. We don't need a new partition recipe for the dse-k8s control plane nodes, so you can simply abandon this change." [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene)
[11:29:25] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[11:29:38] <wikibugs>	 (03Merged) 10jenkins-bot: network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124741 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[11:29:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10604726 (10phaultfinder)
[11:31:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74089 and previous config saved to /var/cache/conftool/dbconfig/20250305-113126-root.json
[11:32:44] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet
[11:34:35] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:34:48] <logmsgbot>	 !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "network_devices: adding device model - tappof@cumin1002 - T387231"
[11:34:52] <stashbot>	 T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231
[11:35:04] <logmsgbot>	 !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "network_devices: adding device model - tappof@cumin1002 - T387231"
[11:37:24] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:38:05] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet
[11:38:35] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:38:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto)
[11:40:52] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74090 and previous config saved to /var/cache/conftool/dbconfig/20250305-114051-root.json
[11:46:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74091 and previous config saved to /var/cache/conftool/dbconfig/20250305-114632-root.json
[11:50:06] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 (owner: 10JMeybohm)
[11:50:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:55:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74092 and previous config saved to /var/cache/conftool/dbconfig/20250305-115557-root.json
[12:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1100)
[12:00:05] <jouncebot>	 mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1200).
[12:01:31] <wikibugs>	 (03PS2) 10Slyngshede: Upgrade idp-test to 7.1.4 [dns] - 10https://gerrit.wikimedia.org/r/1124376
[12:01:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74093 and previous config saved to /var/cache/conftool/dbconfig/20250305-120138-root.json
[12:02:52] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Upgrade idp-test to 7.1.4 [dns] - 10https://gerrit.wikimedia.org/r/1124376 (owner: 10Slyngshede)
[12:03:07] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[12:05:15] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[12:07:17] <wikibugs>	 (03Abandoned) 10Stevemunene: Create dse-k8s control panel partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene)
[12:07:26] <wikibugs>	 (03CR) 10Stevemunene: "Ack, Thanks  @btullis@wikimedia.org and @ltoscano@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene)
[12:09:15] <wikibugs>	 (03PS1) 10Elukey: profile::dns::auth::discovery-map: fix eqiad private config [puppet] - 10https://gerrit.wikimedia.org/r/1124755 (https://phabricator.wikimedia.org/T380858)
[12:09:28] <nemo-yiannis>	 heads up, i am planning to deploy changeprop for T387277
[12:09:29] <stashbot>	 T387277: Rollout more wikis after week 1 of testing with production traffic - https://phabricator.wikimedia.org/T387277
[12:09:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] profile::dns::auth::discovery-map: fix eqiad private config [puppet] - 10https://gerrit.wikimedia.org/r/1124755 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey)
[12:10:17] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: haproxy: check ingress workers with the /healthz endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1124756 (https://phabricator.wikimedia.org/T387959)
[12:10:32] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply
[12:10:45] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[12:11:09] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Update policy for K8s BGP to allow a wider range of v4 prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/1121438 (https://phabricator.wikimedia.org/T375845) (owner: 10Cathal Mooney)
[12:11:26] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::dns::auth::discovery-map: fix eqiad private config [puppet] - 10https://gerrit.wikimedia.org/r/1124755 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey)
[12:12:44] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[12:13:24] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[12:13:32] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[12:13:42] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[12:13:46] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[12:14:44] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[12:15:03] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "thank you! I can't believe I'm seeing this day <3" [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff)
[12:16:13] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:16:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74094 and previous config saved to /var/cache/conftool/dbconfig/20250305-121643-root.json
[12:16:46] <wikibugs>	 (03CR) 10David Caro: [C:03+1] toolforge: haproxy: check ingress workers with the /healthz endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1124756 (https://phabricator.wikimedia.org/T387959) (owner: 10Arturo Borrero Gonzalez)
[12:17:46] <wikibugs>	 (03PS1) 10Gergő Tisza: CentralAuth: Enable SUL3 signup on group 0 (attempt 4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124757 (https://phabricator.wikimedia.org/T384007)
[12:17:57] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122999 (owner: 10PipelineBot)
[12:19:24] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122999 (owner: 10PipelineBot)
[12:19:56] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[12:20:24] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[12:20:40] <tgr_>	 rzl: I finally got around to fixing the CentralAuth multi-DC patch ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123029 ). Should I schedule it in a puppet or infra window, or can it go through normal code review? In the latter case, do you know who I should add as a reviewer?
[12:21:20] <jinxer-wm>	 FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[12:21:38] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[12:21:45] <wikibugs>	 (03PS1) 10Slyngshede: Add missing secrets for OIDC in IDP [labs/private] - 10https://gerrit.wikimedia.org/r/1124758
[12:22:02] <wikibugs>	 (03PS1) 10Ladsgroup: Enable thumbnail steps in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124759 (https://phabricator.wikimedia.org/T360589)
[12:22:19] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Add missing secrets for OIDC in IDP [labs/private] - 10https://gerrit.wikimedia.org/r/1124758 (owner: 10Slyngshede)
[12:22:37] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] Add missing secrets for OIDC in IDP [labs/private] - 10https://gerrit.wikimedia.org/r/1124758 (owner: 10Slyngshede)
[12:22:39] <wikibugs>	 (03PS1) 10Ladsgroup: maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124761 (https://phabricator.wikimedia.org/T351953)
[12:22:42] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[12:22:52] <wikibugs>	 (03PS1) 10Ladsgroup: maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124762 (https://phabricator.wikimedia.org/T351953)
[12:23:24] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[12:23:24] <Amir1>	 jouncebot: nownandnext
[12:23:33] <Amir1>	 jouncebot: nowandnext
[12:23:34] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1100)
[12:23:34] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1200)
[12:23:34] <jouncebot>	 In 1 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400)
[12:23:55] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124761 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[12:23:55] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[12:24:01] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5029/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:24:06] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124762 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[12:24:37] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[12:24:53] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5030/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:25:11] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[12:25:44] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5031/co" [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:26:27] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5032/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede)
[12:27:25] <Dreamy_Jazz>	 jouncebot: nowandnext
[12:27:26] <jouncebot>	 For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1100)
[12:27:26] <jouncebot>	 For the next 0 hour(s) and 32 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1200)
[12:27:26] <jouncebot>	 In 1 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400)
[12:27:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1124479 (owner: 10Ahmon Dancy)
[12:27:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] envoy: Update examples [puppet] - 10https://gerrit.wikimedia.org/r/1124479 (owner: 10Ahmon Dancy)
[12:27:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Add component/lshw on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1124763 (https://phabricator.wikimedia.org/T380295)
[12:27:47] <wikibugs>	 (03PS1) 10Muehlenhoff: Install lshw backport from component/lshw [puppet] - 10https://gerrit.wikimedia.org/r/1124764 (https://phabricator.wikimedia.org/T380295)
[12:27:59] <hnowlan>	 please avoid doing any scap deploys during this window
[12:28:55] <wikibugs>	 (03PS3) 10Slyngshede: C:apereo_cas Specify encryption algorithms for CAS 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892)
[12:30:50] <Dreamy_Jazz>	 Sure.
[12:31:10] <Dreamy_Jazz>	 Thanks for the info
[12:31:12] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:31:41] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: fix hostnames for citoid requests [puppet] - 10https://gerrit.wikimedia.org/r/1124766 (https://phabricator.wikimedia.org/T361576)
[12:31:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74095 and previous config saved to /var/cache/conftool/dbconfig/20250305-123149-root.json
[12:31:55] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 06serviceops, 07Kubernetes: Remove `.cluster.local.` suffix in PTR responses - https://phabricator.wikimedia.org/T376762#10604979 (10MoritzMuehlenhoff)
[12:32:32] <wikibugs>	 (03CR) 10Mvolz: [C:03+1] trafficserver: fix hostnames for citoid requests [puppet] - 10https://gerrit.wikimedia.org/r/1124766 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan)
[12:32:45] <wikibugs>	 06SRE, 06SRE Observability, 13Patch-For-Review: etcd: adapt etcd-backup.py for etcd 3.4 - https://phabricator.wikimedia.org/T385727#10604981 (10MoritzMuehlenhoff)
[12:33:20] <wikibugs>	 (03PS1) 10Effie Mouzeli: shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038)
[12:33:34] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: sqlite::db can get stuck on zero byte file database - https://phabricator.wikimedia.org/T387112#10604983 (10MoritzMuehlenhoff)
[12:35:20] <Emperor>	 !log restart envoy/swift on ms-fe2010
[12:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:59] <wikibugs>	 (03PS1) 10Dreamy Jazz: Temporarily unset temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124768 (https://phabricator.wikimedia.org/T387205)
[12:36:40] <wikibugs>	 (03CR) 10Klausman: [C:03+1] prometheus: trial moving k8s-mlstaging to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[12:36:40] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124768 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz)
[12:36:44] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122997 (owner: 10PipelineBot)
[12:36:59] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122927 (owner: 10PipelineBot)
[12:37:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124761 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[12:37:08] <wikibugs>	 (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122976 (owner: 10PipelineBot)
[12:37:08] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124762 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[12:37:24] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:38:39] <wikibugs>	 (03PS1) 10Btullis: Replace the production SSH key for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1124769 (https://phabricator.wikimedia.org/T385943)
[12:38:40] <wikibugs>	 (03Merged) 10jenkins-bot: maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124761 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[12:38:45] <wikibugs>	 (03Merged) 10jenkins-bot: maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124762 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup)
[12:39:16] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]]
[12:39:20] <stashbot>	 T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[12:39:35] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:40:04] <Dreamy_Jazz>	 Amir1: hnowlan asked for no deploys during this window.
[12:40:14] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124764 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff)
[12:41:20] <jinxer-wm>	 RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[12:42:40] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:44:15] <hnowlan>	 Amir1: please wait for 15 minutes or so if possible 
[12:45:57] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:49:52] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] "Eurgh. Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: 10Daimona Eaytoy)
[12:50:01] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[12:50:07] <Amir1>	 shit
[12:50:20] <Amir1>	 aborted
[12:54:24] <wikibugs>	 (03PS1) 10Elukey: Revert "profile::dns::auth::discovery-map: prefer codfw over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124772
[12:54:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "profile::dns::auth::discovery-map: prefer codfw over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124772 (owner: 10Elukey)
[12:55:04] <wikibugs>	 (03Abandoned) 10Elukey: Revert "profile::dns::auth::discovery-map: prefer codfw over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124772 (owner: 10Elukey)
[12:55:40] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1124769 (https://phabricator.wikimedia.org/T385943) (owner: 10Btullis)
[12:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:01:15] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] "Suggestions have been applied in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1124741." [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[13:02:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "SSH has been confirmed via out-of-band channel (Slack)" [puppet] - 10https://gerrit.wikimedia.org/r/1124769 (https://phabricator.wikimedia.org/T385943) (owner: 10Btullis)
[13:02:51] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Replace the production SSH key for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1124769 (https://phabricator.wikimedia.org/T385943) (owner: 10Btullis)
[13:03:44] <wikibugs>	 (03PS13) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231)
[13:03:44] <wikibugs>	 (03PS26) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231)
[13:03:45] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 (owner: 10JMeybohm)
[13:03:50] <wikibugs>	 (03PS1) 10Elukey: Revert "profile::dns::auth::discovery-map: fix eqiad private config" [puppet] - 10https://gerrit.wikimedia.org/r/1124773
[13:03:58] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Revert "profile::dns::auth::discovery-map: fix eqiad private config" [puppet] - 10https://gerrit.wikimedia.org/r/1124773 (owner: 10Elukey)
[13:04:22] <wikibugs>	 (03PS1) 10Elukey: Revert "profile::dns::auth::discovery-map: prefer codfw over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124774
[13:05:34] <wikibugs>	 (03PS1) 10Hnowlan: mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 (https://phabricator.wikimedia.org/T380858)
[13:05:45] <wikibugs>	 (03PS2) 10Hnowlan: mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 (https://phabricator.wikimedia.org/T380858)
[13:06:11] <elukey>	 hey folks, I am reverting eqiad back to its pooled state, should be ready in 10 mins
[13:06:21] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 (owner: 10JMeybohm)
[13:06:30] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: limit netbox reports alerts to eqiad and codfw [alerts] - 10https://gerrit.wikimedia.org/r/1124776 (https://phabricator.wikimedia.org/T350694)
[13:06:51] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[13:07:36] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[13:08:11] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::dns::auth::discovery-map: prefer codfw over eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122627 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey)
[13:08:33] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Revert "profile::dns::auth::discovery-map: prefer codfw over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124774 (owner: 10Elukey)
[13:09:13] <wikibugs>	 (03Merged) 10jenkins-bot: mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan)
[13:10:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet
[13:11:07] <wikibugs>	 (03CR) 10Tiziano Fogli: "This will be tested on Pontoon." [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[13:11:52] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]]
[13:11:56] <stashbot>	 T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[13:12:57] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1032.eqiad.wmnet with reason: remove from cluster for reimage
[13:13:03] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10605054 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=836a9ab9-c457-4a78-ab8b-24d0332b99af) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[13:13:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1032 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1124730 (owner: 10Muehlenhoff)
[13:15:04] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:16:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: 10Abijeet Patro)
[13:16:32] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[13:17:06] <wikibugs>	 (03PS1) 10Gergő Tisza: CentralAuthIdLookup: Reuse cached object on single-value lookup [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124781 (https://phabricator.wikimedia.org/T379909)
[13:17:12] <wikibugs>	 (03PS1) 10Gergő Tisza: CentralAuthIdLookup: Use primary DB after writes [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124782 (https://phabricator.wikimedia.org/T379909)
[13:17:16] <wikibugs>	 (03PS1) 10Gergő Tisza: Use UserOptionsManager for SUL3 rollout flag [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124783 (https://phabricator.wikimedia.org/T384549)
[13:17:18] <wikibugs>	 (03PS1) 10Gergő Tisza: Make SUL3 global preference optional and simplify logic [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124784
[13:17:18] <wikibugs>	 (03PS1) 10Gergő Tisza: Add passive central domain to edge login list [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124785 (https://phabricator.wikimedia.org/T375796)
[13:17:20] <wikibugs>	 (03PS1) 10Gergő Tisza: SUL3: Use a central wiki for autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124786 (https://phabricator.wikimedia.org/T387357)
[13:17:24] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:18:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1032.eqiad.wmnet
[13:19:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124781 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza)
[13:19:46] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124782 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza)
[13:20:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124783 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza)
[13:20:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124784 (owner: 10Gergő Tisza)
[13:20:20] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli)
[13:20:35] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124785 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza)
[13:20:48] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli)
[13:20:51] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124786 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza)
[13:20:57] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: fix hostnames for citoid requests [puppet] - 10https://gerrit.wikimedia.org/r/1124766 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan)
[13:22:07] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: Repool eqiad after maintenance, no task ID specified]
[13:22:42] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: Repool eqiad after maintenance, no task ID specified]
[13:23:03] <wikibugs>	 (03PS2) 10Effie Mouzeli: shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038)
[13:23:10] <elukey>	 eqiad is back to serving traffic, maintenance finished, thanks all!
[13:23:11] <wikibugs>	 (03CR) 10Effie Mouzeli: shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli)
[13:23:22] <wikibugs>	 (03PS1) 10Filippo Giunchedi: data-engineering: remove legacy eventlogging alerts [alerts] - 10https://gerrit.wikimedia.org/r/1124787 (https://phabricator.wikimedia.org/T238230)
[13:23:24] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] (duration: 11m 31s)
[13:23:27] <stashbot>	 T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953
[13:24:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Please review if I got everything, or we can nuke all eventlogging alerts altogether?" [alerts] - 10https://gerrit.wikimedia.org/r/1124787 (https://phabricator.wikimedia.org/T238230) (owner: 10Filippo Giunchedi)
[13:24:41] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli)
[13:25:47] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[13:26:16] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[13:26:33] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[13:26:52] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[13:27:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: deploy thumbor alerts to prometheus k8s [alerts] - 10https://gerrit.wikimedia.org/r/1124788 (https://phabricator.wikimedia.org/T379559)
[13:27:57] <logmsgbot>	 !log klausman@deploy2002 conftool action : set/pooled=yes; selector: name=inference
[13:28:10] <logmsgbot>	 !log klausman@deploy2002 conftool action : set/pooled=yes; selector: name=inference-staging
[13:34:35] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:54] <wikibugs>	 07sre-alert-triage, 10SRE Observability (FY2024/2025-Q3): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#10605171 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolved in the meantime
[13:35:26] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1124776 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi)
[13:36:12] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:39:35] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:40:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: open tasks for long standing lint problems [alerts] - 10https://gerrit.wikimedia.org/r/1124790 (https://phabricator.wikimedia.org/T309182)
[13:40:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] sre: limit netbox reports alerts to eqiad and codfw [alerts] - 10https://gerrit.wikimedia.org/r/1124776 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi)
[13:41:55] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[13:42:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet
[13:46:27] <jinxer-wm>	 FIRING: [2x] HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:48:28] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Thank you." [puppet] - 10https://gerrit.wikimedia.org/r/1124763 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff)
[13:48:46] <wikibugs>	 (03PS9) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836)
[13:49:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2154 db1167', diff saved to https://phabricator.wikimedia.org/P74096 and previous config saved to /var/cache/conftool/dbconfig/20250305-134936-marostegui.json
[13:50:27] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Index rebuild
[13:50:38] <Amir1>	 jouncebot: nowandnext
[13:50:38] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 9 minute(s)
[13:50:38] <jouncebot>	 In 0 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400)
[13:51:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1167.eqiad.wmnet
[13:51:07] <Amir1>	 wow that's packed, I'll do mine later then, going for lunch
[13:51:08] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2154.codfw.wmnet
[13:53:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet
[13:53:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1032.eqiad.wmnet
[13:53:17] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good, thanks! While safe to roll out, let us know if we should do it. (It's only fair after you did the patch :)))" [puppet] - 10https://gerrit.wikimedia.org/r/1124764 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff)
[13:55:13] <wikibugs>	 (03CR) 10KartikMistry: Enable CX unified dashboard on phase 2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson)
[13:55:35] <wikibugs>	 (03CR) 10Marostegui: "Why was db1253 in s2?" [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[13:57:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1167.eqiad.wmnet
[13:58:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2154.codfw.wmnet
[13:58:32] <logmsgbot>	 !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Index rebuild
[13:58:49] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Index rebuild
[13:59:35] <wikibugs>	 (03PS1) 10Ayounsi: Also exclude Private-Peer from remote_instance:gnmi_bgp_neighbor_session_state [puppet] - 10https://gerrit.wikimedia.org/r/1124795 (https://phabricator.wikimedia.org/T387287)
[13:59:43] <wikibugs>	 (03PS1) 10Hnowlan: shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124796
[14:00:05] <jouncebot>	 Urbanecm and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400).
[14:00:05] <jouncebot>	 zip, dbrant, Daimona, Dreamy_Jazz, and abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:10] <zip>	 present
[14:00:10] <Dreamy_Jazz>	 \o
[14:00:11] <Daimona>	 o/
[14:00:13] <dbrant>	 o/
[14:01:00] <zip>	 Quick question, which server should I be using in the debugging extension to check my stuff?
[14:02:12] <wikibugs>	 (03PS1) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209)
[14:02:14] <Dreamy_Jazz>	 I think the `k8s-mwdebug` server would be good.
[14:02:19] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:02:33] <TheresNoTime>	 (do folx need a deployer or are you self-serving?)
[14:02:54] <Dreamy_Jazz>	 I can self-serve, but not sure if everyone on the schedule can self-serve
[14:03:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove access to logstash for cn=wmf [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff)
[14:03:25] <wikibugs>	 (03CR) 10Federico Ceratto: "dbctl cookbook, initial version" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto)
[14:03:26] <zip>	 I think I have requisite privs but also this is my first deploy, or at least, my first in so long I don't remember
[14:03:51] <Dreamy_Jazz>	 It probably makes sense to combine the config changes anyway into one backport
[14:05:27] <Dreamy_Jazz>	 I can start with your change zip. The task description at https://phabricator.wikimedia.org/T378834 doesn't say that the wikis are ready yet, but I see that the latest comment said they were
[14:05:54] <TheresNoTime>	 please ping if folx need anything, I am somewhat-around :)
[14:05:57] <zip>	 yup, my understanding is we are good to go
[14:06:00] * zip waves at TheresNoTime 
[14:06:06] <TheresNoTime>	 o/
[14:06:10] <zip>	 \o
[14:07:08] <Dreamy_Jazz>	 abijeet: You around for the window?
[14:07:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124480 (https://phabricator.wikimedia.org/T378834) (owner: 10Zoe)
[14:07:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124500 (owner: 10Dbrant)
[14:07:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: 10Daimona Eaytoy)
[14:07:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124549 (https://phabricator.wikimedia.org/T387943) (owner: 10Daimona Eaytoy)
[14:07:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124768 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz)
[14:08:16] <Dreamy_Jazz>	 Going to deploy all but abijeet's change in one go to make it quicker. I didn't see anything particularly risky in any of these changes, so shouldn't need to stop at the test stage.
[14:08:30] <wikibugs>	 (03Merged) 10jenkins-bot: Set Flow to read-only on remaining phase 2a wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124480 (https://phabricator.wikimedia.org/T378834) (owner: 10Zoe)
[14:08:34] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused config parameters from ReadingLists extension. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124500 (owner: 10Dbrant)
[14:08:37] <wikibugs>	 (03Merged) 10jenkins-bot: Use namespaced Title and Html classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: 10Daimona Eaytoy)
[14:08:39] <wikibugs>	 (03Merged) 10jenkins-bot: officewiki: Disable the event-organizer user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124549 (https://phabricator.wikimedia.org/T387943) (owner: 10Daimona Eaytoy)
[14:08:42] <wikibugs>	 (03Merged) 10jenkins-bot: Temporarily unset temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124768 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz)
[14:08:51] <zip>	 genuinely setting "Zoe)" as a highlight message was one of the better ideas I've had
[14:08:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto)
[14:09:14] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1124480|Set Flow to read-only on remaining phase 2a wikis (T378834)]], [[gerrit:1124500|Remove unused config parameters from ReadingLists extension.]], [[gerrit:1124548|Use namespaced Title and Html classes (T166010 T387938)]], [[gerrit:1124549|officewiki: Disable the event-organizer user group (T387943)]], [[gerrit:1124768|Temporarily unset temporary-account-viewer group (T387205)]]
[14:09:21] <stashbot>	 T378834: [Config] Set Flow to read-only at all *Phase 2a* wikis - https://phabricator.wikimedia.org/T378834
[14:09:21] <stashbot>	 T166010: The Great Namespaceization Effort - https://phabricator.wikimedia.org/T166010
[14:09:22] <stashbot>	 T387938: beta cluster  down - Internal error  - https://phabricator.wikimedia.org/T387938
[14:09:22] <stashbot>	 T387943: Disable the event-organizer group in officewiki - https://phabricator.wikimedia.org/T387943
[14:09:22] <stashbot>	 T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205
[14:09:31] <wikibugs>	 (03PS2) 10Filippo Giunchedi: sre: open tasks for long standing lint problems [alerts] - 10https://gerrit.wikimedia.org/r/1124790 (https://phabricator.wikimedia.org/T309182)
[14:09:32] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: route AlertLintProblem to the alert file team [alerts] - 10https://gerrit.wikimedia.org/r/1124800 (https://phabricator.wikimedia.org/T354762)
[14:09:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] cloudbackup: work around a postgresql bug by adjusting work_mem [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott)
[14:10:53] <Dreamy_Jazz>	 dbrant: Assuming there is nothing to test for your change?
[14:11:17] <dbrant>	 nope, and nothing's broken!
[14:11:52] <abijeet>	 Dreamy_Jazz, hey. I'm around
[14:12:05] <Dreamy_Jazz>	 Hi. I can get back to your change after I've finished this deploy
[14:12:12] <logmsgbot>	 !log dreamyjazz@deploy2002 daimona, zoe, dreamyjazz, dbrant: Backport for [[gerrit:1124480|Set Flow to read-only on remaining phase 2a wikis (T378834)]], [[gerrit:1124500|Remove unused config parameters from ReadingLists extension.]], [[gerrit:1124548|Use namespaced Title and Html classes (T166010 T387938)]], [[gerrit:1124549|officewiki: Disable the event-organizer user group (T387943)]], [[gerrit:1124768|Temporarily unset temporary-account-viewer group (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:12:13] <abijeet>	 Dreamy_Jazz, sounds good, thanks
[14:12:49] <Dreamy_Jazz>	 zip and Daimona: Please do any testing (if relevant)
[14:12:52] <zip>	 i'm seeing mediawikiwiki and cawiki Flow boards as read-only now, as expected
[14:13:00] <Daimona>	 Doing
[14:14:30] <sukhe>	 !log restart pybal on lvs2013
[14:14:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:14:53] <sukhe>	 !log restart pybal on lvs2014
[14:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:16:44] <Daimona>	 officewiki change looks good; prod didn't explode, so I assume the other change works fine too (I'm not sure how to test the "shitty enwiki hack")
[14:16:55] <Dreamy_Jazz>	 :D
[14:16:59] <logmsgbot>	 !log dreamyjazz@deploy2002 daimona, zoe, dreamyjazz, dbrant: Continuing with sync
[14:17:27] <Daimona>	 It's the first time I read this comment and now I want to know more :D https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/refs/changes/48/1124548/3/wmf-config/CommonSettings.php#2402
[14:20:58] <Dreamy_Jazz>	 My change technically didn't work, but I will be able to fix it in a follow-up. It doesn't break anything as it stands.
[14:23:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Also exclude Private-Peer from remote_instance:gnmi_bgp_neighbor_session_state [puppet] - 10https://gerrit.wikimedia.org/r/1124795 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi)
[14:23:41] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124480|Set Flow to read-only on remaining phase 2a wikis (T378834)]], [[gerrit:1124500|Remove unused config parameters from ReadingLists extension.]], [[gerrit:1124548|Use namespaced Title and Html classes (T166010 T387938)]], [[gerrit:1124549|officewiki: Disable the event-organizer user group (T387943)]], [[gerrit:1124768|Temporarily unset temporary-account-viewer group (T387205)]] (duration: 14m 26s)
[14:23:47] <stashbot>	 T378834: [Config] Set Flow to read-only at all *Phase 2a* wikis - https://phabricator.wikimedia.org/T378834
[14:23:47] <stashbot>	 T166010: The Great Namespaceization Effort - https://phabricator.wikimedia.org/T166010
[14:23:47] <stashbot>	 T387938: beta cluster  down - Internal error  - https://phabricator.wikimedia.org/T387938
[14:23:48] <stashbot>	 T387943: Disable the event-organizer group in officewiki - https://phabricator.wikimedia.org/T387943
[14:23:48] <stashbot>	 T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205
[14:24:13] <wikibugs>	 (03PS1) 10Btullis: Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719)
[14:24:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: 10Btullis)
[14:25:25] <logmsgbot>	 !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=cumin2002.codfw.wmnet
[14:26:00] <logmsgbot>	 !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-staging2003,service=ml-staging
[14:26:43] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:26:44] <zip>	 all done, then?
[14:26:49] <logmsgbot>	 !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-staging2003.codfw.wmnet,service=ml-staging
[14:26:52] <wikibugs>	 (03PS1) 10Dreamy Jazz: Unset unused IP reveal groups in $wgExtensionFunctions callbacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205)
[14:26:57] <Dreamy_Jazz>	 For your change yes
[14:27:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Unset unused IP reveal groups in $wgExtensionFunctions callbacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz)
[14:27:01] <zip>	 grand, thank you
[14:27:07] <Dreamy_Jazz>	 Need to do the last change in the window plus my followup
[14:27:15] <Dreamy_Jazz>	 Then can end the window
[14:27:28] <wikibugs>	 (03PS2) 10Dreamy Jazz: Unset unused IP reveal groups in $wgExtensionFunctions callbacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205)
[14:29:55] <logmsgbot>	 !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-staging2003.codfw.wmnet,service=ml_staging
[14:30:10] <stevemunene>	 !log draining and depooling dse-k8s-ctrl1001 ready for reimage to bookworm and containerd for T377875
[14:31:47] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[14:32:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: 10Abijeet Patro)
[14:33:01] <wikibugs>	 (03Merged) 10jenkins-bot: metawiki: Enable Chinese variant translation for message bundles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: 10Abijeet Patro)
[14:33:33] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1122632|metawiki: Enable Chinese variant translation for message bundles (T387230)]]
[14:33:37] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-ctrl1001.eqiad.wmnet with OS bookworm
[14:34:34] <wikibugs>	 (03CR) 10Herron: [C:03+1] profile: add restbase scrape jobs to profile::prometheus::services [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite)
[14:35:04] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: split envoy rules into separate groups [puppet] - 10https://gerrit.wikimedia.org/r/1124743 (https://phabricator.wikimedia.org/T387965) (owner: 10Filippo Giunchedi)
[14:35:58] <wikibugs>	 (03PS10) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904)
[14:36:08] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=ml-staging2003.codfw.wmnet
[14:36:10] <wikibugs>	 (03CR) 10Bking: cloudelastic: begin transition to opensearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking)
[14:36:29] <logmsgbot>	 !log dreamyjazz@deploy2002 abi, dreamyjazz: Backport for [[gerrit:1122632|metawiki: Enable Chinese variant translation for message bundles (T387230)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:36:37] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: trial moving k8s-mlstaging to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:26] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: add sync-data script [puppet] - 10https://gerrit.wikimedia.org/r/1124749 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[14:38:06] <wikibugs>	 (03CR) 10Herron: [C:03+1] prometheus: replace prometheus::migration with prometheus-sync-data [puppet] - 10https://gerrit.wikimedia.org/r/1124751 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[14:38:09] <_joe_>	 Dreamy_Jazz: when the deployments for the windows are over, please ping me, I have one tiny patch to deploy
[14:38:21] <wikibugs>	 (03PS2) 10Btullis: Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719)
[14:38:23] <wikibugs>	 (03PS3) 10Dreamy Jazz: Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205)
[14:38:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz)
[14:38:37] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: analytics_cluster::datahub::opensearch@eqiad
[14:38:37] <Dreamy_Jazz>	 abijeet: Are you testing your change?
[14:38:42] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,analytics_cluster: Enable IPIP on datahubsearch@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124064 (https://phabricator.wikimedia.org/T387306) (owner: 10Vgutierrez)
[14:38:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: 10Btullis)
[14:39:01] <Dreamy_Jazz>	 Just realised it didn't mention your specific IRC username so you might not have been pinged
[14:39:02] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2046
[14:39:16] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2046
[14:39:18] <Dreamy_Jazz>	 I'll ping you when done.
[14:39:29] <Dreamy_Jazz>	 I also think Amir will want to deploy something too after the window
[14:39:47] <wikibugs>	 (03PS4) 10Dreamy Jazz: Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205)
[14:40:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005 (10cmooney) 03NEW p:05Triage→03Medium
[14:40:47] <wikibugs>	 (03PS5) 10Dreamy Jazz: Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205)
[14:40:52] <wikibugs>	 (03CR) 10Ottomata: [C:03+1] "Oh! Thank you!  I did a codesearch for stuff like this but I guess missed this!" [alerts] - 10https://gerrit.wikimedia.org/r/1124787 (https://phabricator.wikimedia.org/T238230) (owner: 10Filippo Giunchedi)
[14:41:03] <wikibugs>	 (03CR) 10Bking: [C:03+2] cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking)
[14:41:16] <abijeet>	 Dreamy_Jazz, on it
[14:41:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] data-engineering: remove legacy eventlogging alerts [alerts] - 10https://gerrit.wikimedia.org/r/1124787 (https://phabricator.wikimedia.org/T238230) (owner: 10Filippo Giunchedi)
[14:41:30] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Migrate es backups from backup[12]02 to backup[12]13 [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892)
[14:41:46] <wikibugs>	 (03PS2) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209)
[14:42:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: split envoy rules into separate groups [puppet] - 10https://gerrit.wikimedia.org/r/1124743 (https://phabricator.wikimedia.org/T387965) (owner: 10Filippo Giunchedi)
[14:42:47] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.dbctl
[14:42:47] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.dbctl (exit_code=0)
[14:43:09] <_joe_>	 Dreamy_Jazz: thank you <3
[14:43:10] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.dbctl
[14:43:10] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.dbctl (exit_code=2)
[14:43:39] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[14:43:40] <_joe_>	 Amir1: we can do a double-deploy in one go, if you want. My patch is specifically for noc.wikimedia.org 
[14:43:47] <wikibugs>	 (03CR) 10Herron: [C:03+1] "Nice idea!  I'm assuming the expr is commented since the next patch will update that to bring in team parsing, lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/1124790 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[14:44:04] <wikibugs>	 (03CR) 10Herron: [C:03+1] "Nice! an improvement for sure" [alerts] - 10https://gerrit.wikimedia.org/r/1124800 (https://phabricator.wikimedia.org/T354762) (owner: 10Filippo Giunchedi)
[14:44:18] <wikibugs>	 (03PS1) 10Dreamy Jazz: Unset 'push-subscription-manager' group using hook callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124811 (https://phabricator.wikimedia.org/T275334)
[14:44:22] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2046
[14:44:47] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[14:44:47] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: analytics_cluster::datahub::opensearch@eqiad
[14:44:48] <abijeet>	 Dreamy_Jazz, looks good.
[14:44:48] <wikibugs>	 (03Abandoned) 10Dreamy Jazz: Unset 'push-subscription-manager' group using hook callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124811 (https://phabricator.wikimedia.org/T275334) (owner: 10Dreamy Jazz)
[14:44:56] <Dreamy_Jazz>	 Thanks. Proceeding
[14:45:10] <logmsgbot>	 !log dreamyjazz@deploy2002 abi, dreamyjazz: Continuing with sync
[14:45:19] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2046
[14:45:45] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.dbctl
[14:45:47] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.dbctl (exit_code=1)
[14:45:50] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] "Will deploy this shortly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123499 (https://phabricator.wikimedia.org/T275336) (owner: 10Pppery)
[14:46:38] <abijeet>	 Dreamy_Jazz, thank you!
[14:46:44] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz)
[14:46:52] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+2] Use MediaWikiServices hook for push-subscription-manager changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123499 (https://phabricator.wikimedia.org/T275336) (owner: 10Pppery)
[14:47:26] <wikibugs>	 (03PS3) 10Btullis: Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719)
[14:47:30] <wikibugs>	 (03CR) 10Ssingh: Release dnsdist 1.9.8-1+wmf12u1 (031 comment) [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh)
[14:47:36] <wikibugs>	 (03Merged) 10jenkins-bot: Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz)
[14:47:38] <wikibugs>	 (03Merged) 10jenkins-bot: Use MediaWikiServices hook for push-subscription-manager changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123499 (https://phabricator.wikimedia.org/T275336) (owner: 10Pppery)
[14:47:43] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:48:01] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[14:48:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto)
[14:48:58] <wikibugs>	 (03PS6) 10Ssingh: Release dnsdist 1.9.8-1~wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607
[14:49:15] <wikibugs>	 (03CR) 10Ssingh: Release dnsdist 1.9.8-1~wmf12u1 (031 comment) [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh)
[14:51:22] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl1001.eqiad.wmnet with reason: host reimage
[14:51:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1007* for ban host prior to reimage - bking@cumin2002 - T387904
[14:51:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh)
[14:51:36] <stashbot>	 T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904
[14:51:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1007* for ban host prior to reimage - bking@cumin2002 - T387904
[14:52:02] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122632|metawiki: Enable Chinese variant translation for message bundles (T387230)]] (duration: 18m 29s)
[14:52:05] <stashbot>	 T387230: Mandarin Translation Issue (zh-hans, zh-hant are not seprated handle properly) in WikiLearn - https://phabricator.wikimedia.org/T387230
[14:52:07] <wikibugs>	 (03PS3) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209)
[14:52:16] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.dbctl
[14:52:28] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.dbctl (exit_code=99)
[14:53:02] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1123499|Use MediaWikiServices hook for push-subscription-manager changes (T275336)]], [[gerrit:1124805|Unset unused IP reveal groups in properly (T387205)]]
[14:53:07] <stashbot>	 T275336: push-subscription-manager group is sometimes available at all wikis - https://phabricator.wikimedia.org/T275336
[14:53:07] <stashbot>	 T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205
[14:54:02] <Amir1>	 _joe_: already eating. Will do it later. Thanks for the offer!
[14:54:19] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:54:29] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10605653 (10herron)
[14:54:33] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Migrate es backups from backup[12]02 to backup[12]13 [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[14:54:33] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:54:53] <wikibugs>	 (03CR) 10Federico Ceratto: "I was using it as a testbed for incremental tests of the new cloning script." [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[14:55:11] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124796 (owner: 10Hnowlan)
[14:55:55] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz, pppery: Backport for [[gerrit:1123499|Use MediaWikiServices hook for push-subscription-manager changes (T275336)]], [[gerrit:1124805|Unset unused IP reveal groups in properly (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:56:51] <wikibugs>	 (03PS2) 10Jforrester: wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124509
[14:56:52] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-02-24-145135 to 2025-03-05-140259 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124817
[14:56:52] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-02-25-210518 to 2025-03-05-140247 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124818
[14:57:24] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz, pppery: Continuing with sync
[14:57:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch-disable-readahead.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:57:33] <wikibugs>	 (03CR) 10Marostegui: [C:04-1] "then it also needs to be moved to s7 in site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[14:58:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto)
[14:59:23] <James_F>	 I guess the backport window is going to go over, as ever? :-)
[14:59:39] <Dreamy_Jazz>	 :D
[14:59:46] <Dreamy_Jazz>	 jouncebot: nowandnext
[14:59:46] <jouncebot>	 For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400)
[14:59:46] <jouncebot>	 In 0 hour(s) and 0 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1500)
[15:00:01] <Dreamy_Jazz>	 The changes from Amir and joe were not technically in the window
[15:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1500)
[15:00:08] <Dreamy_Jazz>	 I might be done in the next minute or two.
[15:00:11] <James_F>	 And yet.
[15:00:35] <Dreamy_Jazz>	 Maybe the window needs to be 24 hours long :D
[15:01:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:04:08] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123499|Use MediaWikiServices hook for push-subscription-manager changes (T275336)]], [[gerrit:1124805|Unset unused IP reveal groups in properly (T387205)]] (duration: 11m 05s)
[15:04:11] <Dreamy_Jazz>	 _joe_: I'm now done with deploying, though may be good to coordinate with others just to check if you can deploy in this window
[15:04:12] <stashbot>	 T275336: push-subscription-manager group is sometimes available at all wikis - https://phabricator.wikimedia.org/T275336
[15:04:13] <stashbot>	 T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205
[15:04:33] <James_F>	 We're just deploying a service bump.
[15:05:07] <wikibugs>	 06SRE, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q3): Ops-monitoring-bot creating duplicate tasks for the same RAID failure - https://phabricator.wikimedia.org/T387754#10605718 (10fgiunchedi)
[15:05:59] <wikibugs>	 (03CR) 10Ecarg: [C:03+2] wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124509 (owner: 10Jforrester)
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:15] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply
[15:07:26] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply
[15:07:35] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124509 (owner: 10Jforrester)
[15:08:42] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet with OS bookworm
[15:08:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[15:09:07] <wikibugs>	 (03PS14) 10Clément Goubert: mediawiki::periodic_job: Split periodic job definition [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869)
[15:09:14] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[15:09:17] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/services/mw-debug: apply
[15:09:40] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudvirt: increase conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/1124821 (https://phabricator.wikimedia.org/T387179)
[15:09:52] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:10:49] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:11:28] <moritzm>	 !log installing openssh security updates
[15:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:30] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: 10Btullis)
[15:11:36] <wikibugs>	 (03PS4) 10Ebernhardson: flink-app chart: Support per-chart logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546
[15:11:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:11:50] <wikibugs>	 06SRE, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): etcd: adapt etcd-backup.py for etcd 3.4 - https://phabricator.wikimedia.org/T385727#10605753 (10fgiunchedi)
[15:12:04] <wikibugs>	 (03CR) 10Ebernhardson: "ahh, indeed. Done." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 (owner: 10Ebernhardson)
[15:12:06] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "No objections on my end!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124752 (owner: 10Effie Mouzeli)
[15:12:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: 10Btullis)
[15:13:24] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Release dnsdist 1.9.8-1~wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh)
[15:13:26] <_joe_>	 Dreamy_Jazz: yeah and I'm in multiple meetings in a row at this point, heh, I'll piggyback Amir later :)
[15:13:27] <wikibugs>	 (03CR) 10Kamila Součková: "@hnowlan@wikimedia.org could you please review the helm bits? thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková)
[15:13:41] <wikibugs>	 (03CR) 10Kamila Součková: "@hnowlan@wikimedia.org could you please review the helm bits? thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková)
[15:14:59] <wikibugs>	 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10605768 (10MatthewVernon) >>! In T377827#10591134, @Ladsgroup wrote: > These are eqiad hosts which I haven't been deleting the thumbna...
[15:16:35] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:17:13] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:17:29] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:18:07] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) (owner: 10Clément Goubert)
[15:18:21] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:19:25] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/services/mw-debug: apply
[15:19:31] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: haproxy: check ingress workers with the /healthz endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1124756 (https://phabricator.wikimedia.org/T387959) (owner: 10Arturo Borrero Gonzalez)
[15:20:18] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: 10Btullis)
[15:20:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10605808 (10cmooney) >>! In T384838#10603754, @Jhancock.wm wrote: > @Papaul i found a weird little thing. I racked ganeti2049 in B5, U40. There are three other serve...
[15:21:06] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/services/mw-debug: apply
[15:23:06] <wikibugs>	 (03CR) 10Ecarg: [C:03+2] wikifunctions: Update evaluators from 2025-02-24-145135 to 2025-03-05-140259 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124817 (owner: 10Jforrester)
[15:23:08] <logmsgbot>	 !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/services/mw-debug: apply
[15:24:09] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10605822 (10cmooney) @Jhancock.wm one thing to make sure is all ganeti hosts are added to **row-wide** vlans.  So in the [[ https://netbox.wikimedia.org/extras/scrip...
[15:24:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[15:24:32] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-02-24-145135 to 2025-03-05-140259 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124817 (owner: 10Jforrester)
[15:24:33] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: 10Arlolra)
[15:26:05] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-ctrl1002.eqiad.wmnet with OS bookworm
[15:26:10] <icinga-wm>	 RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 717 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:26:14] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:26:22] <icinga-wm>	 RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 11 May 2025 11:48:24 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:27:39] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[15:27:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:27:55] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005#10605846 (10cmooney)
[15:28:02] <wikibugs>	 (03PS5) 10Herron: KubernetesRsyslogDown: alert only if logs were sent before [alerts] - 10https://gerrit.wikimedia.org/r/1124453 (https://phabricator.wikimedia.org/T381417)
[15:28:22] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:30:20] <sukhe>	 !log upload dnsdist 1.9.8-1~wmf12u1 to apt.wm.org for bookworm
[15:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:41] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:31:40] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:32:00] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:32:55] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:33:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[15:34:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10605869 (10phaultfinder)
[15:34:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic
[15:35:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic
[15:35:45] <wikibugs>	 (03CR) 10Ecarg: [C:03+2] wikifunctions: Update orchestrator from 2025-02-25-210518 to 2025-03-05-140247 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124818 (owner: 10Jforrester)
[15:37:14] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-02-25-210518 to 2025-03-05-140247 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124818 (owner: 10Jforrester)
[15:37:56] <logmsgbot>	 !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[15:38:21] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:38:33] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:38:47] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:39:08] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:40:35] <jynus>	 !log starting es backups on new hosts backup1013, backup2013  T387892
[15:40:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:38] <stashbot>	 T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892
[15:41:37] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl1002.eqiad.wmnet with reason: host reimage
[15:41:40] <wikibugs>	 (03PS3) 10JMeybohm: Add pod-security.wmg.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507)
[15:41:40] <wikibugs>	 (03PS1) 10JMeybohm: admin_ng: Disable hostPath and capabilities baseline rules for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124830 (https://phabricator.wikimedia.org/T273507)
[15:41:40] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: haproxy: don't use TLS on the HTTP check for k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124829 (https://phabricator.wikimedia.org/T387959)
[15:41:47] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:42:00] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:42:23] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:42:51] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:43:23] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:45:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add component/lshw on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1124763 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff)
[15:46:23] <wikibugs>	 (03CR) 10David Caro: [C:03+1] toolforge: haproxy: don't use TLS on the HTTP check for k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124829 (https://phabricator.wikimedia.org/T387959) (owner: 10Arturo Borrero Gonzalez)
[15:47:53] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: haproxy: don't use TLS on the HTTP check for k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124829 (https://phabricator.wikimedia.org/T387959) (owner: 10Arturo Borrero Gonzalez)
[15:48:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1007* for ban host prior to reimage - bking@cumin2002 - T387904
[15:48:36] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: cloudelastic1007* for ban host prior to reimage - bking@cumin2002 - T387904
[15:48:41] <stashbot>	 T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904
[15:50:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[15:52:04] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10605945 (10Jclark-ctr) @MatthewVernon  can this drive be replaced at any time it is arriving tonight/tomorrow morning?
[15:53:55] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10605951 (10MatthewVernon) @Jclark-ctr yes, please go ahead :)  [I intend that to be clear from "you can work on this system at any time without further input from me." in the ticke...
[15:54:33] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "\o/ yay!" [puppet] - 10https://gerrit.wikimedia.org/r/1124733 (https://phabricator.wikimedia.org/T353457) (owner: 10Filippo Giunchedi)
[15:59:33] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828#10605987 (10VRiley-WMF) Looks like the SFP failed. Swapped it out and it looks like it's communicating...
[16:00:05] <jouncebot>	 tgr: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) SUL deploy window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1600).
[16:00:07] <logmsgbot>	 !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet with OS bookworm
[16:00:30] <Dreamy_Jazz>	 Yet another window for a custom window :D
[16:00:43] <Dreamy_Jazz>	 jouncebot has the funniest messages
[16:01:59] <wikibugs>	 (03PS1) 10JMeybohm: staging-codfw: Unset image.tag for coredns to apply the default version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124831 (https://phabricator.wikimedia.org/T384450)
[16:02:01] <wikibugs>	 (03PS1) 10JMeybohm: admin_ng: Update dependencies between releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124832 (https://phabricator.wikimedia.org/T341984)
[16:03:27] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): statistics::wmde::graphite: add syslog_identifier [puppet] - 10https://gerrit.wikimedia.org/r/1124833 (https://phabricator.wikimedia.org/T387514)
[16:03:50] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10606045 (10Krinkle)
[16:05:10] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Great! Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1124833 (https://phabricator.wikimedia.org/T387514) (owner: 10Lucas Werkmeister (WMDE))
[16:05:17] <wikibugs>	 (03CR) 10Btullis: [C:03+2] statistics::wmde::graphite: add syslog_identifier [puppet] - 10https://gerrit.wikimedia.org/r/1124833 (https://phabricator.wikimedia.org/T387514) (owner: 10Lucas Werkmeister (WMDE))
[16:05:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[16:06:37] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] admin_ng: Disable hostPath and capabilities baseline rules for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124830 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[16:06:49] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124733 (https://phabricator.wikimedia.org/T353457) (owner: 10Filippo Giunchedi)
[16:07:06] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Add additional m1 grants for backup[12]013 stats user [puppet] - 10https://gerrit.wikimedia.org/r/1124834 (https://phabricator.wikimedia.org/T387892)
[16:07:11] <icinga-wm>	 RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 717 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:07:23] <icinga-wm>	 RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 11 May 2025 11:48:24 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:07:24] <wikibugs>	 (03PS4) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209)
[16:08:08] <wikibugs>	 (03PS14) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231)
[16:10:04] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1124800 (https://phabricator.wikimedia.org/T354762) (owner: 10Filippo Giunchedi)
[16:10:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[16:11:10] <wikibugs>	 (03CR) 10Jcrespo: "I plan to deploy this tomorrow (I missed it during setup today)." [puppet] - 10https://gerrit.wikimedia.org/r/1124834 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo)
[16:11:21] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Disable hostPath and capabilities baseline rules for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124830 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[16:11:48] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254 [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[16:12:50] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[16:13:39] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1124790 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[16:14:19] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124751 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[16:15:09] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124749 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[16:15:40] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[16:15:58] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] hiera,wcqs: Enable IPIP on wcqs@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123663 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez)
[16:16:22] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite)
[16:16:50] <wikibugs>	 (03CR) 10Ebrahim: "If it is possible, please also add the brand new table to the dumps and especially the replicas, similar to what is done for globalimagelinks https://githu" [puppet] - 10https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) (owner: 10Bvibber)
[16:17:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2046-48 to codfw - jhancock@cumin2002"
[16:17:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124781 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza)
[16:17:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124782 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza)
[16:17:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124783 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza)
[16:17:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124784 (owner: 10Gergő Tisza)
[16:17:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124785 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza)
[16:17:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124786 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza)
[16:18:41] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): snapshot: add syslog_identifier to Wikibase dumps [puppet] - 10https://gerrit.wikimedia.org/r/1124835 (https://phabricator.wikimedia.org/T387514)
[16:19:01] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Note: I haven’t looked at the rsyslog configurations in detail and am not very sure that this is 100% correct…" [puppet] - 10https://gerrit.wikimedia.org/r/1124835 (https://phabricator.wikimedia.org/T387514) (owner: 10Lucas Werkmeister (WMDE))
[16:19:06] <wikibugs>	 (03Merged) 10jenkins-bot: CentralAuthIdLookup: Reuse cached object on single-value lookup [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124781 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza)
[16:19:08] <wikibugs>	 (03Merged) 10jenkins-bot: CentralAuthIdLookup: Use primary DB after writes [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124782 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza)
[16:19:11] <wikibugs>	 (03Merged) 10jenkins-bot: Use UserOptionsManager for SUL3 rollout flag [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124783 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza)
[16:19:12] <wikibugs>	 (03Merged) 10jenkins-bot: Make SUL3 global preference optional and simplify logic [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124784 (owner: 10Gergő Tisza)
[16:19:13] <wikibugs>	 (03Merged) 10jenkins-bot: Add passive central domain to edge login list [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124785 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza)
[16:19:14] <wikibugs>	 (03Merged) 10jenkins-bot: SUL3: Use a central wiki for autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124786 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza)
[16:19:24] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2046-48 to codfw - jhancock@cumin2002"
[16:19:24] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:19:27] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2045
[16:19:27] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10606183 (10Jdlrobson-WMF) >>! In T214998#10600094, @Peter wrote: > I've been looking into the data we get...
[16:19:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2046
[16:19:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047
[16:19:40] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2045
[16:19:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2046
[16:19:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047
[16:19:50] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124781|CentralAuthIdLookup: Reuse cached object on single-value lookup (T379909 T380500 T387106)]], [[gerrit:1124782|CentralAuthIdLookup: Use primary DB after writes (T379909 T380500)]], [[gerrit:1124783|Use UserOptionsManager for SUL3 rollout flag (T384549)]], [[gerrit:1124784|Make SUL3 global preference optional and simplify logic]], [[gerrit:1124785|Add passive central domain to edge login list (T375796)]], [[gerrit:1124786|SUL3: Use a central wiki for autologin (T387357)]]
[16:19:58] <stashbot>	 T379909: Define where to add code that needs to run after a new central user has been created - https://phabricator.wikimedia.org/T379909
[16:19:58] <stashbot>	 T380500: CentralAuthUser returning outdated data after user creation - https://phabricator.wikimedia.org/T380500
[16:19:58] <stashbot>	 T387106: CentralAuthIdLookup should use a cache - https://phabricator.wikimedia.org/T387106
[16:19:59] <stashbot>	 T384549: Create a per-user flag for enabling SUL3 - https://phabricator.wikimedia.org/T384549
[16:19:59] <stashbot>	 T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796
[16:19:59] <stashbot>	 T387357: SUL3 signup results in autocreation on all edge login domains - https://phabricator.wikimedia.org/T387357
[16:20:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2048
[16:20:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2048
[16:20:56] <tgr_>	 wow those merges were fast. Did we finally stop running cross-repo Selenium tests for backports?
[16:22:50] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1124781|CentralAuthIdLookup: Reuse cached object on single-value lookup (T379909 T380500 T387106)]], [[gerrit:1124782|CentralAuthIdLookup: Use primary DB after writes (T379909 T380500)]], [[gerrit:1124783|Use UserOptionsManager for SUL3 rollout flag (T384549)]], [[gerrit:1124784|Make SUL3 global preference optional and simplify logic]], [[gerrit:1124785|Add passive central domain to edge login list (T375796)]], [[gerrit:1124786|SUL3: Use a central wiki for autologin (T387357)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:22:56] <James_F>	 Yes, I dropped that a few weeks ago.
[16:22:56] <wikibugs>	 (03CR) 10Btullis: [C:03+2] snapshot: add syslog_identifier to Wikibase dumps [puppet] - 10https://gerrit.wikimedia.org/r/1124835 (https://phabricator.wikimedia.org/T387514) (owner: 10Lucas Werkmeister (WMDE))
[16:23:12] <James_F>	 And on Monday I switched the wmf-quibble jobs from 7.4 to 8.1 which should speed things up a little.
[16:23:28] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez)
[16:23:46] <tgr_>	 thanks for that! it's neat to have extension backports merge in <2 min
[16:23:56] <James_F>	 Indeed, back to the good old days.
[16:24:15] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-internal-main@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123684 (https://phabricator.wikimedia.org/T387319) (owner: 10Vgutierrez)
[16:24:19] <James_F>	 But the main speed up is the re-use of existing cached job outputs that RelEng landed last week.
[16:24:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage
[16:26:17] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wcqs::public@codfw
[16:26:27] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254 [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[16:26:29] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254 [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto)
[16:26:45] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,wcqs: Enable IPIP on wcqs@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123663 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez)
[16:26:58] <tgr_>	 Huh. "WikimediaDebug is disabled. To re-enable it, accept the new permissions: Block content on any page."
[16:27:22] <tgr_>	 I guess we aren't the only ones who have a hard time making our permission system intuitive.
[16:28:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage
[16:29:57] <claime>	 tgr_: yeah that's being worked on afaik
[16:30:10] <Lucas_WMDE>	 indeed, T387822
[16:30:10] <stashbot>	 T387822: WikimediaDebug Firefox extension requires permission to block content on any page - https://phabricator.wikimedia.org/T387822
[16:30:14] <claime>	 https://phabricator.wikimedia.org/T387899
[16:30:40] <claime>	 Ah, I had the "parent" handy :p
[16:30:46] <Lucas_WMDE>	 ^^
[16:32:20] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[16:33:17] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs
[16:33:17] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wcqs::public@codfw
[16:33:34] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] deployment server: Don't pass -Dfull_image_build:True to scap stage-train [puppet] - 10https://gerrit.wikimedia.org/r/1124462 (https://phabricator.wikimedia.org/T387823) (owner: 10Ahmon Dancy)
[16:34:45] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wcqs::public@eqiad
[16:34:59] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,wcqs: Enable IPIP on wcqs@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123664 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez)
[16:35:08] <wikibugs>	 (03PS2) 10Vgutierrez: hiera,wcqs: Enable IPIP on wcqs@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123664 (https://phabricator.wikimedia.org/T387313)
[16:38:09] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,wcqs: Enable IPIP on wcqs@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123664 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez)
[16:39:40] <logmsgbot>	 !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for role: wcqs::public@eqiad
[16:39:53] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wcqs::public@eqiad
[16:40:15] <rzl>	 tgr_: re https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123029, I think sukhe offered to take a look
[16:41:28] <sukhe>	 rzl: I am in a meeting right now but I wasn't aware of this so can look when I am done
[16:41:42] <sukhe>	 vgutierrez ^
[16:41:47] <sukhe>	 can you take a look please?
[16:43:23] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user ori - https://phabricator.wikimedia.org/T388029 (10acooper) 03NEW
[16:43:25] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030 (10acooper) 03NEW
[16:43:44] <vgutierrez>	 sukhe: sure
[16:44:13] <sukhe>	 ,3
[16:44:13] <sukhe>	 <3
[16:44:39] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[16:45:36] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:45:42] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:45:43] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[16:45:44] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wcqs::public@eqiad
[16:47:07] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user jdcc - https://phabricator.wikimedia.org/T388029#10606414 (10acooper)
[16:47:09] <wikibugs>	 07Puppet, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032 (10Jdlrobson-WMF) 03NEW
[16:48:01] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[16:48:25] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] Update CentralAuth multi-DC rules for SUL3, attempt 2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza)
[16:48:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034 (10acooper) 03NEW
[16:48:47] <vgutierrez>	 sukhe, tgr_, rzl: it looks good to me
[16:48:50] <tgr_>	 thanks sukhe, vgutierrez! It's not particularly urgent, just trying to figure out the next step
[16:50:19] <sukhe>	 vgutierrez: :*
[16:50:23] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[16:50:29] <wikibugs>	 (03PS2) 10Sergio Gimeno: [Growth] Set default api lookahead size to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990)
[16:50:30] <sukhe>	 tgr_: no worries, feel free to add one of me or vgutierrez to such patches
[16:50:44] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:50:47] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) (owner: 10Sergio Gimeno)
[16:50:50] <vgutierrez>	 or both for extra TZ coverage :D
[16:50:52] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:51:10] <wikibugs>	 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10606475 (10Jdlrobson-WMF)
[16:53:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606477 (10acooper) Update - confirming with staff members whether this access is still required as they may still be actively doing volunteer work, will confirm back, so pause thi...
[16:53:04] <wikibugs>	 (03PS1) 10Kamila Součková: prometheus: charmuseum relabel config [puppet] - 10https://gerrit.wikimedia.org/r/1124843 (https://phabricator.wikimedia.org/T386808)
[16:53:17] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:53:43] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:53:52] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[16:54:08] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606486 (10acooper) a:03odimitrijevic
[16:54:45] <wikibugs>	 (03CR) 10Gergő Tisza: Update CentralAuth multi-DC rules for SUL3, attempt 2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza)
[16:54:47] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124781|CentralAuthIdLookup: Reuse cached object on single-value lookup (T379909 T380500 T387106)]], [[gerrit:1124782|CentralAuthIdLookup: Use primary DB after writes (T379909 T380500)]], [[gerrit:1124783|Use UserOptionsManager for SUL3 rollout flag (T384549)]], [[gerrit:1124784|Make SUL3 global preference optional and simplify logic]], [[gerrit:1124785|Add passive central domain to edge login list (T375796)]], [[gerrit:1124786|SUL3: Use a central wiki for autologin (T387357)]] (duration: 34m 57s)
[16:54:54] <stashbot>	 T379909: Define where to add code that needs to run after a new central user has been created - https://phabricator.wikimedia.org/T379909
[16:54:55] <stashbot>	 T380500: CentralAuthUser returning outdated data after user creation - https://phabricator.wikimedia.org/T380500
[16:54:55] <stashbot>	 T387106: CentralAuthIdLookup should use a cache - https://phabricator.wikimedia.org/T387106
[16:54:55] <stashbot>	 T384549: Create a per-user flag for enabling SUL3 - https://phabricator.wikimedia.org/T384549
[16:54:56] <stashbot>	 T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796
[16:54:56] <stashbot>	 T387357: SUL3 signup results in autocreation on all edge login domains - https://phabricator.wikimedia.org/T387357
[16:55:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124757 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza)
[16:56:05] <wikibugs>	 (03Merged) 10jenkins-bot: CentralAuth: Enable SUL3 signup on group 0 (attempt 4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124757 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza)
[16:56:26] <wikibugs>	 (03PS4) 10Gergő Tisza: Update CentralAuth multi-DC rules for SUL3, attempt 2 [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695)
[16:56:35] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124757|CentralAuth: Enable SUL3 signup on group 0 (attempt 4) (T384007)]]
[16:56:37] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007
[16:56:48] <wikibugs>	 (03CR) 10Vgutierrez: Update CentralAuth multi-DC rules for SUL3, attempt 2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza)
[16:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:59:34] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1124757|CentralAuth: Enable SUL3 signup on group 0 (attempt 4) (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:00:50] <wikibugs>	 (03CR) 10Gergő Tisza: Update CentralAuth multi-DC rules for SUL3, attempt 2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza)
[17:01:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:03:33] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] "This duplicates Ic9f792e9749e299ff8257474a2c73ca549e3f4e7, but this has more explanation in the commit message. If we deploy this one inst" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124745 (https://phabricator.wikimedia.org/T380441) (owner: 10Máté Szabó)
[17:04:13] <wikibugs>	 (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite)
[17:05:09] <wikibugs>	 (03PS1) 10Volans: sre.hosts.provision: disable HostHeaderCheck [cookbooks] - 10https://gerrit.wikimedia.org/r/1124845 (https://phabricator.wikimedia.org/T382416)
[17:05:58] <Amir1>	 jouncebot: nowandnext
[17:05:59] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 54 minute(s)
[17:05:59] <jouncebot>	 In 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1800)
[17:06:13] <Amir1>	 tgr_: Hi, let me know once you're fully done
[17:09:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10606534 (10phaultfinder)
[17:09:51] <wikibugs>	 (03PS3) 10Cwhite: profile: add restbase scrape jobs to profile::prometheus::services [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343)
[17:10:00] <wikibugs>	 (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite)
[17:12:00] <wikibugs>	 06SRE, 10SRE-Access-Requests, 05WMF-NDA: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606547 (10acooper)
[17:12:06] <wikibugs>	 06SRE, 10SRE-Access-Requests, 05WMF-NDA: Remove production data access for NDA expired user jdcc - https://phabricator.wikimedia.org/T388029#10606549 (10acooper)
[17:12:16] <wikibugs>	 06SRE, 10SRE-Access-Requests, 05WMF-NDA: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10606550 (10acooper)
[17:12:38] <wikibugs>	 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10606551 (10bwang) p:05Triage→03High
[17:12:40] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10606552 (10acooper)
[17:12:44] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user jdcc - https://phabricator.wikimedia.org/T388029#10606553 (10acooper)
[17:12:51] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606555 (10acooper)
[17:14:09] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[17:17:45] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user jdcc - https://phabricator.wikimedia.org/T388029#10606599 (10acooper) a:03MoritzMuehlenhoff
[17:17:53] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10606600 (10acooper) a:03MoritzMuehlenhoff
[17:20:48] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124757|CentralAuth: Enable SUL3 signup on group 0 (attempt 4) (T384007)]] (duration: 24m 13s)
[17:20:52] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007
[17:21:05] <tgr_>	 Amir1: done
[17:21:54] <Amir1>	 thanks
[17:25:42] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Enable thumbnail steps in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124759 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[17:26:03] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124759 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[17:27:32] <wikibugs>	 (03Merged) 10jenkins-bot: Enable thumbnail steps in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124759 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[17:28:02] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1124759|Enable thumbnail steps in testwiki (T360589)]]
[17:28:05] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[17:29:28] <wikibugs>	 (03PS1) 10Scott French: mw-(api-ext|web): right-size given current traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845)
[17:29:30] <wikibugs>	 (03PS2) 10Scott French: mw-(api-ext|web): serve 5% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845)
[17:31:01] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1124759|Enable thumbnail steps in testwiki (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:31:48] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:31:58] <wikibugs>	 (03CR) 10Scott French: "This is part one of two patches that together (1) right-size mw-web and mw-api-ext for the current migration state (2) clarify multi-DC se" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[17:32:21] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[17:34:22] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[17:34:28] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124796 (owner: 10Hnowlan)
[17:35:42] <wikibugs>	 (03PS1) 10Michael Große: Growth: remove unused config wgGENewcomerTasksOresTopicConfigTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124836
[17:36:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124836 (owner: 10Michael Große)
[17:36:33] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10606672 (10KFrancis) Hi @Ben.buchenau, please confirm this is your correct name and I will put the NDA agreement together.  Thanks!
[17:36:35] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124796 (owner: 10Hnowlan)
[17:38:16] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] profile: add restbase scrape jobs to profile::prometheus::services [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite)
[17:40:39] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005#10606698 (10Jhancock.wm) 47 and 48 are not live. new machines. so i can redo those
[17:41:07] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124759|Enable thumbnail steps in testwiki (T360589)]] (duration: 13m 04s)
[17:41:10] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[17:47:42] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:50:19] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[17:50:23] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[17:50:39] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply
[17:50:41] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply
[17:50:45] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply
[17:50:55] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply
[17:53:15] <wikibugs>	 (03PS2) 10Bernard Wang: Enable Search AB test for en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510
[17:56:05] <wikibugs>	 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606809 (10Aklapper) @acooper: Hmm, where exactly does that information come from? https://phabricator.wikimedia.org/p/aude/ implies that they are //currently// a contractor (or ve...
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1800)
[18:03:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:03:29] <wikibugs>	 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10606837 (10Jdlrobson-WMF)
[18:04:01] <wikibugs>	 (03PS1) 10Fabfur: acme_chief: add parameter for destination path [puppet] - 10https://gerrit.wikimedia.org/r/1124855 (https://phabricator.wikimedia.org/T387929)
[18:09:08] <wikibugs>	 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828#10606881 (10VRiley-WMF) 05Open→03Resolved Confirmed that this unit came back online
[18:10:29] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124855 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur)
[18:10:42] <wikibugs>	 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10606893 (10Ladsgroup) The deletions will be quite slow and on top of that, we are introducing the thumbnail steps and bumping the defa...
[18:12:45] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mw-(api-ext|web): right-size given current traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[18:13:40] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "lgtm, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[18:16:00] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for jhuneidi - https://phabricator.wikimedia.org/T388044 (10thcipriani) 03NEW
[18:18:34] <wikibugs>	 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10606946 (10Krinkle) I believe it would be a mistake to hardcode `MiuiBrowser` as a mobile browser, as this would break the browser UI and the end-users...
[18:20:23] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005#10606953 (10cmooney) >>! In T388005#10606698, @Jhancock.wm wrote: > 47 and 48 are not live. new machines. so i can redo those  Sorry I was being dumb, ganeti203...
[18:20:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[18:20:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005#10606957 (10cmooney)
[18:21:04] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: 10Itamar Givon)
[18:21:22] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:21:34] <icinga-wm>	 PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:21:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic
[18:21:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic
[18:29:12] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10607012 (10Krinkle) I've written up my analysis and proposal at: https://www.mediawiki.org/wiki/Requests_f...
[18:33:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:45:22] <brett>	 !log import trafficserver 9.2.9-1wm1 into bullseye-wikimedia
[18:45:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:48] <brett>	 !log import trafficserver 9.2.9-1wm1 into bullseye-wikimedia (T388035)
[18:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:51] <stashbot>	 T388035: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035
[18:52:22] <wikibugs>	 (03PS1) 10Dzahn: zuul: remove gearman wait queue monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1124857 (https://phabricator.wikimedia.org/T388041)
[19:00:04] <jouncebot>	 hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1900)
[19:03:24] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s8 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:04:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74098 and previous config saved to /var/cache/conftool/dbconfig/20250305-190403-root.json
[19:14:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3554 MB (3% inode=98%): /tmp 3554 MB (3% inode=98%): /var/tmp 3554 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[19:15:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74099 and previous config saved to /var/cache/conftool/dbconfig/20250305-191550-root.json
[19:16:16] <wikibugs>	 (03PS1) 10Gergő Tisza: Preserve usesul3 flag during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124860 (https://phabricator.wikimedia.org/T375788)
[19:16:35] <wikibugs>	 (03PS1) 10Gergő Tisza: Preserve usesul3 flag during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124861 (https://phabricator.wikimedia.org/T375788)
[19:17:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124860 (https://phabricator.wikimedia.org/T375788) (owner: 10Gergő Tisza)
[19:17:40] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124861 (https://phabricator.wikimedia.org/T375788) (owner: 10Gergő Tisza)
[19:19:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74100 and previous config saved to /var/cache/conftool/dbconfig/20250305-191909-root.json
[19:22:11] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall)
[19:27:16] <wikibugs>	 (03PS1) 10Bking: cloudelastic: migrate cloudelastic1008 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1124863 (https://phabricator.wikimedia.org/T387904)
[19:27:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:27:55] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124863 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking)
[19:30:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74101 and previous config saved to /var/cache/conftool/dbconfig/20250305-193056-root.json
[19:30:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1008* for ban host prior to reimage - bking@cumin2002 - T387904
[19:31:02] <stashbot>	 T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904
[19:31:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1008* for ban host prior to reimage - bking@cumin2002 - T387904
[19:34:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74102 and previous config saved to /var/cache/conftool/dbconfig/20250305-193414-root.json
[19:41:18] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:46:01] <wikibugs>	 (03PS1) 10Gergő Tisza: Clean up SUL3 config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124865 (https://phabricator.wikimedia.org/T384007)
[19:46:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74103 and previous config saved to /var/cache/conftool/dbconfig/20250305-194601-root.json
[19:46:02] <wikibugs>	 (03PS1) 10Gergő Tisza: Roll out SUL3 signup to 1% of users on most group 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124866 (https://phabricator.wikimedia.org/T384007)
[19:46:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124865 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza)
[19:46:53] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124866 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza)
[19:49:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74104 and previous config saved to /var/cache/conftool/dbconfig/20250305-194920-root.json
[19:54:18] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10607356 (10VRiley-WMF) 05Open→03Resolved This hard drive has been replaced.
[20:01:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74105 and previous config saved to /var/cache/conftool/dbconfig/20250305-200106-root.json
[20:04:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74106 and previous config saved to /var/cache/conftool/dbconfig/20250305-200426-root.json
[20:04:55] <wikibugs>	 (03PS1) 10Arlolra: Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867
[20:07:33] <swfrench-wmf>	 jouncebot: nowandnext
[20:07:33] <jouncebot>	 For the next 0 hour(s) and 52 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1900)
[20:07:33] <jouncebot>	 In 0 hour(s) and 52 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2100)
[20:09:12] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4052.ulsfo.wmnet} and A:cp for 9.2.9-1wm1
[20:09:45] <swfrench-wmf>	 dduvall: I see group1 rolled during the primary train window earlier today. any objections if I use the rest of your window to make some mediawiki capacity right-sizing changes?
[20:11:24] <dduvall>	 swfrench-wmf: no objection from me
[20:11:32] <swfrench-wmf>	 dduvall: great, thank you!
[20:11:53] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4052.ulsfo.wmnet} and A:cp for 9.2.9-1wm1
[20:12:39] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:04-1] Invert Parsoid read view wiktionary configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra)
[20:13:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10607402 (10HCoplin-WMF) Just tested with dashboards I previously didn't have access to, and...
[20:15:33] <wikibugs>	 (03CR) 10Scott French: "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[20:15:41] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): right-size given current traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[20:16:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74107 and previous config saved to /var/cache/conftool/dbconfig/20250305-201612-root.json
[20:16:52] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:04-1] "Otherwise, the list matches what I see on metawiki." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra)
[20:17:19] <wikibugs>	 (03Merged) 10jenkins-bot: mw-(api-ext|web): right-size given current traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[20:19:32] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[20:20:04] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[20:20:16] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[20:20:26] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[20:21:55] <wikibugs>	 (03CR) 10Arlolra: Invert Parsoid read view wiktionary configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra)
[20:22:31] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra)
[20:28:05] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] cloudelastic: migrate cloudelastic1008 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1124863 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking)
[20:34:22] <swfrench-wmf>	 rzl: I'm back now
[20:34:30] * swfrench-wmf shakes fist at computer
[20:34:33] <rzl>	 👍
[20:38:00] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[20:38:12] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[20:38:26] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[20:38:35] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[20:39:54] <swfrench-wmf>	 !log right-sized capacity distribution between mw-(api-ext|web) main and next releases - T383845
[20:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:57] <stashbot>	 T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[20:45:51] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] Enable Search AB test for en wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang)
[20:51:29] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:04-1] Invert Parsoid read view wiktionary configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra)
[20:55:19] <wikibugs>	 (03CR) 10Bernard Wang: Enable Search AB test for en wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang)
[20:56:39] <wikibugs>	 (03PS2) 10Arlolra: Turn on Parsoid Read Views for 44 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505)
[20:56:40] <wikibugs>	 (03PS2) 10Arlolra: Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867
[20:57:28] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Turn on Parsoid Read Views for 44 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: 10Arlolra)
[20:57:41] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:57:50] <wikibugs>	 (03PS3) 10Arlolra: Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867
[20:57:59] <wikibugs>	 (03CR) 10Arlolra: Invert Parsoid read view wiktionary configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra)
[20:59:00] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2100).
[21:00:05] <jouncebot>	 bwang, arlolra, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:21] <arlolra>	 o/
[21:01:54] <bwang>	 hi
[21:01:55] <bwang>	 I'm here
[21:03:22] <tgr_>	 o/
[21:04:32] <bwang>	 Sorry, I just have 1 patch to backport but it's in the table 3 times haha
[21:05:10] <wikibugs>	 (03CR) 10Bernard Wang: [C:04-1] Enable Search AB test for en wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang)
[21:05:51] <bwang>	 Sorry, it's not ready yet
[21:06:28] <tgr_>	 I can deploy
[21:06:53] <wikibugs>	 (03PS3) 10Bernard Wang: Enable Search AB test for en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510
[21:07:51] <tgr_>	 arlolra: can I deploy the two config changes together?
[21:07:57] <arlolra>	 Yes pelase
[21:08:00] <arlolra>	 please
[21:08:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: 10Arlolra)
[21:08:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra)
[21:09:35] <wikibugs>	 (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for 44 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: 10Arlolra)
[21:09:37] <wikibugs>	 (03Merged) 10jenkins-bot: Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra)
[21:10:06] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124551|Turn on Parsoid Read Views for 44 wiktionaries (T387505)]], [[gerrit:1124867|Invert Parsoid read view wiktionary configs]]
[21:10:09] <stashbot>	 T387505: Parsoid Read Views to Wiktionary deploy ~2025-03-03 - https://phabricator.wikimedia.org/T387505
[21:13:26] <logmsgbot>	 !log tgr@deploy2002 tgr, arlolra: Backport for [[gerrit:1124551|Turn on Parsoid Read Views for 44 wiktionaries (T387505)]], [[gerrit:1124867|Invert Parsoid read view wiktionary configs]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:13:57] <wikibugs>	 (03PS1) 10Scott French: mw-web: additional right-sizing tweaks for next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845)
[21:14:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10607511 (10phaultfinder)
[21:14:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3649 MB (3% inode=98%): /tmp 3649 MB (3% inode=98%): /var/tmp 3649 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[21:15:42] <arlolra>	 tgr_: looks good to continue
[21:16:28] <logmsgbot>	 !log tgr@deploy2002 tgr, arlolra: Continuing with sync
[21:17:00] <wikibugs>	 (03CR) 10Scott French: "Thanks in advance for the review, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[21:17:26] <wikibugs>	 (03CR) 10Gergő Tisza: [C:04-1] "We forgot to deploy this, oops." [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01)
[21:20:34] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+2] flink-app chart: Support per-chart logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 (owner: 10Ebernhardson)
[21:21:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/next at eqiad: 15.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:21:21] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] mw-web: additional right-sizing tweaks for next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[21:21:33] <swfrench-wmf>	 ^ this was me - I have a patch to resize
[21:22:04] <swfrench-wmf>	 tgr_: after this backport completes, could you please pause so I can tweak this
[21:22:28] <wikibugs>	 (03Merged) 10jenkins-bot: flink-app chart: Support per-chart logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 (owner: 10Ebernhardson)
[21:22:29] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124551|Turn on Parsoid Read Views for 44 wiktionaries (T387505)]], [[gerrit:1124867|Invert Parsoid read view wiktionary configs]] (duration: 12m 23s)
[21:22:32] <stashbot>	 T387505: Parsoid Read Views to Wiktionary deploy ~2025-03-03 - https://phabricator.wikimedia.org/T387505
[21:22:40] <swfrench-wmf>	 tgr_: please pause here
[21:22:51] <tgr_>	 ack
[21:22:55] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mw-web: additional right-sizing tweaks for next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[21:23:17] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:23:30] <arlolra>	 tgr_: thanks
[21:24:09] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:24:17] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:24:28] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web: additional right-sizing tweaks for next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French)
[21:25:42] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[21:25:54] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[21:26:03] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[21:26:11] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[21:26:15] <jinxer-wm>	 FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:26:34] <swfrench-wmf>	 arlolra: what exactly do these patches do in practice?
[21:26:50] <swfrench-wmf>	 we're seeing a rather large bump in latency
[21:28:29] <rzl>	 oh, we sure are https://grafana.wikimedia.org/goto/TAIaXPpNg
[21:28:54] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[21:29:03] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:30:27] <rzl>	 I see a small bump in worker saturation at 20:18 associated with swfrench-wmf's deployment, and a bigger one at 21:17 associated with tgr_/arlolra's
[21:31:03] <rzl>	 the initial spike in latency didn't hang around, but it's stabilizing *much* higher than previously, and the worker saturation isn't going anywhere
[21:31:14] <rzl>	 if this wasn't expected, please strongly consider a rollback while investigating
[21:31:15] <jinxer-wm>	 FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:31:48] <rzl>	 tgr_, arlolra: (and if you're here digging, please say so -- if it gets worse and I don't hear from you, I may roll back)
[21:32:51] <swfrench-wmf>	 I'm concerned that container network rx on mw-web is way up vs. before
[21:33:11] <hashar>	 cache text got invalidated I guess  https://grafana.wikimedia.org/d/O9zAmeOWz/ats-cache-operations?orgId=1&from=now-3h&to=now&viewPanel=4
[21:33:38] <hashar>	 there is a large bump of "cache_text fresh backend"
[21:33:43] <tgr_>	 at a glance it should only affect a bunch of smaller wiktionaries
[21:33:47] <swfrench-wmf>	 good catch, hashar!
[21:34:04] <tgr_>	 so not a lot of traffic
[21:35:08] <rzl>	 for context the difference in php resource consumption is about 10% of the fleet -- we were using a little under 40% of workers and are now using a little under 50%
[21:35:28] <tgr_>	 should I roll back?
[21:35:36] <rzl>	 even if it wasn't a lot of traffic in CDN terms, if we invalidated the cache for all of it, it's a lot of traffic in app layer terms
[21:35:48] <tgr_>	 I am not familiar with the feature and I guess arlolra left already
[21:36:01] <hashar>	 roll back so
[21:36:07] <hashar>	 I am happy to assist :)
[21:36:26] <hashar>	 then we can see whether the latency is restored
[21:36:29] <tgr_>	 but 10% of CPU usage for cache invalidation of what's probably less than 1% of our content would be surprising
[21:36:32] <tgr_>	 ok
[21:36:33] <hnowlan>	 There's been a significant jump in job insertion https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s&from=now-3h&to=now
[21:37:05] <swfrench-wmf>	 that would explain why jobrunners are running hot
[21:37:31] <tgr_>	 eh, scap backport --revert can't handle stacked commits
[21:37:32] <rzl>	 tgr_: yes, let's roll back please -- this isn't immediately an emergency, so feel free to do so in a leisurely fashion
[21:37:34] <tgr_>	 just a sec
[21:37:47] <rzl>	 if arlolra wants to collect any data first that's fine, maybe ping them out of IRC
[21:38:04] <wikibugs>	 (03PS1) 10Gergő Tisza: Revert "Invert Parsoid read view wiktionary configs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124874
[21:38:23] <wikibugs>	 (03PS1) 10Gergő Tisza: Revert "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875
[21:38:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Revert "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875 (owner: 10Gergő Tisza)
[21:38:42] <wikibugs>	 (03PS2) 10Gergő Tisza: Revert "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875
[21:38:52] <hashar>	 I have poked content-transformers in their thread on Slack
[21:38:55] <arlolra>	 swfrench-wmf: they make parsoid the default wikitext parser
[21:39:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124874 (owner: 10Gergő Tisza)
[21:39:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875 (owner: 10Gergő Tisza)
[21:39:59] <swfrench-wmf>	 arlolra: that's in theory limited to 44 wiktionaries?
[21:40:13] <arlolra>	 yup
[21:40:31] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Invert Parsoid read view wiktionary configs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124874 (owner: 10Gergő Tisza)
[21:40:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875 (owner: 10Gergő Tisza)
[21:41:02] <arlolra>	 swfrench-wmf: it's already enabled for all of wikivoyage and most other wiktionaries
[21:41:03] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124874|Revert "Invert Parsoid read view wiktionary configs"]], [[gerrit:1124875|Revert "Turn on Parsoid Read Views for 44 wiktionaries"]]
[21:41:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:41:31] <hashar>	 hmm
[21:41:39] <hashar>	 the alarm on the canary resolved as part of the deploy?
[21:42:18] <swfrench-wmf>	 hashar: latency is slowly trending down, so I think that one just slipped below the alert threshold
[21:42:48] <swfrench-wmf>	 arlolra: got it, thanks. have other similarly sized enrollments in parsoid read views caused similar latency impact previously?
[21:42:59] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Enable Search AB test for en wiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang)
[21:43:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:43:19] <arlolra>	 we've been deploying to a set of ~40 wikis the past few weeks
[21:43:25] <rzl>	 that might be an alerting bug, 6.25% idle is still over the-- yeah okay
[21:44:01] * subbu catches up with backlog
[21:44:02] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1124874|Revert "Invert Parsoid read view wiktionary configs"]], [[gerrit:1124875|Revert "Turn on Parsoid Read Views for 44 wiktionaries"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:44:06] <hashar>	 and that alarm has a bit.ly link 🤭
[21:44:08] <rzl>	 "bug" or some kind of transient state associated with the deploy, but either way they're very much consistently saturated
[21:44:12] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[21:44:21] <rzl>	 hashar: if you click on it, it will explain why :)
[21:45:16] <subbu>	 per arlo, we have been deploying these for the last 3 weeks ... and haven't had any alerts thus far, and these are much smaller wikis than the previous wikis we rolled out to.
[21:45:24] <hashar>	 rzl: I imagine if we have a cluster fuck issue, we surely want a link and a doc hosted outside of our domains/cluster etc :-]
[21:45:53] <arlolra>	 swfrench-wmf: this is the first time we've seen any alert
[21:45:58] <rzl>	 subbu: yeah, for clarity -- not opposed to the content of the change, I just want to make sure we understand why this had such a big effect on the app layer
[21:46:35] <subbu>	 understood .. i am just thinking aloud here ... those recent wikis all have a few thousand pages at most.
[21:46:44] <denisse>	 rzl: +1, understanding why it's happening is important.
[21:46:57] <wikibugs>	 (03CR) 10Bernard Wang: Enable Search AB test for en wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang)
[21:46:59] <subbu>	 arlolra, wonder if the invert patch had something else we missed.
[21:47:12] <swfrench-wmf>	 ^ this is what I'm wondering
[21:47:14] <bwang>	 Hi, so this patch is ready https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124510/2
[21:47:21] <subbu>	 there were 2 config patches we deployed .. one was the rollout to 44 wiktionaries.
[21:47:28] <bwang>	 If there’s still time after in this window
[21:47:34] <subbu>	 the second one was to invert the config to simplify the config.
[21:47:42] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[21:47:50] <arlolra>	 we could roll them out individually
[21:48:05] <subbu>	 ya.
[21:48:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:49:21] <swfrench-wmf>	 alright, so if we see a second wave of cache invalidation, we may experience a second bump
[21:49:28] <rzl>	 yeah, was thinking the same
[21:49:30] <tgr_>	 I wonder if maybe wikimedia-config does not handle three-level settings (default => false, wiktionary => true, <specific wiktionary> => false) correctly?
[21:49:50] <subbu>	 we did that for wikivoyages though?
[21:50:09] <rzl>	 swfrench-wmf: the bad news is it'll be the same size as the first bump, so we're already committed -- but the good news is we know we have the resources to handle that
[21:50:25] <swfrench-wmf>	 agreed, yeah
[21:50:34] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124874|Revert "Invert Parsoid read view wiktionary configs"]], [[gerrit:1124875|Revert "Turn on Parsoid Read Views for 44 wiktionaries"]] (duration: 09m 30s)
[21:50:53] <tgr_>	 is there actual cache invalidation involved? if the cache is merely split on parser type, the old parser entries should still be there in the parser cache
[21:51:02] <subbu>	 yes.
[21:51:15] <subbu>	 yes to the cache is split by parser type.
[21:51:17] <swfrench-wmf>	 ah, that's good to know
[21:51:19] <rzl>	 oh, great
[21:52:18] * subbu doesn't see anything obviously broken with the invert.
[21:52:31] <subbu>	 so, once all the other patches are deployed, can we re-try the first config patch?
[21:53:22] <subbu>	 assuming we have time and there is nothing else pending after this window. if not, we can try this again tomorrow.
[21:53:25] <arlolra>	 we're still doing it with wikivoyage and zhwikivoyage false
[21:53:31] <subbu>	 ya
[21:53:33] <rzl>	 I'd like to get swfrench-wmf's resizing followup out too, if that's still outstanding, but otherwise no objection from me, as long as we're paying attention
[21:53:49] <hashar>	 not sure if this matters, but there is a graph showing that "Parser cache save reason" had a rise for "view" | https://grafana.wikimedia.org/d/a97c66ff-0e10-4d2a-b9e1-37b96b7b4d35/parser-cache-misses?orgId=1&from=now-3h&to=now&viewPanel=32
[21:53:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye
[21:53:59] <rzl>	 like I say this didn't actually break anything, it just swung the graph unexpectedly
[21:54:19] <swfrench-wmf>	 rzl: it's applied - I did that somewhat urgently when I thought I was the source of the alert :)
[21:54:23] <tgr_>	 jouncebot: next
[21:54:23] <jouncebot>	 In 0 hour(s) and 5 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2200)
[21:54:26] <rzl>	 okay cool, thanks
[21:54:34] <rzl>	 only outstanding in the other sense, then :D
[21:54:37] <tgr_>	 so there's that, not sure if they plan to use it
[21:54:38] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1008.eqiad.wmnet with OS bullseye
[21:54:57] <tgr_>	 James_F: ^
[21:55:01] <wikibugs>	 (03CR) 10Bking: [C:03+2] cloudelastic: migrate cloudelastic1008 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1124863 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking)
[21:55:47] <subbu>	 tgr_, so the revert is now live for the last 5 mins, right?
[21:56:05] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye
[21:56:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10607629 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1181.eqiad.wmnet with OS bullseye
[21:56:20] <tgr_>	 yeah, should be live
[21:57:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye
[21:59:23] <hashar>	 then if a lot of jobs are still in the queue, it would take a while to process them
[21:59:36] <subbu>	 so, that parsercache panel that hashar shared is not showing any change in the # of saves because of view even after the revert .. it is still double what it was at the start of the hour.
[21:59:38] <subbu>	 ah, okay.
[21:59:39] <hashar>	 I could not find a graph showing the size of the pending queues though, only rates
[22:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2200)
[22:00:08] <hashar>	 that is all fringe theory, cause really I have long forgot/lost contact with jobs/jobqueue/parser etc
[22:01:04] <tgr_>	 what job is this? htmlCacheUpdate?
[22:01:29] <hashar>	 parsoidCachePrewarm apparently
[22:01:52] <hashar>	 hnowlan shared https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s&from=now-3h&to=now
[22:02:01] <wikibugs>	 (03PS1) 10Dzahn: aptrepo: replace http with https in downloads.linux.hpe.com URLs [puppet] - 10https://gerrit.wikimedia.org/r/1124877 (https://phabricator.wikimedia.org/T388042)
[22:02:39] <hashar>	 parsoidCachePrewarm went from 70 jobs / s to up to 226 jobs / s
[22:03:42] <hashar>	 if I zoom it out we had a similar behavior this morning around 7:20
[22:04:06] <subbu>	 ya .. and it hasn't stopped now after the revert.
[22:04:43] <wikibugs>	 (03CR) 10Dzahn: "curl http://downloads.linux.hpe.com/SDR/repo/mcp/" [puppet] - 10https://gerrit.wikimedia.org/r/1124877 (https://phabricator.wikimedia.org/T388042) (owner: 10Dzahn)
[22:05:44] <wikibugs>	 (03PS2) 10Dzahn: aptrepo: replace http with https in downloads.linux.hpe.com URLs [puppet] - 10https://gerrit.wikimedia.org/r/1124877 (https://phabricator.wikimedia.org/T388042)
[22:06:49] <subbu>	 Can we retry just the first config patch now?
[22:07:02] <subbu>	 looks like wikifunctions doesn't have anything to deploy now?
[22:07:13] <wikibugs>	 (PS1) Ebernhardson: flink-app chart: Repair custom log configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1124878
[22:07:37] <hashar>	 (I see the same bump at 7:08 this morning for the cache_text fresh backend which I have pasted earlier https://grafana.wikimedia.org/d/O9zAmeOWz/ats-cache-operations?orgId=1&viewPanel=4  )
[22:07:40] <hashar>	 so that looks similar
[22:07:52] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10607677 (Jclark-ctr) Open→Resolved Received additional drives and replaced
[22:08:27] <tgr_>	 these jobs are triggered on page view, right?
[22:09:48] <tgr_>	 I added the first 10 of the 44 wiktionaries to the pageview tool and it says 24k views a day (so <100/min for all 44 unless there is a huge outlier)
[22:10:14] <wikibugs>	 (CR) Ebernhardson: [C:+2] flink-app chart: Repair custom log configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1124878 (owner: Ebernhardson)
[22:10:16] <tgr_>	 but job stats went up by like 200/sec
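A back-of-envelope check of the estimate above, as a minimal sketch. It assumes the 24k views/day figure for the 10 sampled wiktionaries scales linearly to all 44 (no huge outlier) and takes the roughly 200/sec jump in job stats at face value; the variable and function names are illustrative only.

```python
# Rough sanity check: expected page-view rate vs. observed job-rate increase.
# Assumption: the 10 sampled wiktionaries are representative of all 44.
SAMPLED_WIKIS = 10
TOTAL_WIKIS = 44
SAMPLED_VIEWS_PER_DAY = 24_000       # figure quoted from the pageview tool
OBSERVED_EXTRA_JOBS_PER_SEC = 200    # approximate jump quoted from job stats

MINUTES_PER_DAY = 24 * 60


def estimated_views_per_minute(sample_views_per_day, sampled, total):
    """Scale the sampled daily views linearly to all wikis, then convert to per-minute."""
    return sample_views_per_day / sampled * total / MINUTES_PER_DAY


views_per_min = estimated_views_per_minute(
    SAMPLED_VIEWS_PER_DAY, SAMPLED_WIKIS, TOTAL_WIKIS
)
views_per_sec = views_per_min / 60
ratio = OBSERVED_EXTRA_JOBS_PER_SEC / views_per_sec

print(f"estimated views: {views_per_min:.1f}/min ({views_per_sec:.2f}/sec)")
print(f"observed job increase is roughly {ratio:.0f}x the estimated view rate")
```

Under these assumptions the estimated view rate stays below 100/min, so page views on these wikis alone would be about two orders of magnitude short of the observed increase.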
[22:10:35] <wikibugs>	 (PS1) Daimona Eaytoy: Revert "Let sysops add/remove the event-organizer group by default" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738)
[22:10:44] <wikibugs>	 (CR) CI reject: [V:-1] Revert "Let sysops add/remove the event-organizer group by default" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738) (owner: Daimona Eaytoy)
[22:10:46] <wikibugs>	 (PS1) Cwhite: grafana: add quotes around interpolated log variables [puppet] - https://gerrit.wikimedia.org/r/1124880
[22:11:30] <tgr_>	 rzl: hashar: ok to give it another try?
[22:11:40] <wikibugs>	 (PS2) Daimona Eaytoy: Revert "Let sysops add/remove the event-organizer group by default" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738)
[22:11:58] <rzl>	 fine by me
[22:12:05] <wikibugs>	 (Merged) jenkins-bot: flink-app chart: Repair custom log configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1124878 (owner: Ebernhardson)
[22:12:16] <hashar>	 +1 I guess
[22:12:23] <hashar>	 but I will stop here, it is too late for me
[22:12:36] <hashar>	 unless you need someone to drive scap?
[22:12:55] <wikibugs>	 (PS1) Gergő Tisza: Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124881 (https://phabricator.wikimedia.org/T387505)
[22:13:03] <tgr_>	 no, I can do it
[22:14:12] <subbu>	 thanks tgr_ 
[22:14:39] <hashar>	 great thanks
[22:15:43] <hashar>	 I'll check tomorrow morning when I run the train :)
[22:15:57] <wikibugs>	 (PS1) Ebernhardson: flink-app: Update chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1124882
[22:16:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124881 (https://phabricator.wikimedia.org/T387505) (owner: Gergő Tisza)
[22:16:17] <wikibugs>	 (PS3) Daimona Eaytoy: Revert "Let sysops add/remove the event-organizer group by default" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738)
[22:17:14] <wikibugs>	 (Merged) jenkins-bot: Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124881 (https://phabricator.wikimedia.org/T387505) (owner: Gergő Tisza)
[22:17:45] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124881|Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" (T387505)]]
[22:17:49] <stashbot>	 T387505: Parsoid Read Views to Wiktionary deploy ~2025-03-03 - https://phabricator.wikimedia.org/T387505
[22:18:17] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738) (owner: Daimona Eaytoy)
[22:18:22] <icinga-wm>	 PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/3fed3640d35b7e68de691c1a8e75c92260c0dc2c19c4eabc8af14bfa6f7bb315/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops
[22:18:26] <wikibugs>	 (CR) Ebernhardson: [C:+2] flink-app: Update chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1124882 (owner: Ebernhardson)
[22:20:16] <wikibugs>	 (Merged) jenkins-bot: flink-app: Update chart version [deployment-charts] - https://gerrit.wikimedia.org/r/1124882 (owner: Ebernhardson)
[22:20:45] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1124881|Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" (T387505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:22:30] <tgr_>	 subbu: do you want to inspect something or can it go live?
[22:23:26] <subbu>	 it can go live .. i don't think canaries will reflect any change in latencies or jobs.
[22:23:39] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[22:24:24] <tgr_>	 parsoidCachePrewarm seems to be running on all wikis btw
[22:24:25] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:24:34] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:24:43] <tgr_>	 or at least logstash has a steady stream of trigger:parsoidCachePrewarm events on large Wikipedias
[22:25:42] <subbu>	 they are queued for Parsoid when a legacy parser view generates a fresh parse .. to ensure that Parsoid's HTML is ready for when a page might be opened in VE
[22:26:56] <subbu>	 https://phabricator.wikimedia.org/T327164
[22:28:09] <tgr_>	 so maybe there was a scraper with unfortunate timing, and it's not at all related to wiktionaries?
[22:29:05] <wikibugs>	 (Abandoned) D3r1ck01: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: D3r1ck01)
[22:29:18] <wikibugs>	 (CR) D3r1ck01: "Ack!" [puppet] - https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: D3r1ck01)
[22:29:21] <tgr_>	 granted the timing matches very well
[22:29:29] <subbu>	 ya ...
[22:29:51] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124881|Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" (T387505)]] (duration: 12m 06s)
[22:29:55] <stashbot>	 T387505: Parsoid Read Views to Wiktionary deploy ~2025-03-03 - https://phabricator.wikimedia.org/T387505
[22:30:53] <subbu>	 https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%2520prometheus%252Fk8s&from=now-30d&to=now&viewPanel=18 shows a cycle with non-zero peaks around 19:20 
[22:31:06] <subbu>	 which is also when the wiktionary config changes went live.
[22:31:07] <tgr_>	 scap says "21:17:24 Started sync-prod-k8s"
[22:31:24] <tgr_>	 and the spike starts at 17:30
[22:31:24] <subbu>	 sorry 21:20
[22:32:19] <swfrench-wmf>	 so, one additional observation: one of the reasons the effect of this was "amplified" is that it seems the PHP 8.1 deployments of mediawiki took the brunt of the load from this
[22:32:39] <wikibugs>	 (PS1) Ebernhardson: flink-app: Provide full log4j-console.properties [deployment-charts] - https://gerrit.wikimedia.org/r/1124884
[22:32:41] <swfrench-wmf>	 you can see a sizable (~ 20%) bump in RPS on them when the backports went out
[22:32:49] <swfrench-wmf>	 which is not visible on the 7.4 deployments
[22:33:00] <swfrench-wmf>	 (though both experience elevated latency)
[22:33:27] <subbu>	 so, looks like this retry went through fine so far?
[22:34:29] <swfrench-wmf>	 the "only" difference between external traffic directed to the 8.1-based deployments vs. 7.4 is the kinds of clients: these are all "real people using browsers" (e.g., accept cookies and run javascript)
[22:35:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[22:36:13] <subbu>	 8.1 is real people?
[22:36:28] <tgr_>	 yeah, no spike this time
[22:37:12] <swfrench-wmf>	 subbu: as a simplification, yeah - in the sense that only clients presenting an enrollment cookie (which is granted by js that runs in-browser) are routed there
[22:37:32] <subbu>	 got it.
[22:38:03] <subbu>	 anyway, we can rule out wiktionaries config having been the source of the spike. The only thing left to rule out is the invert patch. should we try that tomorrow or now?
[22:38:40] <tgr_>	 what's weird is, there was a spike in GET requests: https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&from=now-3h&to=now&viewPanel=62
[22:38:51] <tgr_>	 how can parsoid cause that?
[22:39:01] <tgr_>	 jobs are POST requests, right?
[22:39:12] <wikibugs>	 (PS2) Ebernhardson: flink-app: Provide full log4j-console.properties [deployment-charts] - https://gerrit.wikimedia.org/r/1124884
[22:39:28] <swfrench-wmf>	 tgr_: that's exactly the increase I was talking about, yeah
[22:39:37] <tgr_>	 or maybe the method selector is just not working for that metric
[22:40:10] <swfrench-wmf>	 ah, that's possible too
[22:40:48] <swfrench-wmf>	 tgr_: importantly, as you point out, that's traffic to mw-web, not mw-jobrunner
[22:40:56] <swfrench-wmf>	 so, the source should not be jobs
[22:42:39] <wikibugs>	 (PS3) Ebernhardson: flink-app: Provide full log4j-console.properties [deployment-charts] - https://gerrit.wikimedia.org/r/1124884
[22:42:53] <tgr_>	 hm, right
[22:43:04] <tgr_>	 there was a spike for both jobs and web
[22:43:21] <tgr_>	 at around 500/sec
[22:43:50] <tgr_>	 which seems ridiculously high for a set of wiktionaries that don't include the big European ones
[22:44:00] <tgr_>	 maybe the job is making an API request?
[22:44:31] <subbu>	 the job spike is smaller .. but expected, because the web spike causes a bunch of cache misses on wikis.
[22:45:02] <tgr_>	 if it's a large wiki, I'd assume a major template got reparsed and then that triggered a bunch of recursive parses, but these wiktionaries are fairly small, right?
[22:46:22] <wikibugs>	 (PS4) Ebernhardson: flink-app: Provide full log4j-console.properties [deployment-charts] - https://gerrit.wikimedia.org/r/1124884
[22:46:30] <tgr_>	 anyway, should we try the other patch?
[22:46:34] <subbu>	 yes, many of the bigger wikis went out in earlier deploys (enwikt is still not on parsoid).
[22:46:41] <subbu>	 works for me. arlolra ?
[22:46:44] <tgr_>	 which is theoretically a noop
[22:46:48] <subbu>	 yes.
[22:49:14] <arlolra>	 sure
[22:49:38] <wikibugs>	 (CR) Ebernhardson: [C:+2] flink-app: Provide full log4j-console.properties [deployment-charts] - https://gerrit.wikimedia.org/r/1124884 (owner: Ebernhardson)
[22:50:10] <wikibugs>	 (PS1) Gergő Tisza: Revert^2 "Invert Parsoid read view wiktionary configs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124885
[22:50:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124885 (owner: Gergő Tisza)
[22:50:47] <subbu>	 all of this is good training for us on CTT :-)
[22:51:20] <wikibugs>	 (Merged) jenkins-bot: Revert^2 "Invert Parsoid read view wiktionary configs" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124885 (owner: Gergő Tisza)
[22:51:24] <wikibugs>	 (Merged) jenkins-bot: flink-app: Provide full log4j-console.properties [deployment-charts] - https://gerrit.wikimedia.org/r/1124884 (owner: Ebernhardson)
[22:51:47] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124885|Revert^2 "Invert Parsoid read view wiktionary configs"]]
[22:53:34] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:53:49] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:54:45] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1124885|Revert^2 "Invert Parsoid read view wiktionary configs"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:57:35] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[22:58:17] <wikibugs>	 (PS1) Ebernhardson: cirrus: Drop cloudelastic custom logConfiguration [deployment-charts] - https://gerrit.wikimedia.org/r/1124886
[22:59:44] <swfrench-wmf>	 subbu: is it expected that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124885 is not a noop, at least according to the config diff check?
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2300)
[23:00:08] <wikibugs>	 (CR) Ebernhardson: [C:+2] cirrus: Drop cloudelastic custom logConfiguration [deployment-charts] - https://gerrit.wikimedia.org/r/1124886 (owner: Ebernhardson)
[23:00:12] <swfrench-wmf>	 for example, it seems to flip frwiktionary from wgParserMigrationEnableParsoidDiscussionTools: true to false
[23:00:22] <swfrench-wmf>	 https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/3807/console
[23:01:15] <arlolra>	 swfrench-wmf: that was overlooked in a previous revert we had made, it's fine
[23:01:38] <wikibugs>	 (Merged) jenkins-bot: cirrus: Drop cloudelastic custom logConfiguration [deployment-charts] - https://gerrit.wikimedia.org/r/1124886 (owner: Ebernhardson)
[23:01:47] <arlolra>	 but maybe it explains some things
[23:02:00] <swfrench-wmf>	 arlolra: ah, got it - so the "invert" patch is not itself a noop
[23:02:56] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:03:04] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:03:16] <arlolra>	 it was intended to be a noop but we missed that that was changing
[23:03:23] <arlolra>	 the change is fine though
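The idea behind a config diff check like the one discussed above can be sketched as follows: compute the effective per-wiki settings before and after a patch and report every value that differs, so that a true noop yields an empty diff. This is only an illustration, not how the actual CI job is implemented; `diff_config` and `examplewiktionary` are hypothetical, and only the setting name and the frwiktionary true-to-false flip come from the discussion.

```python
def diff_config(before, after):
    """Return {wiki: {setting: (old, new)}} for every value that changed."""
    changes = {}
    for wiki in sorted(set(before) | set(after)):
        old_settings = before.get(wiki, {})
        new_settings = after.get(wiki, {})
        for key in sorted(set(old_settings) | set(new_settings)):
            old, new = old_settings.get(key), new_settings.get(key)
            if old != new:
                changes.setdefault(wiki, {})[key] = (old, new)
    return changes


KEY = "wgParserMigrationEnableParsoidDiscussionTools"

# Hypothetical effective config before and after the "invert" patch.
before = {
    "frwiktionary": {KEY: True},
    "examplewiktionary": {KEY: False},  # hypothetical wiki, for illustration
}
after = {
    "frwiktionary": {KEY: False},
    "examplewiktionary": {KEY: True},
}

changes = diff_config(before, after)
for wiki, settings in changes.items():
    for key, (old, new) in settings.items():
        print(f"{wiki}: {key}: {old} -> {new}")

# A patch that is really a noop produces no changes at all.
is_noop = not changes
```

Running the same comparison with identical inputs returns an empty dict, which is the signal that a refactoring-only patch changed nothing.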
[23:04:01] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124885|Revert^2 "Invert Parsoid read view wiktionary configs"]] (duration: 12m 13s)
[23:04:02] <subbu>	 arlolra, but looks like there are a bunch of other wiktionaries that flipped to true ... so, there is more going on there.
[23:04:11] <arlolra>	 so pageviews of cold frwiktionary talk pages?
[23:04:16] <tgr_>	 in any case, no spike this time
[23:04:36] <swfrench-wmf>	 to clarify, frwiktionary is just one example
[23:05:38] <tgr_>	 swfrench-wmf: is it ok to continue with the other (non-parsoid) backports or does someone intend to investigate more?
[23:06:38] <wikibugs>	 (CR) Jdlrobson: [C:+1] Enable Search AB test for en wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1124510 (owner: Bernard Wang)
[23:06:49] <subbu>	 looks like they are all wiktionaries with < 100 pages .. 
[23:07:13] <swfrench-wmf>	 tgr_: no objections on my end - things continue to stabilize
[23:07:23] <swfrench-wmf>	 rzl: any concerns?
[23:07:32] <subbu>	 I think this is good even if there are other non-noop changes. those other changes did surprise me though.
[23:08:16] <rzl>	 okay by me
[23:08:41] <tgr_>	 bwang: still around?
[23:08:43] <rzl>	 the jobworkers are still hot but trending in the right direction
[23:08:46] <arlolra>	 swfrench-wmf: sorry, surprising that we were working off incomplete data
[23:09:03] <rzl>	 *jobrunner workers
[23:10:06] <subbu>	 arlolra, https://aa.wiktionary.org/wiki/Special:AllPages is just Main Page .. so, an empty wiki but still lists 100 pages in Special:Statistics.
[23:10:14] <wikibugs>	 (PS1) Ebernhardson: flink-app chart: Use ECS logging configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1124887
[23:11:04] <subbu>	 Same for a couple others I checked. So, I think there are a number of "empty" wiktionaries which all flipped to Parsoid Read Views on the invert of config. We didn't realize it, but looks fine.
[23:12:01] <subbu>	 I expect the same happened with Scott inverted the wikivoyage config.
[23:12:10] <subbu>	 *when
[23:13:27] <subbu>	 tgr_, swfrench-wmf rzl thanks so much for hanging around and helping us work through this and keeping a close eye for issues.
[23:13:27] <rzl>	 consider: (1) https://usercontent.irccloud-cdn.com/file/zhjArS9X/image.png
[23:13:31] <rzl>	 (2) https://en.wikipedia.org/wiki/The_Great_Wave_off_Kanagawa
[23:13:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2208:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2208 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:13:52] <rzl>	 subbu: of course! thanks for digging into it when it didn't look as expected
[23:13:55] <wikibugs>	 (CR) Krinkle: Profiler: emit both statsd and dogstatsd (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: Cwhite)
[23:14:02] <arlolra>	 subbu: these are all closed wikis
[23:14:21] <subbu>	 arlolra, aha .. that explains it.
[23:14:32] <arlolra>	 yes, thank you tgr_ swfrench-wmf 
[23:14:35] <subbu>	 rzl, that is funny (great wave off kanagawa).
[23:15:00] <swfrench-wmf>	 thank you both for sticking around as well, and tgr_ for rolling forward-and-back :)
[23:15:06] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1124860 (https://phabricator.wikimedia.org/T375788) (owner: Gergő Tisza)
[23:15:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1124861 (https://phabricator.wikimedia.org/T375788) (owner: Gergő Tisza)
[23:15:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124865 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[23:15:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns
[23:15:53] <wikibugs>	 (Merged) jenkins-bot: Clean up SUL3 config [mediawiki-config] - https://gerrit.wikimedia.org/r/1124865 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[23:16:20] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[23:16:34] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.019e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[23:16:35] <wikibugs>	 (Merged) jenkins-bot: Preserve usesul3 flag during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1124860 (https://phabricator.wikimedia.org/T375788) (owner: Gergő Tisza)
[23:16:36] <wikibugs>	 (Merged) jenkins-bot: Preserve usesul3 flag during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - https://gerrit.wikimedia.org/r/1124861 (https://phabricator.wikimedia.org/T375788) (owner: Gergő Tisza)
[23:17:11] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124860|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124861|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124865|Clean up SUL3 config (T384007)]]
[23:17:15] <stashbot>	 T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788
[23:17:16] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007
[23:17:25] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.eqiad.wmnet with OS bullseye
[23:19:56] <toyofuku>	 tgr_: bwang's out for the day - we're gonna deploy tomorrow so no worries on that one
[23:20:02] <toyofuku>	 thank you though!!
[23:20:05] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1124860|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124861|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124865|Clean up SUL3 config (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:23:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2208:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2208 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:27:40] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:29:10] * subbu slowly backs away from the computer 
[23:29:43] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[23:36:04] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124860|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124861|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124865|Clean up SUL3 config (T384007)]] (duration: 18m 53s)
[23:36:08] <stashbot>	 T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788
[23:36:09] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007
[23:38:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124866 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[23:39:14] <wikibugs>	 (Merged) jenkins-bot: Roll out SUL3 signup to 1% of users on most group 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1124866 (https://phabricator.wikimedia.org/T384007) (owner: Gergő Tisza)
[23:39:41] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye
[23:39:41] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124866|Roll out SUL3 signup to 1% of users on most group 1 wikis (T384007)]]
[23:39:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10608104 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1181.eqiad.wmnet with OS bullseye
[23:41:23] <wikibugs>	 Puppet, SRE, Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10608114 (Jdlrobson-WMF) > This hits both the android and mobile tokens in our regex, and is correctly routed to the mobile site. Yes that's correct,...
[23:41:30] <wikibugs>	 (CR) Cwhite: Profiler: emit both statsd and dogstatsd (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: Cwhite)
[23:41:36] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10608111 (Jdlrobson-WMF) > As part of my analysis at T214998#10551073, I went through much of the long ta...
[23:42:20] <wikibugs>	 (PS1) Daimona Eaytoy: Drop $wmgCampaignEventsProgramsAndEventsDashboardEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/1124893 (https://phabricator.wikimedia.org/T387025)
[23:42:41] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1124866|Roll out SUL3 signup to 1% of users on most group 1 wikis (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:42:44] <stashbot>	 T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007
[23:43:39] <wikibugs>	 (PS2) Daimona Eaytoy: Drop $wmgCampaignEventsProgramsAndEventsDashboardEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/1124893 (https://phabricator.wikimedia.org/T387025)
[23:44:00] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124893 (https://phabricator.wikimedia.org/T387025) (owner: Daimona Eaytoy)
[23:47:37] <wikibugs>	 (PS1) Fabfur: Fix previous commit [debs/benthos] - https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098)
[23:49:34] <wikibugs>	 (CR) Fabfur: [C:+1] "seems a good idea to me, thanks for taking care of this!" [puppet] - https://gerrit.wikimedia.org/r/1124764 (https://phabricator.wikimedia.org/T380295) (owner: Muehlenhoff)
[23:49:59] <wikibugs>	 (PS1) Bartosz Dziewoński: Remove unused $wgDiscussionToolsABTest [mediawiki-config] - https://gerrit.wikimedia.org/r/1124895
[23:50:42] <wikibugs>	 (CR) Fabfur: [C:+1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: BCornwall)
[23:51:11] <wikibugs>	 (PS7) Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385)
[23:51:13] <wikibugs>	 (PS1) Bartosz Dziewoński: Remove unused $wgOATHAuthMultipleDevicesMigrationStage [mediawiki-config] - https://gerrit.wikimedia.org/r/1124896
[23:51:48] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[23:51:59] <wikibugs>	 (PS8) Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385)
[23:54:00] <wikibugs>	 (CR) Cwhite: Profiler: emit both statsd and dogstatsd (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: Cwhite)
[23:55:06] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122711 (owner: Bartosz Dziewoński)
[23:55:18] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124895 (owner: Bartosz Dziewoński)
[23:55:28] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124896 (owner: Bartosz Dziewoński)