[00:02:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2014.codfw.wmnet with OS bookworm
[00:02:28] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603562 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm executed with errors: - backu...
[00:03:54] (PS1) Daimona Eaytoy: Use namespaced Title and Html classes [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010)
[00:03:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:06:46] (PS2) Daimona Eaytoy: Use namespaced Title and Html classes [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010)
[00:10:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2013.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[00:10:54] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 647.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:11:17] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup2013']
[00:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[00:11:50] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup2013']
[00:14:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2013.codfw.wmnet with OS bookworm
[00:14:10] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603626 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm
[00:19:15] (PS1) Daimona Eaytoy: officewiki: Disable the event-organizer user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1124549 (https://phabricator.wikimedia.org/T387943)
[00:19:42] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124549 (https://phabricator.wikimedia.org/T387943) (owner: Daimona Eaytoy)
[00:20:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: Daimona Eaytoy)
[00:24:10] (CR) Zabe: [C:+1] Use namespaced Title and Html classes [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: Daimona Eaytoy)
[00:32:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2013.codfw.wmnet with reason: host reimage
[00:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3528 MB (3% inode=98%): /tmp 3528 MB (3% inode=98%): /var/tmp 3528 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[00:36:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2013.codfw.wmnet with reason: host reimage
[00:38:40] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124550
[00:38:40] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124550 (owner: TrainBranchBot)
[00:42:37] (PS1) Arlolra: Turn on Parsoid Read Views for 42 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505)
[00:43:01] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10603665 (Ladsgroup) I forgot to mention: This will be done as part of {T360589} First, we start serving 250px thumbnails gradually but sized to 220px,...
[00:50:30] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124550 (owner: TrainBranchBot)
[00:55:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[00:57:21] (CR) Arlolra: "Good point but, yeah, we can do that as part of the exercise of figuring what's left to do" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: Arlolra)
[00:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:59:09] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[01:00:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[01:00:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2013.codfw.wmnet with OS bookworm
[01:00:25] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603685 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2013.codfw.wmnet with OS bookworm completed: - backup2013 (**PA...
[01:01:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host backup2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:08:34] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124553
[01:08:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124553 (owner: TrainBranchBot)
[01:12:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup2014.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[01:18:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host backup2014.codfw.wmnet with OS bookworm
[01:18:33] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603721 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm
[01:28:24] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124553 (owner: TrainBranchBot)
[01:36:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup2014.codfw.wmnet with reason: host reimage
[01:40:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup2014.codfw.wmnet with reason: host reimage
[01:51:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/94060a3722501301746a3e179221819b7849ebe36f7ec016b239e19d7bf89883/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:51:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:54:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3523 MB (3% inode=98%): /tmp 3523 MB (3% inode=98%): /var/tmp 3523 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[02:00:14] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:05:27] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10603755 (Jhancock.wm) @Papaul i found a weird little thing. I racked ganeti2049 in B5, U40. There are three other servers in the same set of 4 on the switch. two...
[02:05:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[02:05:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup2014.codfw.wmnet with OS bookworm
[02:05:57] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603756 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host backup2014.codfw.wmnet with OS bookworm completed: - backup2014 (...
[02:05:59] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603757 (Jhancock.wm) Open→Resolved
[02:06:36] ops-codfw, SRE, Data-Persistence, DC-Ops: Q3:rack/setup/install backup201[34] - https://phabricator.wikimedia.org/T384973#10603760 (Jhancock.wm) @jcrespo this is complete
[02:11:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:16:26] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 29.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3594 MB (3% inode=98%): /tmp 3594 MB (3% inode=98%): /var/tmp 3594 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:21:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:29:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10603809 (phaultfinder)
[03:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3547 MB (3% inode=98%): /tmp 3547 MB (3% inode=98%): /var/tmp 3547 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[03:59:54] (CR) VolkerE: [C:+1] Use namespaced Title and Html classes [mediawiki-config] - https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: Daimona Eaytoy)
[04:05:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:28:08] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10603867 (Papaul) @Jhancock.wm thanks for checking. I see in netbox that ganetti2049 is rack in B4 and U41 and not U40 like you mentioned so i am guessing that you...
[04:54:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3489 MB (3% inode=98%): /tmp 3489 MB (3% inode=98%): /var/tmp 3489 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[04:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:59:09] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown
[05:41:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:11:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:24:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2166 db1226', diff saved to https://phabricator.wikimedia.org/P74066 and previous config saved to /var/cache/conftool/dbconfig/20250305-062402-marostegui.json
[06:24:32] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2166.codfw.wmnet
[06:24:37] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1226.eqiad.wmnet
[06:25:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1244 with weight 0 T387816', diff saved to https://phabricator.wikimedia.org/P74067 and previous config saved to /var/cache/conftool/dbconfig/20250305-062554-marostegui.json
[06:25:58] T387816: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T387816
[06:26:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T387816
[06:26:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1244 from API/vslow/dump T387816', diff saved to https://phabricator.wikimedia.org/P74068 and previous config saved to /var/cache/conftool/dbconfig/20250305-062629-marostegui.json
[06:26:48] (CR) Marostegui: [C:+2] mariadb: Promote db1244 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1124335 (https://phabricator.wikimedia.org/T387816) (owner: Gerrit maintenance bot)
[06:29:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1226.eqiad.wmnet
[06:30:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2166.codfw.wmnet
[06:30:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1226.eqiad.wmnet with reason: Index rebuild
[06:30:51] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Index rebuild
[06:30:58] !log Starting s4 eqiad failover from db1160 to db1244 - T387816
[06:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:01] T387816: Switchover s4 master (db1160 -> db1244) - https://phabricator.wikimedia.org/T387816
[06:31:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1244 to s4 primary T387816', diff saved to https://phabricator.wikimedia.org/P74069 and previous config saved to /var/cache/conftool/dbconfig/20250305-063124-marostegui.json
[06:32:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1160 T387816', diff saved to https://phabricator.wikimedia.org/P74070 and previous config saved to /var/cache/conftool/dbconfig/20250305-063216-marostegui.json
[06:35:36] (PS1) Marostegui: db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124576
[06:35:58] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1160.eqiad.wmnet
[06:36:02] (CR) Marostegui: [C:+2] db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1124576 (owner: Marostegui)
[06:39:43] (PS1) Marostegui: mariadb: Productionize db1250 [puppet] - https://gerrit.wikimedia.org/r/1124641 (https://phabricator.wikimedia.org/T385141)
[06:40:24] (CR) Marostegui: "Starting to clone this host, will eventually become a master, but not yet. That's why the master lines are critical are commented out." [puppet] - https://gerrit.wikimedia.org/r/1124641 (https://phabricator.wikimedia.org/T385141) (owner: Marostegui)
[06:42:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1160.eqiad.wmnet
[06:45:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1160.eqiad.wmnet with reason: Rebuilding index
[06:59:43] (PS2) Anzx: sewikimedia: update wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1124547 (https://phabricator.wikimedia.org/T377921)
[06:59:47] (PS2) Anzx: Lift IP cap for edit-a-thon (Illinois Tech) on 2024-03-27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568)
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T0700)
[07:00:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568) (owner: Anzx)
[07:00:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124547 (https://phabricator.wikimedia.org/T377921) (owner: Anzx)
[07:00:54] SRE, DBA, Datacenter-Switchover: Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default) - https://phabricator.wikimedia.org/T207385#10604058 (Marostegui) Open→Declined No longer...
[07:02:16] (PS1) Marostegui: db1246: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1124650 (https://phabricator.wikimedia.org/T387673)
[07:03:08] (CR) Marostegui: [C:+2] db1246: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1124650 (https://phabricator.wikimedia.org/T387673) (owner: Marostegui)
[07:03:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74071 and previous config saved to /var/cache/conftool/dbconfig/20250305-070321-root.json
[07:03:50] ops-eqiad, SRE, DBA, DC-Ops, Patch-For-Review: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10604064 (Marostegui) I am repooling this host.
[07:18:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74072 and previous config saved to /var/cache/conftool/dbconfig/20250305-071827-root.json
[07:23:18] (CR) Filippo Giunchedi: [C:+1] "Instance looks good, thank you Cole" [puppet] - https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: Cwhite)
[07:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10604071 (phaultfinder)
[07:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:33:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74073 and previous config saved to /var/cache/conftool/dbconfig/20250305-073333-root.json
[07:38:28] (PS1) Muehlenhoff: Add hcoplin to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1124708 (https://phabricator.wikimedia.org/T387459)
[07:40:42] (CR) Muehlenhoff: [C:+2] Add hcoplin to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1124708 (https://phabricator.wikimedia.org/T387459) (owner: Muehlenhoff)
[07:41:37] (CR) Fabfur: [C:+2] cache,haproxy: create tmpfile configuration for tls [puppet] - https://gerrit.wikimedia.org/r/1124403 (https://phabricator.wikimedia.org/T387826) (owner: Fabfur)
[07:43:02] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), No backups: 1 (backup1013), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:46:46] SRE, SRE-Access-Requests, LDAP-Access-Requests, Data-Platform-SRE (2025.03.01 - 2025.03.21), Patch-For-Review: Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10604083 (MoritzMuehlenhoff) Open→Resolved @HCoplin-WMF I've...
[07:46:57] checking backups
[07:48:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74074 and previous config saved to /var/cache/conftool/dbconfig/20250305-074838-root.json
[07:49:55] (PS1) Muehlenhoff: Add wrai to releasers-mobile [puppet] - https://gerrit.wikimedia.org/r/1124709 (https://phabricator.wikimedia.org/T387786)
[07:55:29] (CR) Muehlenhoff: [C:+2] Add wrai to releasers-mobile [puppet] - https://gerrit.wikimedia.org/r/1124709 (https://phabricator.wikimedia.org/T387786) (owner: Muehlenhoff)
[08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T0800).
[08:00:05] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:09] o/
[08:00:16] o/
[08:00:17] good morning
[08:00:26] good morning
[08:00:35] let me check the server logs before we start :)
[08:01:54] looks like those servers are not doing much over night
[08:02:33] anzx: can we really specify IP range as `'192.42.83.144 - 192.42.83.159` ?
[08:03:38] hashar: it was done on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124541
[08:03:42] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to releasers-mobile for WRai-WMF - https://phabricator.wikimedia.org/T387786#10604094 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @WRai-WMF I've just enabled your access, you should now be able to log into release...
[08:03:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74075 and previous config saved to /var/cache/conftool/dbconfig/20250305-080343-root.json
[08:04:00] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604097 (MoritzMuehlenhoff) Open→Stalled
[08:04:05] and I swear I did review/wrote the code handling IP addresses :)
[08:04:46] I ll deploy both at the same time
[08:05:09] ok one minute i will update commit message
[08:05:16] sure
[08:05:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:46] (PS3) Anzx: Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568)
[08:06:00] hashar: done
[08:06:46] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568) (owner: Anzx)
[08:06:46] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124547 (https://phabricator.wikimedia.org/T377921) (owner: Anzx)
[08:07:31] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10604098 (Marostegui) Repooled @wiki_willy I emailed Dell about this host (in the existing thread we have with them) but so far there's been no reply. Do you want to keep this ticket open and...
[08:07:35] (Merged) jenkins-bot: Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1124541 (https://phabricator.wikimedia.org/T387568) (owner: Anzx)
[08:07:40] (Merged) jenkins-bot: sewikimedia: update wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1124547 (https://phabricator.wikimedia.org/T377921) (owner: Anzx)
[08:07:48] the sewikimedia logo would need some url wouldn't it?
[08:08:35] i dont think so
[08:08:40] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1124541|Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 (T387568)]], [[gerrit:1124547|sewikimedia: update wordmark and tagline (T377921)]]
[08:08:42] :)
[08:08:44] T387568: Request list off IP cap Illinois Institute of Technology March 12, 2025 - https://phabricator.wikimedia.org/T387568
[08:08:44] T377921: Wikimedia Sverige logo distorted by new skin - https://phabricator.wikimedia.org/T377921
[08:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:13:43] 08:13:05 Started check-testservers
[08:13:57] !log hashar@deploy2002 hashar, anzx: Backport for [[gerrit:1124541|Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 (T387568)]], [[gerrit:1124547|sewikimedia: update wordmark and tagline (T377921)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:14:01] T387568: Request list off IP cap Illinois Institute of Technology March 12, 2025 - https://phabricator.wikimedia.org/T387568
[08:14:01] T377921: Wikimedia Sverige logo distorted by new skin - https://phabricator.wikimedia.org/T377921
[08:14:02] hashar: logo looks good
[08:14:05] !log hashar@deploy2002 hashar, anzx: Continuing with sync
[08:14:09] you are fast :)
[08:15:46] i opened link and refreshed page trying in Firefox , since wikimediadebug was not working on chrome logo was updated
[08:17:05] what is broken with WikimediaDebug? We did some changes recently :)
[08:17:11] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10604115 (wiki_willy) Hi @Marostegui - thanks for checking. When I look back at previous email from Dell Support sent in November, MarcoAntonio says //"we can temporarily archive the case, an...
[08:19:20] hashar: it still not available in chrome, but on Firefox it's available
[08:19:25] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed & rebooted twice - https://phabricator.wikimedia.org/T387673#10604117 (Marostegui) Thanks @wiki_willy - I thought the email was just a thread and not handled via some internal ticketing system. Let's leave this open for now so we don't forget. If there...
[08:20:43] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124541|Lift IP cap for edit-a-thon (Illinois Tech) on March 12, 2025 (T387568)]], [[gerrit:1124547|sewikimedia: update wordmark and tagline (T377921)]] (duration: 12m 02s)
[08:20:47] T387568: Request list off IP cap Illinois Institute of Technology March 12, 2025 - https://phabricator.wikimedia.org/T387568
[08:20:47] T377921: Wikimedia Sverige logo distorted by new skin - https://phabricator.wikimedia.org/T377921
[08:22:05] hashar: logo change looks good, with wmdebug turnoff , thanks for deploying
[08:22:22] thank you for taking care of those
[08:22:39] anzx: for WikimediaDebug that is because we have done a major migration of its code base (manifest v2 to v3)
[08:22:48] (CR) DCausse: "lgtm but the chart version might need to be updated" [deployment-charts] - https://gerrit.wikimedia.org/r/1124546 (owner: Ebernhardson)
[08:23:04] ok
[08:23:07] (PS1) Muehlenhoff: Bitu: Also point to idm-help@w.o for password resets [software/bitu] - https://gerrit.wikimedia.org/r/1124718
[08:23:08] the new version is under review and somehow the old one got flagged for removal cause it is "obsolete"
[08:23:26] https://phabricator.wikimedia.org/T387822#10603735
[08:23:41] (PS1) Volans: reports/librenms: fix f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/1124719
[08:23:47] and the large task is "upgrade to manifest v3" https://phabricator.wikimedia.org/T312694
[08:24:07] I don't have a workaround for Chrome short of loading the extension from source
[08:24:10] else use Firefox :-]
[08:27:21] i didn't try load from source, i saw task and tried it on Firefox instead
[08:27:39] sounds good :)
[08:27:54] hopefully the extension will be published in the Chrome store soonish
[08:30:52] SRE, LDAP, Patch-For-Review: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10604136 (MoritzMuehlenhoff) >>! In T386472#10573065, @Urbanecm wrote: > Noting @jrbs was added to the group in T220860, in order to be able to run ch...
[08:33:22] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[08:34:13] (PS1) Jcrespo: dbbackups: Prepare backup2013 to take over codfw backups of es* dbs [puppet] - https://gerrit.wikimedia.org/r/1124720 (https://phabricator.wikimedia.org/T387892)
[08:34:53] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604144 (Ben.buchenau) Thanks @MoritzMuehlenhoff , confused my Phabricator with the developer account. Just created a developer account, named Ben.buchenau (ssh access...
[08:34:55] (CR) DCausse: [C:+1] "thanks, good catch!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1122151 (https://phabricator.wikimedia.org/T375520) (owner: Bking)
[08:40:16] (CR) Jcrespo: [C:+2] dbbackups: Prepare backup2013 to take over codfw backups of es* dbs [puppet] - https://gerrit.wikimedia.org/r/1124720 (https://phabricator.wikimedia.org/T387892) (owner: Jcrespo)
[08:41:14] (CR) Ayounsi: [C:+1] reports/librenms: fix f-string [software/netbox-extras] - https://gerrit.wikimedia.org/r/1124719 (owner: Volans)
[08:41:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[08:42:42] (CR) Tiziano Fogli: [C:+2] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[08:44:51] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604156 (MoritzMuehlenhoff)
[08:44:53] (CR) Tiziano Fogli: [C:+2] network_devices: adding device model [cookbooks] - https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[08:44:58] SRE, SRE-tools, Infrastructure-Foundations, IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10604157 (Volans) AFAICS we are still missing the AAAA record on all of the hosts listed in the task description.
[08:45:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [08:45:46] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604160 (10ops-monitoring-bot) Draining ganeti1032.eqiad.wmnet of running VMs [08:46:14] (03PS1) 10Volans: reports/network: update no AAAA records list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124722 [08:46:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [08:46:23] (03CR) 10CI reject: [V:04-1] reports/network: update no AAAA records list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124722 (owner: 10Volans) [08:46:59] (03CR) 10Volans: reports/network: update no AAAA records list (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124722 (owner: 10Volans) [08:47:03] (03CR) 10Volans: [C:03+2] reports/librenms: fix f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124719 (owner: 10Volans) [08:47:48] (03CR) 10Slyngshede: [C:03+1] "LGTM. We probably need to clean up that page a bit, it's getting a little messy." [software/bitu] - 10https://gerrit.wikimedia.org/r/1124718 (owner: 10Muehlenhoff) [08:48:24] PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:49:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to drbd [08:50:08] !log jelto@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:50:15] !log jelto@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[08:50:46] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604164 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet switching disk type to drbd [08:50:49] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:51:09] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:51:30] (03Merged) 10jenkins-bot: network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [08:51:30] (03Merged) 10jenkins-bot: reports/librenms: fix f-string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124719 (owner: 10Volans) [08:52:54] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [08:53:05] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:54:46] !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "network_devices: adding device model - tappof@cumin1002 - T387231" [08:54:51] T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231 [08:55:09] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1124641 (https://phabricator.wikimedia.org/T385141) (owner: 10Marostegui) [08:55:52] !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "network_devices: adding device model - tappof@cumin1002 - T387231" [08:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:58:39] (03CR) 10Ayounsi: pdu_config_netbox: add new module to grab PDUs from netbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124083 
(https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [08:59:09] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [08:59:31] (03CR) 10Ayounsi: pdu_config_netbox: add new module to grab PDUs from netbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [09:00:05] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T0900) [09:00:21] (03CR) 10Muehlenhoff: [C:03+2] Bitu: Also point to idm-help@w.o for password resets [software/bitu] - 10https://gerrit.wikimedia.org/r/1124718 (owner: 10Muehlenhoff) [09:00:45] 10SRE-swift-storage: IPv6 records inconsistent on the ms-be hosts - https://phabricator.wikimedia.org/T320947#10604194 (10Volans) As of today `ms-be2057` is the only host left without AAAA record, all the others have it. It would be great if it could be fixed. [09:01:24] (03Abandoned) 10Volans: reports/network: update no AAAA records list [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124722 (owner: 10Volans) [09:02:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:03:39] ok [09:03:45] * hashar flexes fingers muscles [09:04:07] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[09:04:14] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db1250 [puppet] - 10https://gerrit.wikimedia.org/r/1124641 (https://phabricator.wikimedia.org/T385141) (owner: 10Marostegui) [09:04:55] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:05:05] (03CR) 10Ayounsi: "Make sure to update https://github.com/wikimedia/operations-puppet/blob/08eefaa046b24853b51919047bf7515c315af28c/modules/netbox/types/devi" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [09:05:16] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124726 (https://phabricator.wikimedia.org/T386214) [09:05:17] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124726 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [09:05:23] tchou tchou [09:05:43] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1217.eqiad.wmnet with reason: cloning [09:06:03] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124726 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [09:07:04] !log Stop db1217:3321 to clone db1250 T385141 [09:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:07] T385141: Productionize db125[0-4] - https://phabricator.wikimedia.org/T385141 [09:07:24] (03PS1) 10Federico Ceratto: sre.mysql.pool: fix hostname check logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) [09:07:41] (03CR) 10Volans: [C:03+1] "My bad, I hadn't notice we already had the slug available." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [09:08:58] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1202 gradually with 4 steps - Cloned db1202 to db1253 [09:09:03] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db1202 gradually with 4 steps - Cloned db1202 to db1253 [09:09:41] (03CR) 10Marostegui: [C:03+1] sre.mysql.pool: fix hostname check logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) (owner: 10Federico Ceratto) [09:09:48] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:10:14] ^expected [09:10:34] PROBLEM - Host aux-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:12:47] (03PS1) 10Volans: CI: future-proof prospector config [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124729 [09:13:56] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:14:01] 10SRE-swift-storage: IPv6 records inconsistent on the ms-be hosts - https://phabricator.wikimedia.org/T320947#10604227 (10MatthewVernon) I expect it to be refreshed in Q1 or maybe Q2 (purchase date was 2020-08-11). 
[09:14:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to drbd [09:14:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [09:14:46] (03CR) 10Jgiannelos: [C:03+2] pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: 10Jgiannelos) [09:14:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604229 (10ops-monitoring-bot) Draining ganeti1032.eqiad.wmnet of running VMs [09:15:02] RECOVERY - Host aux-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [09:15:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [09:15:12] !log upgrade to karma 0.120 - T353457 [09:15:15] godog: Failed to log message to wiki. Somebody should check the error logs. 
[09:15:16] T353457: Karma UI shows duplicate alerts - https://phabricator.wikimedia.org/T353457 [09:15:33] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.19 refs T386214 [09:15:36] (03PS1) 10Muehlenhoff: Switch ganeti1032 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1124730 [09:15:36] T386214: 1.44.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T386214 [09:15:47] (03CR) 10Ayounsi: [C:03+2] Add exporter port to gNMI metrics instance label [puppet] - 10https://gerrit.wikimedia.org/r/1122955 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [09:15:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to plain [09:16:10] (03Merged) 10jenkins-bot: pcs: Enable more wikis for native PCS pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122902 (https://phabricator.wikimedia.org/T387277) (owner: 10Jgiannelos) [09:16:14] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604235 (10ops-monitoring-bot) VM aux-k8s-etcd1003.eqiad.wmnet switching disk type to plain [09:16:24] RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [09:16:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd1003.eqiad.wmnet to plain [09:17:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd [09:17:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:17:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604237 (10ops-monitoring-bot) VM dse-k8s-etcd1001.eqiad.wmnet switching disk type to drbd [09:18:53] !log deploy new backup grants for es1036,es1040 T387892 [09:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:56] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [09:19:22] (03CR) 10DCausse: [C:03+1] cloudelastic: begin transition to opensearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [09:19:39] (03PS1) 10Tiziano Fogli: Revert "network_devices: adding device model" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124731 [09:19:56] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:20:42] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:20:59] logs are happy at least [09:22:35] federico3: your changes to db1202 aren't committed [09:22:38] federico3: can you check? 
[09:22:43] (03PS1) 10Muehlenhoff: Remove access to logstash for cn=wmf [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) [09:22:44] (the alert above) [09:22:45] FIRING: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:23:05] looking [09:23:08] !log deploy new backup grants for es2036,es2040 T387892 [09:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:10] yes, the pooling-in cookbook just tripped on the comma again [09:25:12] (03PS1) 10Filippo Giunchedi: alertmanager: remove 'default' receiver when duplicated [puppet] - 10https://gerrit.wikimedia.org/r/1124733 (https://phabricator.wikimedia.org/T353457) [09:26:27] federico3: you can commit them manually to clear the alert for now if you like [09:26:58] give me 1 minute [09:27:28] (03CR) 10Ayounsi: [C:03+1] CI: future-proof prospector config [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124729 (owner: 10Volans) [09:27:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to drbd [09:27:34] PROBLEM - Host dse-k8s-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:28:07] (03CR) 10Volans: [C:03+2] CI: future-proof prospector config [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124729 (owner: 10Volans) [09:28:19] RECOVERY - Host dse-k8s-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [09:28:38] (03CR) 10Ayounsi: [C:03+1] "nice!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1124396 (owner: 10Volans) [09:29:17] (03PS1) 10Tiziano Fogli: fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734 [09:29:19] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [09:30:05] (03Merged) 10jenkins-bot: CI: future-proof prospector config [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1124729 (owner: 10Volans) [09:30:22] (03CR) 10Ayounsi: [C:03+1] "nice!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans) [09:30:39] (03PS3) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) [09:30:44] (03PS2) 10Federico Ceratto: sre.mysql.pool: allow merging unexpected changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) [09:30:53] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1202 gradually with 4 steps - Cloned db1202 to db1253 [09:31:13] (03CR) 10Marostegui: [C:03+1] sre.mysql.pool: allow merging unexpected changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) (owner: 10Federico Ceratto) [09:31:31] marostegui: committing manually [09:31:35] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1202 gradually with 4 steps - Cloned db1202 to db1253 [09:32:09] federico3: thanks [09:32:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Cloned db1202 to db1253', diff saved to https://phabricator.wikimedia.org/P74077 and previous config saved to /var/cache/conftool/dbconfig/20250305-093249-fceratto.json [09:32:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74078 and previous config 
saved to /var/cache/conftool/dbconfig/20250305-093254-root.json [09:32:58] (03CR) 10Muehlenhoff: Remove access to logstash for cn=wmf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [09:33:02] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1202 gradually with 4 steps - Cloned db1202 to db1253 [09:34:19] ^ the puppet alert from above is being worked on [09:34:36] (03PS1) 10Volans: sre.puppet.sync-netbox-hiera: add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/1124735 [09:34:57] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:35:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [09:35:33] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604341 (10ops-monitoring-bot) Draining ganeti1032.eqiad.wmnet of running VMs [09:35:43] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:35:44] (03CR) 10Filippo Giunchedi: [C:03+1] Remove access to logstash for cn=wmf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [09:35:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [09:36:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain [09:36:44] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604354 (10ops-monitoring-bot) VM 
dse-k8s-etcd1001.eqiad.wmnet switching disk type to plain [09:36:45] (03PS2) 10Tiziano Fogli: fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734 [09:36:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of dse-k8s-etcd1001.eqiad.wmnet to plain [09:37:31] (03CR) 10Ayounsi: [C:03+1] sre.puppet.sync-netbox-hiera: add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/1124735 (owner: 10Volans) [09:37:35] (03PS2) 10Muehlenhoff: Remove access to logstash for cn=wmf [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) [09:37:52] (03CR) 10Muehlenhoff: Remove access to logstash for cn=wmf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [09:38:29] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM! Thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [09:38:40] (03PS1) 10Slyngshede: Show existing approvals on permission approval pages [software/bitu] - 10https://gerrit.wikimedia.org/r/1124736 [09:38:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [09:38:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [09:39:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1032.eqiad.wmnet [09:39:15] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10604369 (10ops-monitoring-bot) Draining ganeti1032.eqiad.wmnet of running VMs [09:39:18] (03PS3) 10Tiziano Fogli: fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734 [09:39:43] (03CR) 10JMeybohm: [C:03+1] miscweb: add support for external-services 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1123738 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [09:39:54] (03CR) 10JMeybohm: [C:03+2] validating-admission-policies: Be more explicit in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124415 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [09:41:01] (03Merged) 10jenkins-bot: validating-admission-policies: Be more explicit in tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124415 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [09:42:52] (03CR) 10Federico Ceratto: [C:03+1] sre.mysql.pool: allow merging unexpected changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) (owner: 10Federico Ceratto) [09:42:58] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.pool: allow merging unexpected changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124728 (https://phabricator.wikimedia.org/T378572) (owner: 10Federico Ceratto) [09:46:54] (03CR) 10Ayounsi: [C:03+1] fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734 (owner: 10Tiziano Fogli) [09:47:07] (03CR) 10Volans: [C:03+2] sre.puppet.sync-netbox-hiera: add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/1124735 (owner: 10Volans) [09:47:17] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [09:47:59] (03CR) 10Tiziano Fogli: [C:03+2] fix: netbox network_devices type [puppet] - 10https://gerrit.wikimedia.org/r/1124734 (owner: 10Tiziano Fogli) [09:48:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74081 and previous config saved to /var/cache/conftool/dbconfig/20250305-094759-root.json [09:53:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604396 (10Aklapper) @Ben.buchenau Feel free to [connect](https://phabricator.wikimedia.org/settings/panel/external/) your LDAP/developer account to [your Phab account](... [09:54:22] (03Merged) 10jenkins-bot: sre.puppet.sync-netbox-hiera: add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/1124735 (owner: 10Volans) [09:55:13] (03PS1) 10Jcrespo: dbbackups: Migrate es backups from backup[12]02 to backup[12]13 [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892) [09:55:44] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) db1202 gradually with 4 steps - Cloned db1202 to db1253 [09:56:14] (03Abandoned) 10Tiziano Fogli: Revert "network_devices: adding device model" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124731 (owner: 10Tiziano Fogli) [09:56:49] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:56:55] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:57:51] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [09:58:05] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [09:58:09] !log 
volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [09:58:37] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [09:58:46] (03PS24) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [09:59:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604417 (10MoritzMuehlenhoff) @Ben.buchenau You don't seem to have an NDA on record yet. I'm adding @KFrancis from the Wikimedia Legal department to set this up. [09:59:21] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10604418 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:00:17] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [10:02:27] (03CR) 10Jcrespo: "heads up of this migration, will test it before the end of today ^" [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [10:02:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:03:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74083 and previous config saved to /var/cache/conftool/dbconfig/20250305-100304-root.json [10:03:28] (03PS2) 10Jcrespo: dbbackups: Migrate es backups from backup[12]02 to backup[12]13 [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892) [10:05:11] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 14912MiB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [10:05:22] (03PS25) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [10:05:23] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [10:05:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:32] (03CR) 10Federico Ceratto: clone.py, clone_test.py: 
Automate cloning (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [10:06:46] (03CR) 10Filippo Giunchedi: "The alert is firing atm for ctrl and worker for k8s-aux (https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=" [alerts] - 10https://gerrit.wikimedia.org/r/1124453 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [10:12:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:13:46] (03PS1) 10Federico Ceratto: instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254 [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) [10:17:45] RESOLVED: [4x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:18:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74084 and previous config saved to /var/cache/conftool/dbconfig/20250305-101810-root.json [10:20:26] (03PS2) 10JMeybohm: mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 [10:23:03] (03PS1) 10Filippo Giunchedi: prometheus: split envoy rules into separate groups [puppet] - 10https://gerrit.wikimedia.org/r/1124743 (https://phabricator.wikimedia.org/T387965) [10:23:05] (03PS12) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [10:23:08] (03PS25) 
10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [10:25:39] (03PS1) 10Arturo Borrero Gonzalez: toolforge: haproxy: don't log normal connections [puppet] - 10https://gerrit.wikimedia.org/r/1124744 [10:26:22] hashar: may I use the rest of your window for a shallbox change? [10:26:27] shellbox, lol [10:26:37] (03CR) 10David Caro: [C:03+2] toolforge: haproxy: don't log normal connections [puppet] - 10https://gerrit.wikimedia.org/r/1124744 (owner: 10Arturo Borrero Gonzalez) [10:27:11] “you shallbox pass”? ^^ [10:27:29] (03CR) 10Tchanders: [C:03+1] CommonSettings.php: Remove $wgSecurePollGPGCommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124514 (owner: 10Reedy) [10:28:25] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: haproxy: don't log normal connections [puppet] - 10https://gerrit.wikimedia.org/r/1124744 (owner: 10Arturo Borrero Gonzalez) [10:29:35] (03PS1) 10Máté Szabó: Remove unused $wgSecurePollGPGCommand setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124745 (https://phabricator.wikimedia.org/T380441) [10:30:57] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet [10:31:15] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2008.codfw.wmnet [10:32:51] !log restart kube-apiserver on ml-staging-ctrl200[12] after the move to containerd (some issues regisstered) [10:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74085 and previous config saved to /var/cache/conftool/dbconfig/20250305-103316-root.json [10:37:24] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:38:18] (03PS1) 10Filippo Giunchedi: prometheus: trial moving k8s-mlstaging to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) [10:38:24] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2008.codfw.wmnet [10:38:26] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet [10:38:35] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:44:12] (03CR) 10Effie Mouzeli: [C:03+1] "The different math approaches have been discussed already, I have no strong opinions towards one or the other approach, so I think that ov" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [10:45:30] (03PS3) 10JMeybohm: mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 [10:45:39] (03CR) 10Hnowlan: [C:03+2] mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [10:45:56] (03CR) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [10:49:09] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [10:51:17] (03CR) 
10Hnowlan: [C:03+2] mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [10:51:31] (03CR) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [10:51:47] (03PS4) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) [10:53:09] (03PS5) 10Hnowlan: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) [10:54:03] (03CR) 10Hnowlan: [C:03+2] mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [10:55:28] (03Merged) 10jenkins-bot: mw-api-ext, mw-web: right-size clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122561 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [10:55:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74086 and previous config saved to /var/cache/conftool/dbconfig/20250305-105534-root.json [10:56:01] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:56:11] FIRING: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [10:56:13] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:57:06] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply 
[10:57:19] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:57:36] (03PS1) 10Filippo Giunchedi: prometheus: add sync-data script [puppet] - 10https://gerrit.wikimedia.org/r/1124749 (https://phabricator.wikimedia.org/T383232) [10:57:52] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:57:59] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:57:59] (03PS4) 10JMeybohm: mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1100) [11:01:11] RESOLVED: Temperature: Temp issue on wdqs1021:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1021 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [11:07:36] (03PS1) 10Filippo Giunchedi: prometheus: replace prometheus::migration with prometheus-sync-data [puppet] - 10https://gerrit.wikimedia.org/r/1124751 (https://phabricator.wikimedia.org/T383232) [11:07:49] !log elukey@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: no reason specified, no task ID specified] [11:07:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: no reason specified, no task ID specified] [11:09:00] (03PS2) 10Elukey: profile::dns::auth::discovery-map: prefer codfw over eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122627 (https://phabricator.wikimedia.org/T380858) [11:09:30] (03PS1) 10Effie Mouzeli: shellbox-video: disable debug logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124752 [11:10:35] (03CR) 10Filippo Giunchedi: "I'll be merging this early next week" [puppet] - 
10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [11:10:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74087 and previous config saved to /var/cache/conftool/dbconfig/20250305-111040-root.json [11:11:35] (03CR) 10Elukey: [C:03+2] profile::dns::auth::discovery-map: prefer codfw over eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1122627 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [11:13:31] (03PS2) 10Tiziano Fogli: network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124741 (https://phabricator.wikimedia.org/T387231) [11:16:07] (03CR) 10Ayounsi: [C:03+1] network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124741 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [11:20:45] PROBLEM - Hadoop NodeManager on an-worker1168 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:23:26] (03CR) 10Tiziano Fogli: [C:03+2] network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124741 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [11:23:45] RECOVERY - Hadoop NodeManager on an-worker1168 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:25:11] (03PS5) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) [11:25:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 50%: Repooling 
after rebuild index', diff saved to https://phabricator.wikimedia.org/P74088 and previous config saved to /var/cache/conftool/dbconfig/20250305-112545-root.json [11:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:29:13] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:29:15] (03CR) 10Btullis: "I agree with elukey here. We don't need a new partition recipe for the dse-k8s control plane nodes, so you can simply abandon this change." [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene) [11:29:25] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:29:38] (03Merged) 10jenkins-bot: network_devices: adding device model [cookbooks] - 10https://gerrit.wikimedia.org/r/1124741 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [11:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10604726 (10phaultfinder) [11:31:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74089 and previous config saved to /var/cache/conftool/dbconfig/20250305-113126-root.json [11:32:44] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet [11:34:35] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:34:48] !log tappof@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: 
"network_devices: adding device model - tappof@cumin1002 - T387231" [11:34:52] T387231: missing pdu infos for magru - https://phabricator.wikimedia.org/T387231 [11:35:04] !log tappof@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "network_devices: adding device model - tappof@cumin1002 - T387231" [11:37:24] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:38:05] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet [11:38:35] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:38:51] (03CR) 10CI reject: [V:04-1] services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: 10Jelto) [11:40:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74090 and previous config saved to /var/cache/conftool/dbconfig/20250305-114051-root.json [11:46:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74091 and previous config saved to /var/cache/conftool/dbconfig/20250305-114632-root.json [11:50:06] (03CR) 10Clément Goubert: [C:03+1] mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 (owner: 10JMeybohm) [11:50:13] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: 
PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:55:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74092 and previous config saved to /var/cache/conftool/dbconfig/20250305-115557-root.json [12:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1100) [12:00:05] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1200). [12:01:31] (03PS2) 10Slyngshede: Upgrade idp-test to 7.1.4 [dns] - 10https://gerrit.wikimedia.org/r/1124376 [12:01:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74093 and previous config saved to /var/cache/conftool/dbconfig/20250305-120138-root.json [12:02:52] (03CR) 10Slyngshede: [C:03+2] Upgrade idp-test to 7.1.4 [dns] - 10https://gerrit.wikimedia.org/r/1124376 (owner: 10Slyngshede) [12:03:07] !log slyngshede@dns1004 START - running authdns-update [12:05:15] !log slyngshede@dns1004 END - running authdns-update [12:07:17] (03Abandoned) 10Stevemunene: Create dse-k8s control panel partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene) [12:07:26] (03CR) 10Stevemunene: "Ack, Thanks @btullis@wikimedia.org and @ltoscano@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene) [12:09:15] (03PS1) 10Elukey: profile::dns::auth::discovery-map: fix eqiad private config [puppet] - 10https://gerrit.wikimedia.org/r/1124755 
(https://phabricator.wikimedia.org/T380858) [12:09:28] heads up, i am planning to deploy changeprop for T387277 [12:09:29] T387277: Rollout more wikis after week 1 of testing with production traffic - https://phabricator.wikimedia.org/T387277 [12:09:54] (03CR) 10Giuseppe Lavagetto: [C:03+1] profile::dns::auth::discovery-map: fix eqiad private config [puppet] - 10https://gerrit.wikimedia.org/r/1124755 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [12:10:17] (03PS1) 10Arturo Borrero Gonzalez: toolforge: haproxy: check ingress workers with the /healthz endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1124756 (https://phabricator.wikimedia.org/T387959) [12:10:32] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [12:10:45] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [12:11:09] (03CR) 10Cathal Mooney: [C:03+2] Update policy for K8s BGP to allow a wider range of v4 prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/1121438 (https://phabricator.wikimedia.org/T375845) (owner: 10Cathal Mooney) [12:11:26] (03CR) 10Elukey: [C:03+2] profile::dns::auth::discovery-map: fix eqiad private config [puppet] - 10https://gerrit.wikimedia.org/r/1124755 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [12:12:44] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [12:13:24] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [12:13:32] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [12:13:42] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [12:13:46] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [12:14:44] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [12:15:03] (03CR) 10Ladsgroup: [C:03+1] "thank you! 
I can't believe I'm seeing this day <3" [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [12:16:13] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:16:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74094 and previous config saved to /var/cache/conftool/dbconfig/20250305-121643-root.json [12:16:46] (03CR) 10David Caro: [C:03+1] toolforge: haproxy: check ingress workers with the /healthz endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1124756 (https://phabricator.wikimedia.org/T387959) (owner: 10Arturo Borrero Gonzalez) [12:17:46] (03PS1) 10Gergő Tisza: CentralAuth: Enable SUL3 signup on group 0 (attempt 4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124757 (https://phabricator.wikimedia.org/T384007) [12:17:57] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122999 (owner: 10PipelineBot) [12:19:24] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122999 (owner: 10PipelineBot) [12:19:56] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:20:24] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:20:40] rzl: I finally got around to fixing the CentralAuth multi-DC patch ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123029 ). Should I schedule it in a puppet or infra window, or can it go through normal code review? In the latter case, do you know who I should add as a reviewer? 
[12:21:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [12:21:38] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:21:45] (03PS1) 10Slyngshede: Add missing secrets for OIDC in IDP [labs/private] - 10https://gerrit.wikimedia.org/r/1124758 [12:22:02] (03PS1) 10Ladsgroup: Enable thumbnail steps in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124759 (https://phabricator.wikimedia.org/T360589) [12:22:19] (03CR) 10Slyngshede: [C:03+2] Add missing secrets for OIDC in IDP [labs/private] - 10https://gerrit.wikimedia.org/r/1124758 (owner: 10Slyngshede) [12:22:37] (03CR) 10Slyngshede: [V:03+2 C:03+2] Add missing secrets for OIDC in IDP [labs/private] - 10https://gerrit.wikimedia.org/r/1124758 (owner: 10Slyngshede) [12:22:39] (03PS1) 10Ladsgroup: maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124761 (https://phabricator.wikimedia.org/T351953) [12:22:42] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:22:52] (03PS1) 10Ladsgroup: maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124762 (https://phabricator.wikimedia.org/T351953) [12:23:24] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:23:24] jouncebot: nownandnext [12:23:33] jouncebot: nowandnext [12:23:34] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1100) [12:23:34] For the next 0 hour(s) and 36 minute(s): Services – 
Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1200) [12:23:34] In 1 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400) [12:23:55] (03CR) 10Ladsgroup: [C:03+2] maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124761 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [12:23:55] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:24:01] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5029/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:24:06] (03CR) 10Ladsgroup: [C:03+2] maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124762 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [12:24:37] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:24:53] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5030/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:25:11] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:25:44] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5031/co" [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:26:27] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5032/console" [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [12:27:25] jouncebot: nowandnext [12:27:26] For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1100) [12:27:26] For the next 0 hour(s) and 32 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1200) [12:27:26] In 1 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400) [12:27:33] (03CR) 10Muehlenhoff: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1124479 (owner: 10Ahmon Dancy) [12:27:35] (03CR) 10Muehlenhoff: [C:03+2] envoy: Update examples [puppet] - 10https://gerrit.wikimedia.org/r/1124479 (owner: 10Ahmon Dancy) [12:27:44] (03PS1) 10Muehlenhoff: Add component/lshw on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1124763 (https://phabricator.wikimedia.org/T380295) [12:27:47] (03PS1) 10Muehlenhoff: Install lshw backport from component/lshw [puppet] - 10https://gerrit.wikimedia.org/r/1124764 (https://phabricator.wikimedia.org/T380295) [12:27:59] please avoid doing any scap deploys during this window [12:28:55] (03PS3) 10Slyngshede: C:apereo_cas Specify encryption algorithms for CAS 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) [12:30:50] Sure. 
[12:31:10] Thanks for the info [12:31:12] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:31:41] (03PS1) 10Hnowlan: trafficserver: fix hostnames for citoid requests [puppet] - 10https://gerrit.wikimedia.org/r/1124766 (https://phabricator.wikimedia.org/T361576) [12:31:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2166 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74095 and previous config saved to /var/cache/conftool/dbconfig/20250305-123149-root.json [12:31:55] 06SRE, 06Infrastructure-Foundations, 06serviceops, 07Kubernetes: Remove `.cluster.local.` suffix in PTR responses - https://phabricator.wikimedia.org/T376762#10604979 (10MoritzMuehlenhoff) [12:32:32] (03CR) 10Mvolz: [C:03+1] trafficserver: fix hostnames for citoid requests [puppet] - 10https://gerrit.wikimedia.org/r/1124766 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [12:32:45] 06SRE, 06SRE Observability, 13Patch-For-Review: etcd: adapt etcd-backup.py for etcd 3.4 - https://phabricator.wikimedia.org/T385727#10604981 (10MoritzMuehlenhoff) [12:33:20] (03PS1) 10Effie Mouzeli: shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038) [12:33:34] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: sqlite::db can get stuck on zero byte file database - https://phabricator.wikimedia.org/T387112#10604983 (10MoritzMuehlenhoff) [12:35:20] !log restart envoy/swift on ms-fe2010 [12:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:59] (03PS1) 10Dreamy Jazz: Temporarily unset 
temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124768 (https://phabricator.wikimedia.org/T387205) [12:36:40] (03CR) 10Klausman: [C:03+1] prometheus: trial moving k8s-mlstaging to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [12:36:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124768 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [12:36:44] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122997 (owner: 10PipelineBot) [12:36:59] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122927 (owner: 10PipelineBot) [12:37:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124761 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [12:37:08] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122976 (owner: 10PipelineBot) [12:37:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124762 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [12:37:24] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:38:39] (03PS1) 10Btullis: Replace the production SSH key for btullis [puppet] - 
10https://gerrit.wikimedia.org/r/1124769 (https://phabricator.wikimedia.org/T385943) [12:38:40] (03Merged) 10jenkins-bot: maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124761 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [12:38:45] (03Merged) 10jenkins-bot: maintenance: Also check for utf-8 encoding in findBadBlobs [core] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124762 (https://phabricator.wikimedia.org/T351953) (owner: 10Ladsgroup) [12:39:16] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] [12:39:20] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [12:39:35] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:40:04] Amir1: hnowlan asked for no deploys during this window. 
[12:40:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124764 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff) [12:41:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [12:42:40] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:44:15] Amir1: please wait for 15 minutes or so if possible [12:45:57] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:49:52] (03CR) 10Jforrester: [C:03+1] "Eurgh. Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: 10Daimona Eaytoy) [12:50:01] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:50:07] shit [12:50:20] aborted [12:54:24] (03PS1) 10Elukey: Revert "profile::dns::auth::discovery-map: prefer codfw over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124772 [12:54:35] (03CR) 10CI reject: [V:04-1] Revert "profile::dns::auth::discovery-map: prefer codfw over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124772 (owner: 10Elukey) [12:55:04] (03Abandoned) 10Elukey: Revert "profile::dns::auth::discovery-map: prefer codfw over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124772 (owner: 10Elukey) [12:55:40] (03CR) 10Stevemunene: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1124769 (https://phabricator.wikimedia.org/T385943) (owner: 10Btullis) [12:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:01:15] (03CR) 10Tiziano Fogli: [C:03+2] "Suggestions have been applied in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1124741." [cookbooks] - 10https://gerrit.wikimedia.org/r/1124142 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:02:33] (03CR) 10Muehlenhoff: [C:03+1] "SSH has been confirmed via out-of-band channel (Slack)" [puppet] - 10https://gerrit.wikimedia.org/r/1124769 (https://phabricator.wikimedia.org/T385943) (owner: 10Btullis) [13:02:51] (03CR) 10Btullis: [C:03+2] Replace the production SSH key for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1124769 (https://phabricator.wikimedia.org/T385943) (owner: 10Btullis) [13:03:44] (03PS13) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [13:03:44] (03PS26) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [13:03:45] (03CR) 10JMeybohm: [C:03+2] mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 (owner: 10JMeybohm) [13:03:50] (03PS1) 10Elukey: Revert "profile::dns::auth::discovery-map: fix eqiad private config" [puppet] - 10https://gerrit.wikimedia.org/r/1124773 [13:03:58] (03CR) 10Elukey: [V:03+2 C:03+2] Revert "profile::dns::auth::discovery-map: fix eqiad private config" [puppet] - 10https://gerrit.wikimedia.org/r/1124773 (owner: 10Elukey) [13:04:22] (03PS1) 10Elukey: Revert "profile::dns::auth::discovery-map: prefer codfw 
over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124774 [13:05:34] (03PS1) 10Hnowlan: mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 (https://phabricator.wikimedia.org/T380858) [13:05:45] (03PS2) 10Hnowlan: mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 (https://phabricator.wikimedia.org/T380858) [13:06:11] hey folks, I am reverting back eqiad to its pooled state, should be ready in 10 mins [13:06:21] (03Merged) 10jenkins-bot: mediawiki: Fix envvars with values evaluating to false [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124478 (owner: 10JMeybohm) [13:06:30] (03PS1) 10Filippo Giunchedi: sre: limit netbox reports alerts to eqiad and codfw [alerts] - 10https://gerrit.wikimedia.org/r/1124776 (https://phabricator.wikimedia.org/T350694) [13:06:51] (03CR) 10Clément Goubert: [C:03+1] mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [13:07:36] (03CR) 10Hnowlan: [C:03+2] mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 (https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [13:08:11] (03CR) 10Elukey: [C:03+2] profile::dns::auth::discovery-map: prefer codfw over eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1122627 (https://phabricator.wikimedia.org/T380858) (owner: 10Elukey) [13:08:33] (03CR) 10Elukey: [C:03+2] Revert "profile::dns::auth::discovery-map: prefer codfw over eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/1124774 (owner: 10Elukey) [13:09:13] (03Merged) 10jenkins-bot: mw-(web|api-int|api-ext): scale down, correct messages after test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124775 
(https://phabricator.wikimedia.org/T380858) (owner: 10Hnowlan) [13:10:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1032.eqiad.wmnet [13:11:07] (03CR) 10Tiziano Fogli: "This will be tested on Pontoon." [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:11:52] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] [13:11:56] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [13:12:57] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1032.eqiad.wmnet with reason: remove from cluster for reimage [13:13:03] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10605054 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=836a9ab9-c457-4a78-ab8b-24d0332b99af) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[13:13:44] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1032 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1124730 (owner: 10Muehlenhoff) [13:15:04] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:16:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: 10Abijeet Patro) [13:16:32] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [13:17:06] (03PS1) 10Gergő Tisza: CentralAuthIdLookup: Reuse cached object on single-value lookup [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124781 (https://phabricator.wikimedia.org/T379909) [13:17:12] (03PS1) 10Gergő Tisza: CentralAuthIdLookup: Use primary DB after writes [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124782 (https://phabricator.wikimedia.org/T379909) [13:17:16] (03PS1) 10Gergő Tisza: Use UserOptionsManager for SUL3 rollout flag [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124783 (https://phabricator.wikimedia.org/T384549) [13:17:18] (03PS1) 10Gergő Tisza: Make SUL3 global preference optional and simplify logic [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124784 [13:17:18] (03PS1) 10Gergő Tisza: Add passive central domain to edge login list [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124785 (https://phabricator.wikimedia.org/T375796) [13:17:20] (03PS1) 10Gergő Tisza: SUL3: Use a central wiki for autologin [extensions/CentralAuth] 
(wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124786 (https://phabricator.wikimedia.org/T387357) [13:17:24] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:18:43] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1032.eqiad.wmnet [13:19:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124781 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza) [13:19:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124782 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza) [13:20:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124783 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [13:20:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124784 (owner: 10Gergő Tisza) [13:20:20] (03CR) 10Hnowlan: [C:03+1] shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 
(https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [13:20:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124785 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza) [13:20:48] (03CR) 10Effie Mouzeli: [C:03+2] shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [13:20:51] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124786 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza) [13:20:57] (03CR) 10Hnowlan: [C:03+2] trafficserver: fix hostnames for citoid requests [puppet] - 10https://gerrit.wikimedia.org/r/1124766 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [13:22:07] !log elukey@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: Repool eqiad after maintenance, no task ID specified] [13:22:42] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: Repool eqiad after maintenance, no task ID specified] [13:23:03] (03PS2) 10Effie Mouzeli: shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038) [13:23:10] eqiad is back into serving traffic, maintenance finished, thanks all! 
[13:23:11] (03CR) 10Effie Mouzeli: shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [13:23:22] (03PS1) 10Filippo Giunchedi: data-engineering: remove legacy eventlogging alerts [alerts] - 10https://gerrit.wikimedia.org/r/1124787 (https://phabricator.wikimedia.org/T238230) [13:23:24] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] (duration: 11m 31s) [13:23:27] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [13:24:30] (03CR) 10Filippo Giunchedi: "Please review if I got everything, or we can nuke all eventlogging alerts altogether?" [alerts] - 10https://gerrit.wikimedia.org/r/1124787 (https://phabricator.wikimedia.org/T238230) (owner: 10Filippo Giunchedi) [13:24:41] (03Merged) 10jenkins-bot: shellbox-media: switch main to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124767 (https://phabricator.wikimedia.org/T377038) (owner: 10Effie Mouzeli) [13:25:47] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [13:26:16] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [13:26:33] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [13:26:52] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [13:27:53] (03PS1) 10Filippo Giunchedi: sre: deploy thumbor alerts to prometheus k8s [alerts] - 10https://gerrit.wikimedia.org/r/1124788 (https://phabricator.wikimedia.org/T379559) [13:27:57] !log klausman@deploy2002 conftool action : set/pooled=yes; selector: name=inference [13:28:10] !log 
klausman@deploy2002 conftool action : set/pooled=yes; selector: name=inference-staging [13:34:35] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:34:54] 07sre-alert-triage, 10SRE Observability (FY2024/2025-Q3): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#10605171 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolved in the meantime [13:35:26] (03CR) 10Slyngshede: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1124776 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi) [13:36:12] RESOLVED: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:39:35] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:40:36] (03PS1) 10Filippo Giunchedi: sre: open tasks for long standing lint problems [alerts] - 10https://gerrit.wikimedia.org/r/1124790 (https://phabricator.wikimedia.org/T309182) [13:40:47] (03CR) 10Filippo Giunchedi: [C:03+2] sre: limit netbox reports alerts to eqiad and codfw [alerts] - 10https://gerrit.wikimedia.org/r/1124776 (https://phabricator.wikimedia.org/T350694) (owner: 10Filippo Giunchedi) [13:41:55] (03CR) 10Ayounsi: [C:03+2] Duplicate gNMI BGP session state to metric with peer_descr as instance [puppet] - 
10https://gerrit.wikimedia.org/r/1122957 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [13:42:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [13:46:27] FIRING: [2x] HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:48:28] (03CR) 10Ssingh: [C:03+1] "Thank you." [puppet] - 10https://gerrit.wikimedia.org/r/1124763 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff) [13:48:46] (03PS9) 10Jelto: services: refactor helmfiles for helmfile 0.171.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) [13:49:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2154 db1167', diff saved to https://phabricator.wikimedia.org/P74096 and previous config saved to /var/cache/conftool/dbconfig/20250305-134936-marostegui.json [13:50:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Index rebuild [13:50:38] jouncebot: nowandnext [13:50:38] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [13:50:38] In 0 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400) [13:51:02] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1167.eqiad.wmnet [13:51:07] wow that's packed, I do mine later then, going for lunch [13:51:08] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2154.codfw.wmnet [13:53:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ganeti1032.eqiad.wmnet [13:53:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1032.eqiad.wmnet [13:53:17] (03CR) 10Ssingh: [C:03+1] "Looks good, thanks! While safe to roll out, let us know if we should do it. (It's only fair after you did the patch :)))" [puppet] - 10https://gerrit.wikimedia.org/r/1124764 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff) [13:55:13] (03CR) 10KartikMistry: Enable CX unified dashboard on phase 2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124464 (https://phabricator.wikimedia.org/T387820) (owner: 10Sbisson) [13:55:35] (03CR) 10Marostegui: "Why was db1253 in s2?" [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [13:57:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1167.eqiad.wmnet [13:58:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2154.codfw.wmnet [13:58:32] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2154.codfw.wmnet with reason: Index rebuild [13:58:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Index rebuild [13:59:35] (03PS1) 10Ayounsi: Also exclude Private-Peer from remote_instance:gnmi_bgp_neighbor_session_state [puppet] - 10https://gerrit.wikimedia.org/r/1124795 (https://phabricator.wikimedia.org/T387287) [13:59:43] (03PS1) 10Hnowlan: shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124796 [14:00:05] Urbanecm and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400). 
[14:00:05] zip, dbrant, Daimona, Dreamy_Jazz, and abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] present [14:00:10] \o [14:00:11] o/ [14:00:13] o/ [14:01:00] Quick question, which server should I be using in the debugging extension to check my stuff? [14:02:12] (03PS1) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) [14:02:14] I think the `k8s-mwdebug` server would be good. [14:02:19] PROBLEM - MariaDB Replica Lag: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:02:33] (do folx need a deployer or are you self-serving?) [14:02:54] I can self-serve, but not sure if everyone on the schedule can self-serve [14:03:02] (03CR) 10Muehlenhoff: [C:03+2] Remove access to logstash for cn=wmf [puppet] - 10https://gerrit.wikimedia.org/r/1124732 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [14:03:25] (03CR) 10Federico Ceratto: "dbctl cookbook, initial version" [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [14:03:26] I think I have requisite privs but also this is my first deploy, or at least, my first in so long I don't remember [14:03:51] It probably makes sense to combine the config changes anyway into one backport [14:05:27] I can start with your change zip. 
The task description at https://phabricator.wikimedia.org/T378834 doesn't say that the wikis are ready yet, but I see that the latest comment said they were [14:05:54] please ping if folx need anything, I am somewhat-around :) [14:05:57] yup, my understanding is we are good to go [14:06:00] * zip waves at TheresNoTime [14:06:06] o/ [14:06:10] \o [14:07:08] abijeet: You around for the window? [14:07:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124480 (https://phabricator.wikimedia.org/T378834) (owner: 10Zoe) [14:07:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124500 (owner: 10Dbrant) [14:07:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: 10Daimona Eaytoy) [14:07:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124549 (https://phabricator.wikimedia.org/T387943) (owner: 10Daimona Eaytoy) [14:07:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124768 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [14:08:16] Going to deploy all but abijeet's change in one go to make it quicker. I didn't see anything particularly risky in any of these changes, so shouldn't need to stop at the test stage. [14:08:30] (03Merged) 10jenkins-bot: Set Flow to read-only on remaining phase 2a wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124480 (https://phabricator.wikimedia.org/T378834) (owner: 10Zoe) [14:08:34] (03Merged) 10jenkins-bot: Remove unused config parameters from ReadingLists extension. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124500 (owner: 10Dbrant) [14:08:37] (03Merged) 10jenkins-bot: Use namespaced Title and Html classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124548 (https://phabricator.wikimedia.org/T166010) (owner: 10Daimona Eaytoy) [14:08:39] (03Merged) 10jenkins-bot: officewiki: Disable the event-organizer user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124549 (https://phabricator.wikimedia.org/T387943) (owner: 10Daimona Eaytoy) [14:08:42] (03Merged) 10jenkins-bot: Temporarily unset temporary-account-viewer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124768 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [14:08:51] genuinely setting "Zoe)" as a highlight message was one of the better ideas I've had [14:08:54] (03CR) 10CI reject: [V:04-1] dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [14:09:14] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1124480|Set Flow to read-only on remaining phase 2a wikis (T378834)]], [[gerrit:1124500|Remove unused config parameters from ReadingLists extension.]], [[gerrit:1124548|Use namespaced Title and Html classes (T166010 T387938)]], [[gerrit:1124549|officewiki: Disable the event-organizer user group (T387943)]], [[gerrit:1124768|Temporarily unset tempora [14:09:14] ry-account-viewer group (T387205)]] [14:09:21] T378834: [Config] Set Flow to read-only at all *Phase 2a* wikis - https://phabricator.wikimedia.org/T378834 [14:09:21] T166010: The Great Namespaceization Effort - https://phabricator.wikimedia.org/T166010 [14:09:22] T387938: beta cluster down - Internal error - https://phabricator.wikimedia.org/T387938 [14:09:22] T387943: Disable the event-organizer group in officewiki - https://phabricator.wikimedia.org/T387943 [14:09:22] T387205: IP reveal groups: Rename 
'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [14:09:31] (03PS2) 10Filippo Giunchedi: sre: open tasks for long standing lint problems [alerts] - 10https://gerrit.wikimedia.org/r/1124790 (https://phabricator.wikimedia.org/T309182) [14:09:32] (03PS1) 10Filippo Giunchedi: sre: route AlertLintProblem to the alert file team [alerts] - 10https://gerrit.wikimedia.org/r/1124800 (https://phabricator.wikimedia.org/T354762) [14:09:45] (03CR) 10Andrew Bogott: [C:03+2] cloudbackup: work around a postgresql bug by adjusting work_mem [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [14:10:53] dbrant: Assuming there is nothing to test for your change? [14:11:17] nope, and nothing's broken! [14:11:52] Dreamy_Jazz, hey. I'm around [14:12:05] Hi. I can get back to your change after I've finished this deploy [14:12:12] !log dreamyjazz@deploy2002 daimona, zoe, dreamyjazz, dbrant: Backport for [[gerrit:1124480|Set Flow to read-only on remaining phase 2a wikis (T378834)]], [[gerrit:1124500|Remove unused config parameters from ReadingLists extension.]], [[gerrit:1124548|Use namespaced Title and Html classes (T166010 T387938)]], [[gerrit:1124549|officewiki: Disable the event-organizer user group (T387943)]], [[gerrit:1124768|Temporarily unse [14:12:12] t temporary-account-viewer group (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:13] Dreamy_Jazz, sounds good, thanks [14:12:49] zip and Daimona: Please do any testing (if relevant) [14:12:52] i'm seeing mediawikiwiki and cawiki Flow boards as read-only now, as expected [14:13:00] Doing [14:14:30] !log restart pybal on lvs2013 [14:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:41] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal 
[14:14:53] !log restart pybal on lvs2014 [14:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:41] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:16:44] officewiki change looks good; prod didn't explode, so I assume the other change works fine too (I'm not sure how to test the "shitty enwiki hack") [14:16:55] :D [14:16:59] !log dreamyjazz@deploy2002 daimona, zoe, dreamyjazz, dbrant: Continuing with sync [14:17:27] It's the first time I read this comment and now I want to know more :D https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/refs/changes/48/1124548/3/wmf-config/CommonSettings.php#2402 [14:20:58] My change technically didn't work, but I will be able to fix it in a follow-up. It doesn't break anything as it stands. [14:23:28] (03CR) 10Filippo Giunchedi: [C:03+1] Also exclude Private-Peer from remote_instance:gnmi_bgp_neighbor_session_state [puppet] - 10https://gerrit.wikimedia.org/r/1124795 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [14:23:41] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124480|Set Flow to read-only on remaining phase 2a wikis (T378834)]], [[gerrit:1124500|Remove unused config parameters from ReadingLists extension.]], [[gerrit:1124548|Use namespaced Title and Html classes (T166010 T387938)]], [[gerrit:1124549|officewiki: Disable the event-organizer user group (T387943)]], [[gerrit:1124768|Temporarily unset tempor [14:23:41] ary-account-viewer group (T387205)]] (duration: 14m 26s) [14:23:47] T378834: [Config] Set Flow to read-only at all *Phase 2a* wikis - https://phabricator.wikimedia.org/T378834 [14:23:47] T166010: The Great Namespaceization Effort - https://phabricator.wikimedia.org/T166010 [14:23:47] T387938: beta cluster down - Internal error - https://phabricator.wikimedia.org/T387938 [14:23:48] T387943: Disable the event-organizer group in 
officewiki - https://phabricator.wikimedia.org/T387943 [14:23:48] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [14:24:13] (03PS1) 10Btullis: Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) [14:24:42] (03CR) 10CI reject: [V:04-1] Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: 10Btullis) [14:25:25] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=cumin2002.codfw.wmnet [14:26:00] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-staging2003,service=ml-staging [14:26:43] PROBLEM - Hadoop NodeManager on analytics1070 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:26:44] all done, then? 
[14:26:49] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-staging2003.codfw.wmnet,service=ml-staging [14:26:52] (03PS1) 10Dreamy Jazz: Unset unused IP reveal groups in $wgExtensionFunctions callbacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) [14:26:57] For your change yes [14:27:00] (03CR) 10CI reject: [V:04-1] Unset unused IP reveal groups in $wgExtensionFunctions callbacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [14:27:01] grand, thank you [14:27:07] Need to do the last change in the window plus my followup [14:27:15] Then can end the window [14:27:28] (03PS2) 10Dreamy Jazz: Unset unused IP reveal groups in $wgExtensionFunctions callbacks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) [14:29:55] !log klausman@cumin2002 conftool action : set/pooled=yes; selector: name=ml-staging2003.codfw.wmnet,service=ml_staging [14:30:10] !log draining and depooling dse-k8s-ctrl1001 ready for reimage to bookworm and containerd for T377875 [14:31:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:32:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: 10Abijeet Patro) [14:33:01] (03Merged) 10jenkins-bot: metawiki: Enable Chinese variant translation for message bundles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122632 (https://phabricator.wikimedia.org/T387230) (owner: 10Abijeet Patro) [14:33:33] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1122632|metawiki: Enable Chinese variant translation for message bundles (T387230)]] [14:33:37] !log 
stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-ctrl1001.eqiad.wmnet with OS bookworm [14:34:34] (03CR) 10Herron: [C:03+1] profile: add restbase scrape jobs to profile::prometheus::services [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite) [14:35:04] (03CR) 10Herron: [C:03+1] prometheus: split envoy rules into separate groups [puppet] - 10https://gerrit.wikimedia.org/r/1124743 (https://phabricator.wikimedia.org/T387965) (owner: 10Filippo Giunchedi) [14:35:58] (03PS10) 10Bking: cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) [14:36:08] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=ml-staging2003.codfw.wmnet [14:36:10] (03CR) 10Bking: cloudelastic: begin transition to opensearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [14:36:29] !log dreamyjazz@deploy2002 abi, dreamyjazz: Backport for [[gerrit:1122632|metawiki: Enable Chinese variant translation for message bundles (T387230)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:37] (03CR) 10Herron: [C:03+1] prometheus: trial moving k8s-mlstaging to prometheus2007 [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:26] (03CR) 10Herron: [C:03+1] prometheus: add sync-data script [puppet] - 10https://gerrit.wikimedia.org/r/1124749 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:38:06] (03CR) 10Herron: [C:03+1] 
prometheus: replace prometheus::migration with prometheus-sync-data [puppet] - 10https://gerrit.wikimedia.org/r/1124751 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:38:09] <_joe_> Dreamy_Jazz: when the deployments for the windows are over, please ping me, I have one tiny patch to deploy [14:38:21] (03PS2) 10Btullis: Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) [14:38:23] (03PS3) 10Dreamy Jazz: Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) [14:38:34] (03CR) 10CI reject: [V:04-1] Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [14:38:37] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: analytics_cluster::datahub::opensearch@eqiad [14:38:37] abijeet: Are you testing your change? [14:38:42] (03CR) 10Vgutierrez: [C:03+2] hiera,analytics_cluster: Enable IPIP on datahubsearch@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124064 (https://phabricator.wikimedia.org/T387306) (owner: 10Vgutierrez) [14:38:50] (03CR) 10CI reject: [V:04-1] Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: 10Btullis) [14:39:01] Just realised it didn't mention your specific IRC username so you might not have been pinged [14:39:02] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2046 [14:39:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2046 [14:39:18] I'll ping you when done. 
[14:39:29] I also think Amir will want to deploy something too after the window [14:39:47] (03PS4) 10Dreamy Jazz: Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) [14:40:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005 (10cmooney) 03NEW p:05Triage→03Medium [14:40:47] (03PS5) 10Dreamy Jazz: Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) [14:40:52] (03CR) 10Ottomata: [C:03+1] "Oh! Thank you! I did a codesearch for stuff like this but I guess missed this!" [alerts] - 10https://gerrit.wikimedia.org/r/1124787 (https://phabricator.wikimedia.org/T238230) (owner: 10Filippo Giunchedi) [14:41:03] (03CR) 10Bking: [C:03+2] cloudelastic: begin transition to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1124501 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [14:41:16] Dreamy_Jazz, on it [14:41:26] (03CR) 10Filippo Giunchedi: [C:03+2] data-engineering: remove legacy eventlogging alerts [alerts] - 10https://gerrit.wikimedia.org/r/1124787 (https://phabricator.wikimedia.org/T238230) (owner: 10Filippo Giunchedi) [14:41:30] (03PS3) 10Jcrespo: dbbackups: Migrate es backups from backup[12]02 to backup[12]13 [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892) [14:41:46] (03PS2) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) [14:42:08] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: split envoy rules into separate groups [puppet] - 10https://gerrit.wikimedia.org/r/1124743 (https://phabricator.wikimedia.org/T387965) (owner: 10Filippo Giunchedi) [14:42:47] !log fceratto@cumin1002 START - Cookbook 
sre.mysql.dbctl
[14:42:47] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.dbctl (exit_code=0)
[14:43:09] <_joe_> Dreamy_Jazz: thank you <3
[14:43:10] !log fceratto@cumin1002 START - Cookbook sre.mysql.dbctl
[14:43:10] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.dbctl (exit_code=2)
[14:43:39] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[14:43:40] <_joe_> Amir1: we can do a double-deploy in one go, if you want. My patch is specifically for noc.wikimedia.org
[14:43:47] (CR) Herron: [C:+1] "Nice idea! I'm assuming the expr is commented since the next patch will update that to bring in team parsing, lgtm" [alerts] - https://gerrit.wikimedia.org/r/1124790 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[14:44:04] (CR) Herron: [C:+1] "Nice! an improvement for sure" [alerts] - https://gerrit.wikimedia.org/r/1124800 (https://phabricator.wikimedia.org/T354762) (owner: Filippo Giunchedi)
[14:44:18] (PS1) Dreamy Jazz: Unset 'push-subscription-manager' group using hook callback [mediawiki-config] - https://gerrit.wikimedia.org/r/1124811 (https://phabricator.wikimedia.org/T275334)
[14:44:22] !log cmooney@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2046
[14:44:47] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs
[14:44:47] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: analytics_cluster::datahub::opensearch@eqiad
[14:44:48] Dreamy_Jazz, looks good.
[14:44:48] (03Abandoned) 10Dreamy Jazz: Unset 'push-subscription-manager' group using hook callback [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124811 (https://phabricator.wikimedia.org/T275334) (owner: 10Dreamy Jazz) [14:44:56] Thanks. Proceeding [14:45:10] !log dreamyjazz@deploy2002 abi, dreamyjazz: Continuing with sync [14:45:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2046 [14:45:45] !log fceratto@cumin1002 START - Cookbook sre.mysql.dbctl [14:45:47] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.dbctl (exit_code=1) [14:45:50] (03CR) 10Dreamy Jazz: [C:03+1] "Will deploy this shortly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123499 (https://phabricator.wikimedia.org/T275336) (owner: 10Pppery) [14:46:38] Dreamy_Jazz, thank you! [14:46:44] (03CR) 10Dreamy Jazz: [C:03+2] Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [14:46:52] (03CR) 10Dreamy Jazz: [C:03+2] Use MediaWikiServices hook for push-subscription-manager changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123499 (https://phabricator.wikimedia.org/T275336) (owner: 10Pppery) [14:47:26] (03PS3) 10Btullis: Remove sudo privileges for journalctl from airflow instance admins [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) [14:47:30] (03CR) 10Ssingh: Release dnsdist 1.9.8-1+wmf12u1 (031 comment) [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh) [14:47:36] (03Merged) 10jenkins-bot: Unset unused IP reveal groups in properly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124805 (https://phabricator.wikimedia.org/T387205) (owner: 10Dreamy Jazz) [14:47:38] (03Merged) 10jenkins-bot: Use MediaWikiServices hook for push-subscription-manager changes [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1123499 (https://phabricator.wikimedia.org/T275336) (owner: 10Pppery) [14:47:43] RECOVERY - Hadoop NodeManager on analytics1070 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:48:01] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl1001.eqiad.wmnet with reason: host reimage [14:48:18] (03CR) 10CI reject: [V:04-1] dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [14:48:58] (03PS6) 10Ssingh: Release dnsdist 1.9.8-1~wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 [14:49:15] (03CR) 10Ssingh: Release dnsdist 1.9.8-1~wmf12u1 (031 comment) [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh) [14:51:22] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl1001.eqiad.wmnet with reason: host reimage [14:51:34] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1007* for ban host prior to reimage - bking@cumin2002 - T387904 [14:51:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh) [14:51:36] T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904 [14:51:37] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1007* for ban host prior to reimage - bking@cumin2002 - T387904 [14:52:02] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1122632|metawiki: Enable Chinese variant translation for message bundles (T387230)]] (duration: 18m 29s) [14:52:05] T387230: Mandarin Translation Issue (zh-hans, zh-hant are not seprated handle properly) in WikiLearn - https://phabricator.wikimedia.org/T387230 [14:52:07] (03PS3) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) [14:52:16] !log fceratto@cumin1002 START - Cookbook sre.mysql.dbctl [14:52:28] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.dbctl (exit_code=99) [14:53:02] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1123499|Use MediaWikiServices hook for push-subscription-manager changes (T275336)]], [[gerrit:1124805|Unset unused IP reveal groups in properly (T387205)]] [14:53:07] T275336: push-subscription-manager group is sometimes available at all wikis - https://phabricator.wikimedia.org/T275336 [14:53:07] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [14:54:02] _joe_: already eating. Will do it later. Thanks for the offer! 
[14:54:19] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [14:54:29] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10605653 (10herron) [14:54:33] (03CR) 10Jcrespo: [C:03+2] dbbackups: Migrate es backups from backup[12]02 to backup[12]13 [puppet] - 10https://gerrit.wikimedia.org/r/1124738 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [14:54:33] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [14:54:53] (03CR) 10Federico Ceratto: "I was using it as a testbed for incremental tests of the new cloning script." [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [14:55:11] (03CR) 10Kamila Součková: [C:03+1] shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124796 (owner: 10Hnowlan) [14:55:55] !log dreamyjazz@deploy2002 dreamyjazz, pppery: Backport for [[gerrit:1123499|Use MediaWikiServices hook for push-subscription-manager changes (T275336)]], [[gerrit:1124805|Unset unused IP reveal groups in properly (T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:56:51] (03PS2) 10Jforrester: wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124509 [14:56:52] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-02-24-145135 to 2025-03-05-140259 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124817 [14:56:52] (03PS1) 10Jforrester: wikifunctions: Update 
orchestrator from 2025-02-25-210518 to 2025-03-05-140247 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124818 [14:57:24] !log dreamyjazz@deploy2002 dreamyjazz, pppery: Continuing with sync [14:57:25] FIRING: SystemdUnitFailed: opensearch-disable-readahead.service on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:57:33] (03CR) 10Marostegui: [C:04-1] "then it also needs to be moved to s7 in site.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [14:58:36] (03CR) 10CI reject: [V:04-1] dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) (owner: 10Federico Ceratto) [14:59:23] I guess the backport window is going to go over, as ever? :-) [14:59:39] :D [14:59:46] jouncebot: nowandnext [14:59:46] For the next 0 hour(s) and 0 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1400) [14:59:46] In 0 hour(s) and 0 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1500) [15:00:01] The changes from Amir and joe were not technically in the window [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1500) [15:00:08] I might be done in the next minute or two. [15:00:11] And yet. 
[15:00:35] Maybe the window needs to be 24 hours long :D
[15:01:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:04:08] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1123499|Use MediaWikiServices hook for push-subscription-manager changes (T275336)]], [[gerrit:1124805|Unset unused IP reveal groups in properly (T387205)]] (duration: 11m 05s)
[15:04:11] _joe_: I'm now done with deploying, though may be good to coordinate with others just to check if you can deploy in this window
[15:04:12] T275336: push-subscription-manager group is sometimes available at all wikis - https://phabricator.wikimedia.org/T275336
[15:04:13] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205
[15:04:33] We're just deploying a service bump.
[15:05:07] 06SRE, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q3): Ops-monitoring-bot creating duplicate tasks for the same RAID failure - https://phabricator.wikimedia.org/T387754#10605718 (10fgiunchedi) [15:05:59] (03CR) 10Ecarg: [C:03+2] wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124509 (owner: 10Jforrester) [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:15] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/services/mw-debug: apply [15:07:26] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/services/mw-debug: apply [15:07:35] (03Merged) 10jenkins-bot: wikifunctions: Raise orchestrator top CPU limit to 1 to see if that improves heap issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124509 (owner: 10Jforrester) [15:08:42] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet with OS bookworm [15:08:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [15:09:07] (03PS14) 10Clément Goubert: mediawiki::periodic_job: Split periodic job definition [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) [15:09:14] (03CR) 10Vgutierrez: [C:03+1] haproxy: Remove cipher regsub of 
"ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [15:09:17] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/services/mw-debug: apply [15:09:40] (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudvirt: increase conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/1124821 (https://phabricator.wikimedia.org/T387179) [15:09:52] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:10:49] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:11:28] !log installing openssh security updates [15:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:30] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: 10Btullis) [15:11:36] (03PS4) 10Ebernhardson: flink-app chart: Support per-chart logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 [15:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:11:50] 06SRE, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): etcd: adapt etcd-backup.py for etcd 3.4 - https://phabricator.wikimedia.org/T385727#10605753 (10fgiunchedi) [15:12:04] (03CR) 10Ebernhardson: "ahh, indeed. Done." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 (owner: 10Ebernhardson) [15:12:06] (03CR) 10Scott French: [C:03+1] "No objections on my end!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1124752 (owner: 10Effie Mouzeli) [15:12:19] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: 10Btullis) [15:13:24] (03CR) 10Ssingh: [C:03+2] Release dnsdist 1.9.8-1~wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh) [15:13:26] <_joe_> Dreamy_Jazz: yeah and I'm in multiple meetings in a row at this point, heh, I'll piggyback Amir later :) [15:13:27] (03CR) 10Kamila Součková: "@hnowlan@wikimedia.org could you please review the helm bits? thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1122961 (https://phabricator.wikimedia.org/T371214) (owner: 10Kamila Součková) [15:13:41] (03CR) 10Kamila Součková: "@hnowlan@wikimedia.org could you please review the helm bits? thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123010 (owner: 10Kamila Součková) [15:14:59] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10605768 (10MatthewVernon) >>! In T377827#10591134, @Ladsgroup wrote: > These are eqiad hosts which I haven't been deleting the thumbna... 
[15:16:35] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:17:13] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:17:29] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:18:07] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) (owner: Clément Goubert)
[15:18:21] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:19:25] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/services/mw-debug: apply
[15:19:31] (CR) Arturo Borrero Gonzalez: [C:+2] toolforge: haproxy: check ingress workers with the /healthz endpoint [puppet] - https://gerrit.wikimedia.org/r/1124756 (https://phabricator.wikimedia.org/T387959) (owner: Arturo Borrero Gonzalez)
[15:20:18] (CR) Btullis: [C:+2] Remove sudo privileges for journalctl from airflow instance admins [puppet] - https://gerrit.wikimedia.org/r/1124802 (https://phabricator.wikimedia.org/T387719) (owner: Btullis)
[15:20:53] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10605808 (cmooney) >>! In T384838#10603754, @Jhancock.wm wrote: > @Papaul i found a weird little thing. I racked ganeti2049 in B5, U40. There are three other serve...
[15:21:06] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/services/mw-debug: apply [15:23:06] (03CR) 10Ecarg: [C:03+2] wikifunctions: Update evaluators from 2025-02-24-145135 to 2025-03-05-140259 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124817 (owner: 10Jforrester) [15:23:08] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/services/mw-debug: apply [15:24:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10605822 (10cmooney) @Jhancock.wm one thing to make sure is all ganeti hosts are added to **row-wide** vlans. So in the [[ https://netbox.wikimedia.org/extras/scrip... [15:24:20] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [15:24:32] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-02-24-145135 to 2025-03-05-140259 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124817 (owner: 10Jforrester) [15:24:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: 10Arlolra) [15:26:05] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-ctrl1002.eqiad.wmnet with OS bookworm [15:26:10] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 717 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [15:26:14] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:26:22] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on 
Sun 11 May 2025 11:48:24 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [15:27:39] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudelastic1007.eqiad.wmnet with OS bullseye [15:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:27:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005#10605846 (10cmooney) [15:28:02] (03PS5) 10Herron: KubernetesRsyslogDown: alert only if logs were sent before [alerts] - 10https://gerrit.wikimedia.org/r/1124453 (https://phabricator.wikimedia.org/T381417) [15:28:22] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:30:20] !log upload dnsdist 1.9.8-1~wmf12u1 to apt.wm.org for bookworm [15:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:41] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:31:40] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:32:00] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:32:55] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:33:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [15:34:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU 
sensor over limit - https://phabricator.wikimedia.org/T387609#10605869 (10phaultfinder) [15:34:59] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [15:35:05] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [15:35:45] (03CR) 10Ecarg: [C:03+2] wikifunctions: Update orchestrator from 2025-02-25-210518 to 2025-03-05-140247 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124818 (owner: 10Jforrester) [15:37:14] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-02-25-210518 to 2025-03-05-140247 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124818 (owner: 10Jforrester) [15:37:56] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-ctrl1002.eqiad.wmnet with reason: host reimage [15:38:21] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [15:38:33] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [15:38:47] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:39:08] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:40:35] !log starting es backups on new hosts backup1013, backup2013 T387892 [15:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:38] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [15:41:37] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-ctrl1002.eqiad.wmnet with reason: host reimage [15:41:40] 
(03PS3) 10JMeybohm: Add pod-security.wmg.org labels to wikikube mediawiki namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124416 (https://phabricator.wikimedia.org/T273507) [15:41:40] (03PS1) 10JMeybohm: admin_ng: Disable hostPath and capabilities baseline rules for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124830 (https://phabricator.wikimedia.org/T273507) [15:41:40] (03PS1) 10Arturo Borrero Gonzalez: toolforge: haproxy: don't use TLS on the HTTP check for k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124829 (https://phabricator.wikimedia.org/T387959) [15:41:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:42:00] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:42:23] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:42:51] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:43:23] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:45:28] (03CR) 10Muehlenhoff: [C:03+2] Add component/lshw on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1124763 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff) [15:46:23] (03CR) 10David Caro: [C:03+1] toolforge: haproxy: don't use TLS on the HTTP check for k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124829 (https://phabricator.wikimedia.org/T387959) (owner: 10Arturo Borrero Gonzalez) [15:47:53] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] toolforge: haproxy: don't use TLS on the HTTP check for k8s-ingress [puppet] - 10https://gerrit.wikimedia.org/r/1124829 
(https://phabricator.wikimedia.org/T387959) (owner: 10Arturo Borrero Gonzalez) [15:48:36] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1007* for ban host prior to reimage - bking@cumin2002 - T387904 [15:48:36] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: cloudelastic1007* for ban host prior to reimage - bking@cumin2002 - T387904 [15:48:41] T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904 [15:50:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [15:52:04] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10605945 (10Jclark-ctr) @MatthewVernon can this drive be replaced at any time it is arriving tonight/tomorrow morning? [15:53:55] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10605951 (10MatthewVernon) @Jclark-ctr yes, please go ahead :) [I intend that to be clear from "you can work on this system at any time without further input from me." in the ticke... [15:54:33] (03CR) 10David Caro: [C:03+1] "\o/ yay!" [puppet] - 10https://gerrit.wikimedia.org/r/1124733 (https://phabricator.wikimedia.org/T353457) (owner: 10Filippo Giunchedi) [15:59:33] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828#10605987 (10VRiley-WMF) Looks like the SFP failed. 
Swapped it out and it looks like it's communicating... [16:00:05] tgr: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) SUL deploy window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1600). [16:00:07] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet with OS bookworm [16:00:30] Yet another window for a custom window :D [16:00:43] jouncebot has the funniest messages [16:01:59] (03PS1) 10JMeybohm: staging-codfw: Unset image.tag for coredns to apply the default version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124831 (https://phabricator.wikimedia.org/T384450) [16:02:01] (03PS1) 10JMeybohm: admin_ng: Update dependencies between releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124832 (https://phabricator.wikimedia.org/T341984) [16:03:27] (03PS1) 10Lucas Werkmeister (WMDE): statistics::wmde::graphite: add syslog_identifier [puppet] - 10https://gerrit.wikimedia.org/r/1124833 (https://phabricator.wikimedia.org/T387514) [16:03:50] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10606045 (10Krinkle) [16:05:10] (03CR) 10Btullis: [C:03+1] "Great! Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1124833 (https://phabricator.wikimedia.org/T387514) (owner: 10Lucas Werkmeister (WMDE)) [16:05:17] (03CR) 10Btullis: [C:03+2] statistics::wmde::graphite: add syslog_identifier [puppet] - 10https://gerrit.wikimedia.org/r/1124833 (https://phabricator.wikimedia.org/T387514) (owner: 10Lucas Werkmeister (WMDE)) [16:05:58] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [16:06:37] (03CR) 10JMeybohm: [C:03+2] admin_ng: Disable hostPath and capabilities baseline rules for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124830 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [16:06:49] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124733 (https://phabricator.wikimedia.org/T353457) (owner: 10Filippo Giunchedi) [16:07:06] (03PS1) 10Jcrespo: dbbackups: Add additional m1 grants for backup[12]013 stats user [puppet] - 10https://gerrit.wikimedia.org/r/1124834 (https://phabricator.wikimedia.org/T387892) [16:07:11] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 717 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [16:07:23] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 11 May 2025 11:48:24 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Search%23Administration [16:07:24] (03PS4) 10Federico Ceratto: dbctl.py, dbctl_test.py: Serialize dbctl changes [cookbooks] - 10https://gerrit.wikimedia.org/r/1124797 (https://phabricator.wikimedia.org/T387209) [16:08:08] (03PS14) 10Tiziano Fogli: snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) [16:10:04] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1124800 (https://phabricator.wikimedia.org/T354762) (owner: 10Filippo Giunchedi) [16:10:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [16:11:10] (03CR) 10Jcrespo: "I plan to deploy this tomorrow (I missed it during setup today)." [puppet] - 10https://gerrit.wikimedia.org/r/1124834 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [16:11:21] (03Merged) 10jenkins-bot: admin_ng: Disable hostPath and capabilities baseline rules for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124830 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [16:11:48] (03CR) 10Marostegui: [C:03+1] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254 [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [16:12:50] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:13:39] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1124790 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [16:14:19] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124751 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [16:15:09] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124749 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [16:15:40] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [16:15:58] (03CR) 10Stevemunene: [C:03+1] hiera,wcqs: Enable IPIP on wcqs@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123663 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez) [16:16:22] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite) [16:16:50] (03CR) 10Ebrahim: "If is possible please add the brand new table also to dump and specially replica similar to what is done to globalimagelinks https://githu" [puppet] - 10https://gerrit.wikimedia.org/r/1123022 (https://phabricator.wikimedia.org/T363581) (owner: 10Bvibber) [16:17:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2046-48 to codfw - jhancock@cumin2002" [16:17:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124781 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza) [16:17:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124782 
(https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza) [16:17:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124783 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [16:17:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124784 (owner: 10Gergő Tisza) [16:17:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124785 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza) [16:17:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124786 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza) [16:18:41] (03PS1) 10Lucas Werkmeister (WMDE): snapshot: add syslog_identifier to Wikibase dumps [puppet] - 10https://gerrit.wikimedia.org/r/1124835 (https://phabricator.wikimedia.org/T387514) [16:19:01] (03CR) 10Lucas Werkmeister (WMDE): "Note: I haven’t looked at the rsyslog configurations in detail and am not very sure that this is 100% correct…" [puppet] - 10https://gerrit.wikimedia.org/r/1124835 (https://phabricator.wikimedia.org/T387514) (owner: 10Lucas Werkmeister (WMDE)) [16:19:06] (03Merged) 10jenkins-bot: CentralAuthIdLookup: Reuse cached object on single-value lookup [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124781 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza) [16:19:08] (03Merged) 10jenkins-bot: CentralAuthIdLookup: Use primary DB after writes [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124782 (https://phabricator.wikimedia.org/T379909) (owner: 10Gergő Tisza) 
[16:19:11] (03Merged) 10jenkins-bot: Use UserOptionsManager for SUL3 rollout flag [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124783 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [16:19:12] (03Merged) 10jenkins-bot: Make SUL3 global preference optional and simplify logic [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124784 (owner: 10Gergő Tisza) [16:19:13] (03Merged) 10jenkins-bot: Add passive central domain to edge login list [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124785 (https://phabricator.wikimedia.org/T375796) (owner: 10Gergő Tisza) [16:19:14] (03Merged) 10jenkins-bot: SUL3: Use a central wiki for autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124786 (https://phabricator.wikimedia.org/T387357) (owner: 10Gergő Tisza) [16:19:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2046-48 to codfw - jhancock@cumin2002" [16:19:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:19:27] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2045 [16:19:27] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10606183 (10Jdlrobson-WMF) >>! In T214998#10600094, @Peter wrote: > I've been looking into the data we get... 
[16:19:28] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2046 [16:19:29] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [16:19:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2045 [16:19:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2046 [16:19:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [16:19:50] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124781|CentralAuthIdLookup: Reuse cached object on single-value lookup (T379909 T380500 T387106)]], [[gerrit:1124782|CentralAuthIdLookup: Use primary DB after writes (T379909 T380500)]], [[gerrit:1124783|Use UserOptionsManager for SUL3 rollout flag (T384549)]], [[gerrit:1124784|Make SUL3 global preference optional and simplify logic]], [[gerrit:1124785|Ad [16:19:50] d passive central domain to edge login list (T375796)]], [[gerrit:1124786|SUL3: Use a central wiki for autologin (T387357)]] [16:19:58] T379909: Define where to add code that needs to run after a new central user has been created - https://phabricator.wikimedia.org/T379909 [16:19:58] T380500: CentralAuthUser returning outdated data after user creation - https://phabricator.wikimedia.org/T380500 [16:19:58] T387106: CentralAuthIdLookup should use a cache - https://phabricator.wikimedia.org/T387106 [16:19:59] T384549: Create a per-user flag for enabling SUL3 - https://phabricator.wikimedia.org/T384549 [16:19:59] T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796 [16:19:59] T387357: SUL3 signup results in autocreation on all edge login domains - https://phabricator.wikimedia.org/T387357 [16:20:13] !log jhancock@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host ganeti2048 [16:20:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2048 [16:20:56] wow those merges were fast. Did we finally stop running cross-repo Selenium tests for backports? [16:22:50] !log tgr@deploy2002 tgr: Backport for [[gerrit:1124781|CentralAuthIdLookup: Reuse cached object on single-value lookup (T379909 T380500 T387106)]], [[gerrit:1124782|CentralAuthIdLookup: Use primary DB after writes (T379909 T380500)]], [[gerrit:1124783|Use UserOptionsManager for SUL3 rollout flag (T384549)]], [[gerrit:1124784|Make SUL3 global preference optional and simplify logic]], [[gerrit:1124785|Add passive central do [16:22:51] main to edge login list (T375796)]], [[gerrit:1124786|SUL3: Use a central wiki for autologin (T387357)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:22:56] Yes, I dropped that a few weeks ago. [16:22:56] (03CR) 10Btullis: [C:03+2] snapshot: add syslog_identifier to Wikibase dumps [puppet] - 10https://gerrit.wikimedia.org/r/1124835 (https://phabricator.wikimedia.org/T387514) (owner: 10Lucas Werkmeister (WMDE)) [16:23:12] And on Monday I switched the wmf-quibble jobs from 7.4 to 8.1 which should speed things up a little. [16:23:28] (03CR) 10Stevemunene: [C:03+1] hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez) [16:23:46] thanks for that! it's neat to have extension backports merge in <2 min [16:23:56] Indeed, back to the good old days. [16:24:15] (03CR) 10Stevemunene: [C:03+1] hiera,wdqs: Enable IPIP on wdqs-internal-main@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123684 (https://phabricator.wikimedia.org/T387319) (owner: 10Vgutierrez) [16:24:19] But the main speed up is the re-use of existing cached job outputs that RelEng landed last week. 
[16:24:32] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage [16:26:17] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wcqs::public@codfw [16:26:27] (03CR) 10Federico Ceratto: [C:03+1] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254 [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [16:26:29] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml, db1253.yaml, db1254.yaml, site.pp: clone db1253 and db1254 [puppet] - 10https://gerrit.wikimedia.org/r/1124740 (https://phabricator.wikimedia.org/T385141) (owner: 10Federico Ceratto) [16:26:45] (03CR) 10Vgutierrez: [C:03+2] hiera,wcqs: Enable IPIP on wcqs@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123663 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez) [16:26:58] Huh. "WikimediaDebug is disabled. To re-enable it, accept the new permissions: Block content on any page." [16:27:22] I guess we aren't the only ones who have a hard time making our permission system intuitive. 
[16:28:45] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage [16:29:57] tgr_: yeah that's being worked on afaik [16:30:10] indeed, T387822 [16:30:10] T387822: WikimediaDebug Firefox extension requires permission to block content on any page - https://phabricator.wikimedia.org/T387822 [16:30:14] https://phabricator.wikimedia.org/T387899 [16:30:40] Ah, I had the "parent" handy :p [16:30:46] ^^ [16:32:20] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [16:33:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [16:33:17] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wcqs::public@codfw [16:33:34] (03CR) 10Clément Goubert: [C:03+2] deployment server: Don't pass -Dfull_image_build:True to scap stage-train [puppet] - 10https://gerrit.wikimedia.org/r/1124462 (https://phabricator.wikimedia.org/T387823) (owner: 10Ahmon Dancy) [16:34:45] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wcqs::public@eqiad [16:34:59] (03CR) 10Vgutierrez: [C:03+2] hiera,wcqs: Enable IPIP on wcqs@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123664 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez) [16:35:08] (03PS2) 10Vgutierrez: hiera,wcqs: Enable IPIP on wcqs@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123664 (https://phabricator.wikimedia.org/T387313) [16:38:09] (03CR) 10Vgutierrez: [C:03+2] hiera,wcqs: Enable IPIP on wcqs@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123664 (https://phabricator.wikimedia.org/T387313) (owner: 10Vgutierrez) [16:39:40] !log vgutierrez@cumin1002 END (FAIL) - 
Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for role: wcqs::public@eqiad [16:39:53] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wcqs::public@eqiad [16:40:15] tgr_: re https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123029, I think sukhe offered to take a look [16:41:28] rzl: I am in a meeting right now but I wasn't aware of this so can look when I am done [16:41:42] vgutierrez ^ [16:41:47] can you take a look please? [16:43:23] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user ori - https://phabricator.wikimedia.org/T388029 (10acooper) 03NEW [16:43:25] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030 (10acooper) 03NEW [16:43:44] sukhe: sure [16:44:13] ,3 [16:44:13] <3 [16:44:39] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [16:45:36] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:45:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:45:43] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [16:45:44] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wcqs::public@eqiad [16:47:07] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user jdcc - https://phabricator.wikimedia.org/T388029#10606414 (10acooper) [16:47:09] 07Puppet, 06Web-Team: Certain mobile devices including XiaoMi are not being 
redirected to our mobile site - https://phabricator.wikimedia.org/T388032 (10Jdlrobson-WMF) 03NEW [16:48:01] !log tgr@deploy2002 tgr: Continuing with sync [16:48:25] (03CR) 10Vgutierrez: [C:03+1] Update CentralAuth multi-DC rules for SUL3, attempt 2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza) [16:48:32] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034 (10acooper) 03NEW [16:48:47] sukhe, tgr_, rzl: it looks good to me [16:48:50] thanks sukhe, vgutierrez! It's not particularly urgent, just trying to figure out the next step [16:50:19] vgutierrez: :* [16:50:23] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:50:29] (03PS2) 10Sergio Gimeno: [Growth] Set default api lookahead size to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) [16:50:30] tgr_: no worries, feel free to add one of me or vgutierrez to such patches [16:50:44] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:50:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) (owner: 10Sergio Gimeno) [16:50:50] or both for extra TZ coverage :D [16:50:52] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:51:10] 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10606475 (10Jdlrobson-WMF) [16:53:02] 06SRE, 10SRE-Access-Requests: Remove 
production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606477 (10acooper) Update - confirming with staff members whether this access is still required as they may still be actively doing volunteer work, will confirm back, so pause thi... [16:53:04] (03PS1) 10Kamila Součková: prometheus: charmuseum relabel config [puppet] - 10https://gerrit.wikimedia.org/r/1124843 (https://phabricator.wikimedia.org/T386808) [16:53:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:53:43] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:53:52] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:54:08] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606486 (10acooper) a:03odimitrijevic [16:54:45] (03CR) 10Gergő Tisza: Update CentralAuth multi-DC rules for SUL3, attempt 2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza) [16:54:47] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124781|CentralAuthIdLookup: Reuse cached object on single-value lookup (T379909 T380500 T387106)]], [[gerrit:1124782|CentralAuthIdLookup: Use primary DB after writes (T379909 T380500)]], [[gerrit:1124783|Use UserOptionsManager for SUL3 rollout flag (T384549)]], [[gerrit:1124784|Make SUL3 global preference optional and simplify logic]], [[gerrit:1124785|A [16:54:47] dd passive central domain to edge login list (T375796)]], [[gerrit:1124786|SUL3: Use a central wiki for autologin (T387357)]] (duration: 34m 57s) [16:54:54] T379909: Define where to add code that needs to run after a new central user has been created - 
https://phabricator.wikimedia.org/T379909 [16:54:55] T380500: CentralAuthUser returning outdated data after user creation - https://phabricator.wikimedia.org/T380500 [16:54:55] T387106: CentralAuthIdLookup should use a cache - https://phabricator.wikimedia.org/T387106 [16:54:55] T384549: Create a per-user flag for enabling SUL3 - https://phabricator.wikimedia.org/T384549 [16:54:56] T375796: Synchronize SUL2 and SUL3 central browser state - https://phabricator.wikimedia.org/T375796 [16:54:56] T387357: SUL3 signup results in autocreation on all edge login domains - https://phabricator.wikimedia.org/T387357 [16:55:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124757 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [16:56:05] (03Merged) 10jenkins-bot: CentralAuth: Enable SUL3 signup on group 0 (attempt 4) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124757 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [16:56:26] (03PS4) 10Gergő Tisza: Update CentralAuth multi-DC rules for SUL3, attempt 2 [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) [16:56:35] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124757|CentralAuth: Enable SUL3 signup on group 0 (attempt 4) (T384007)]] [16:56:37] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [16:56:48] (03CR) 10Vgutierrez: Update CentralAuth multi-DC rules for SUL3, attempt 2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza) [16:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown 
[16:59:34] !log tgr@deploy2002 tgr: Backport for [[gerrit:1124757|CentralAuth: Enable SUL3 signup on group 0 (attempt 4) (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:00:50] (03CR) 10Gergő Tisza: Update CentralAuth multi-DC rules for SUL3, attempt 2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza) [17:01:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:03:33] (03CR) 10Tchanders: [C:03+1] "This duplicates Ic9f792e9749e299ff8257474a2c73ca549e3f4e7, but this has more explanation in the commit message. If we deploy this one inst" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124745 (https://phabricator.wikimedia.org/T380441) (owner: 10Máté Szabó) [17:04:13] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite) [17:05:09] (03PS1) 10Volans: sre.hosts.provision: disable HostHeaderCheck [cookbooks] - 10https://gerrit.wikimedia.org/r/1124845 (https://phabricator.wikimedia.org/T382416) [17:05:58] jouncebot: nowandnext [17:05:59] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [17:05:59] In 0 hour(s) and 54 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1800) [17:06:13] tgr_: Hi, let me know once you're fully done [17:09:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10606534 (10phaultfinder) [17:09:51] (03PS3) 10Cwhite: profile: add restbase scrape jobs to profile::prometheus::services [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) [17:10:00] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124533 
(https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite) [17:12:00] 06SRE, 10SRE-Access-Requests, 05WMF-NDA: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606547 (10acooper) [17:12:06] 06SRE, 10SRE-Access-Requests, 05WMF-NDA: Remove production data access for NDA expired user jdcc - https://phabricator.wikimedia.org/T388029#10606549 (10acooper) [17:12:16] 06SRE, 10SRE-Access-Requests, 05WMF-NDA: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10606550 (10acooper) [17:12:38] 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10606551 (10bwang) p:05Triage→03High [17:12:40] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10606552 (10acooper) [17:12:44] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user jdcc - https://phabricator.wikimedia.org/T388029#10606553 (10acooper) [17:12:51] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606555 (10acooper) [17:14:09] !log tgr@deploy2002 tgr: Continuing with sync [17:17:45] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user jdcc - https://phabricator.wikimedia.org/T388029#10606599 (10acooper) a:03MoritzMuehlenhoff [17:17:53] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10606600 (10acooper) a:03MoritzMuehlenhoff [17:20:48] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124757|CentralAuth: Enable SUL3 signup on group 0 (attempt 4) (T384007)]] (duration: 24m 13s) [17:20:52] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - 
https://phabricator.wikimedia.org/T384007 [17:21:05] Amir1: done [17:21:54] thanks [17:25:42] (03CR) 10Ladsgroup: [C:03+2] Enable thumbnail steps in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124759 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [17:26:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124759 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [17:27:32] (03Merged) 10jenkins-bot: Enable thumbnail steps in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124759 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [17:28:02] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1124759|Enable thumbnail steps in testwiki (T360589)]] [17:28:05] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [17:29:28] (03PS1) 10Scott French: mw-(api-ext|web): right-size given current traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) [17:29:30] (03PS2) 10Scott French: mw-(api-ext|web): serve 5% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845) [17:31:01] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1124759|Enable thumbnail steps in testwiki (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:31:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:31:58] (03CR) 10Scott French: "This is part one of two patches that together (1) 
right-size mw-web and mw-api-ext for the current migration state (2) clarify multi-DC se" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:32:21] (03CR) 10Scott French: "Thanks in advance for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [17:34:22] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [17:34:28] (03CR) 10Hnowlan: [C:03+2] shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124796 (owner: 10Hnowlan) [17:35:42] (03PS1) 10Michael Große: Growth: remove unused config wgGENewcomerTasksOresTopicConfigTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124836 [17:36:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124836 (owner: 10Michael Große) [17:36:33] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10606672 (10KFrancis) Hi @Ben.buchenau, please confirm this is your correct name and I will put the NDA agreement together. Thanks! [17:36:35] (03Merged) 10jenkins-bot: shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124796 (owner: 10Hnowlan) [17:38:16] (03CR) 10Cwhite: [C:03+2] profile: add restbase scrape jobs to profile::prometheus::services [puppet] - 10https://gerrit.wikimedia.org/r/1124533 (https://phabricator.wikimedia.org/T387343) (owner: 10Cwhite) [17:40:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005#10606698 (10Jhancock.wm) 47 and 48 are not live. new machines. 
so i can redo those [17:41:07] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124759|Enable thumbnail steps in testwiki (T360589)]] (duration: 13m 04s) [17:41:10] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [17:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:50:19] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:50:23] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:50:39] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [17:50:41] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [17:50:45] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [17:50:55] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [17:53:15] (03PS2) 10Bernard Wang: Enable Search AB test for en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 [17:56:05] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user aude - https://phabricator.wikimedia.org/T388034#10606809 (10Aklapper) @acooper: Hmm, where exactly does that information come from? https://phabricator.wikimedia.org/p/aude/ implies that they are //currently// a contractor (or ve... 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1800) [18:03:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:03:29] 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10606837 (10Jdlrobson-WMF) [18:04:01] (03PS1) 10Fabfur: acme_chief: add parameter for destination path [puppet] - 10https://gerrit.wikimedia.org/r/1124855 (https://phabricator.wikimedia.org/T387929) [18:09:08] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828#10606881 (10VRiley-WMF) 05Open→03Resolved Confirmed that this unit came back online [18:10:29] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124855 (https://phabricator.wikimedia.org/T387929) (owner: 10Fabfur) [18:10:42] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10606893 (10Ladsgroup) The deletions will be quite slow and on top of that, we are introducing the thumbnail steps and bumping the defa... [18:12:45] (03CR) 10Hnowlan: [C:03+1] mw-(api-ext|web): right-size given current traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:13:40] (03CR) 10Hnowlan: [C:03+1] "lgtm, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [18:16:00] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for jhuneidi - https://phabricator.wikimedia.org/T388044 (10thcipriani) 03NEW [18:18:34] 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10606946 (10Krinkle) I believe it would be a mistake to hardcode `MiuiBrowser` as a mobile browser, as this would break the browser UI and the end-users... [18:20:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005#10606953 (10cmooney) >>! In T388005#10606698, @Jhancock.wm wrote: > 47 and 48 are not live. new machines. so i can redo those Sorry I was being dumb, ganeti203... [18:20:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1007.eqiad.wmnet with OS bullseye [18:20:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Ganeti hosts added on codfw per-rack vlans - https://phabricator.wikimedia.org/T388005#10606957 (10cmooney) [18:21:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: 10Itamar Givon) [18:21:22] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Search%23Administration [18:21:34] PROBLEM - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Search%23Administration [18:21:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [18:21:48] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [18:29:12] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10607012 (10Krinkle) I've written up my analysis and proposal at: https://www.mediawiki.org/wiki/Requests_f... [18:33:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:45:22] !log import trafficserver 9.2.9-1wm1 into bullseye-wikimedia [18:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:48] !log import trafficserver 9.2.9-1wm1 into bullseye-wikimedia (T388035) [18:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:51] T388035: upgrade to trafficserver 9.2.9 - https://phabricator.wikimedia.org/T388035 [18:52:22] (03PS1) 10Dzahn: zuul: remove gearman wait queue monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1124857 (https://phabricator.wikimedia.org/T388041) [19:00:04] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1900) [19:03:24] RECOVERY - MariaDB Replica Lag: s8 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:04:04] !log marostegui@cumin1002 
dbctl commit (dc=all): 'db1167 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74098 and previous config saved to /var/cache/conftool/dbconfig/20250305-190403-root.json [19:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3554 MB (3% inode=98%): /tmp 3554 MB (3% inode=98%): /var/tmp 3554 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [19:15:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74099 and previous config saved to /var/cache/conftool/dbconfig/20250305-191550-root.json [19:16:16] (03PS1) 10Gergő Tisza: Preserve usesul3 flag during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124860 (https://phabricator.wikimedia.org/T375788) [19:16:35] (03PS1) 10Gergő Tisza: Preserve usesul3 flag during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124861 (https://phabricator.wikimedia.org/T375788) [19:17:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124860 (https://phabricator.wikimedia.org/T375788) (owner: 10Gergő Tisza) [19:17:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124861 (https://phabricator.wikimedia.org/T375788) (owner: 10Gergő Tisza) [19:19:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 
25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74100 and previous config saved to /var/cache/conftool/dbconfig/20250305-191909-root.json [19:22:11] (03CR) 10BCornwall: [V:03+1 C:03+2] haproxy: Remove cipher regsub of "ECDHE-RSA-" [puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [19:27:16] (03PS1) 10Bking: cloudelastic: migrate cloudelastic1008 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1124863 (https://phabricator.wikimedia.org/T387904) [19:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:27:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124863 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [19:30:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74101 and previous config saved to /var/cache/conftool/dbconfig/20250305-193056-root.json [19:30:59] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1008* for ban host prior to reimage - bking@cumin2002 - T387904 [19:31:02] T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904 [19:31:02] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1008* for ban host prior to reimage - bking@cumin2002 - T387904 [19:34:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74102 and previous config saved to /var/cache/conftool/dbconfig/20250305-193414-root.json [19:41:18] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP 
CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:46:01] (03PS1) 10Gergő Tisza: Clean up SUL3 config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124865 (https://phabricator.wikimedia.org/T384007) [19:46:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74103 and previous config saved to /var/cache/conftool/dbconfig/20250305-194601-root.json [19:46:02] (03PS1) 10Gergő Tisza: Roll out SUL3 signup to 1% of users on most group 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124866 (https://phabricator.wikimedia.org/T384007) [19:46:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124865 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [19:46:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124866 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [19:49:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74104 and previous config saved to /var/cache/conftool/dbconfig/20250305-194920-root.json [19:54:18] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdr) failed on ms-be1080 - https://phabricator.wikimedia.org/T387707#10607356 (10VRiley-WMF) 05Open→03Resolved This hard drive has been replaced. 
[20:01:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74105 and previous config saved to /var/cache/conftool/dbconfig/20250305-200106-root.json [20:04:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74106 and previous config saved to /var/cache/conftool/dbconfig/20250305-200426-root.json [20:04:55] (03PS1) 10Arlolra: Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 [20:07:33] jouncebot: nowandnext [20:07:33] For the next 0 hour(s) and 52 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T1900) [20:07:33] In 0 hour(s) and 52 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2100) [20:09:12] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4052.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [20:09:45] dduvall: I see group1 rolled during the primary train window earlier today. any objections if I use the rest of your window to make some mediawiki capacity right-sizing changes? [20:11:24] swfrench-wmf: no objection from me [20:11:32] dduvall: great, thank you! 
[20:11:53] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4052.ulsfo.wmnet} and A:cp for 9.2.9-1wm1 [20:12:39] (03CR) 10Subramanya Sastry: [C:04-1] Invert Parsoid read view wiktionary configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra) [20:13:58] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Grant Access to wmf; analytics-privatedata-users for HCoplin-WMF - https://phabricator.wikimedia.org/T387459#10607402 (10HCoplin-WMF) Just tested with dashboards I previously didn't have access to, and... [20:15:33] (03CR) 10Scott French: "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [20:15:41] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): right-size given current traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [20:16:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74107 and previous config saved to /var/cache/conftool/dbconfig/20250305-201612-root.json [20:16:52] (03CR) 10Subramanya Sastry: [C:04-1] "Otherwise, the list matches what I see on metawiki." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra) [20:17:19] (03Merged) 10jenkins-bot: mw-(api-ext|web): right-size given current traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124848 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [20:19:32] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [20:20:04] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [20:20:16] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [20:20:26] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [20:21:55] (03CR) 10Arlolra: Invert Parsoid read view wiktionary configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra) [20:22:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra) [20:28:05] (03CR) 10Ryan Kemper: [C:03+1] cloudelastic: migrate cloudelastic1008 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1124863 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [20:34:22] rzl: I'm back now [20:34:30] * swfrench-wmf shakes fist at computer [20:34:33] 👍 [20:38:00] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [20:38:12] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [20:38:26] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [20:38:35] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [20:39:54] !log right-sized capacity distribution between mw-(api-ext|web) main and next releases - T383845 [20:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:57] 
T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845 [20:45:51] (03CR) 10Jdlrobson: [C:04-1] Enable Search AB test for en wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [20:51:29] (03CR) 10Subramanya Sastry: [C:04-1] Invert Parsoid read view wiktionary configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra) [20:55:19] (03CR) 10Bernard Wang: Enable Search AB test for en wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [20:56:39] (03PS2) 10Arlolra: Turn on Parsoid Read Views for 44 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) [20:56:40] (03PS2) 10Arlolra: Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 [20:57:28] (03CR) 10Subramanya Sastry: [C:03+1] Turn on Parsoid Read Views for 44 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: 10Arlolra) [20:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:57:50] (03PS3) 10Arlolra: Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 [20:57:59] (03CR) 10Arlolra: Invert Parsoid read view wiktionary configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra) [20:59:00] (03CR) 10Subramanya Sastry: [C:03+1] Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop 
calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2100). [21:00:05] bwang, arlolra, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:21] o/ [21:01:54] hi [21:01:55] Im here [21:03:22] o/ [21:04:32] Sorry i just have 1 patch to back port but its in the table 3 times haha [21:05:10] (03CR) 10Bernard Wang: [C:04-1] Enable Search AB test for en wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [21:05:51] Sorry its not ready yet [21:06:28] I can deploy [21:06:53] (03PS3) 10Bernard Wang: Enable Search AB test for en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 [21:07:51] arlolra: can I deploy the two config changes together? [21:07:57] Yes pelase [21:08:00] please [21:08:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: 10Arlolra) [21:08:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra) [21:09:35] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for 44 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124551 (https://phabricator.wikimedia.org/T387505) (owner: 10Arlolra) [21:09:37] (03Merged) 10jenkins-bot: Invert Parsoid read view wiktionary configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124867 (owner: 10Arlolra) [21:10:06] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124551|Turn on Parsoid Read Views for 44 wiktionaries (T387505)]], [[gerrit:1124867|Invert Parsoid read view wiktionary configs]] [21:10:09] T387505: 
Parsoid Read Views to Wiktionary deploy ~2025-03-03 - https://phabricator.wikimedia.org/T387505 [21:13:26] !log tgr@deploy2002 tgr, arlolra: Backport for [[gerrit:1124551|Turn on Parsoid Read Views for 44 wiktionaries (T387505)]], [[gerrit:1124867|Invert Parsoid read view wiktionary configs]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:13:57] (03PS1) 10Scott French: mw-web: additional right-sizing tweaks for next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845) [21:14:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10607511 (10phaultfinder) [21:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3649 MB (3% inode=98%): /tmp 3649 MB (3% inode=98%): /var/tmp 3649 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [21:15:42] tgr_: looks good to continue [21:16:28] !log tgr@deploy2002 tgr, arlolra: Continuing with sync [21:17:00] (03CR) 10Scott French: "Thanks in advance for the review, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [21:17:26] (03CR) 10Gergő Tisza: [C:04-1] "We forgot to deploy this, oops." 
[puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [21:20:34] (03CR) 10Ebernhardson: [C:03+2] flink-app chart: Support per-chart logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 (owner: 10Ebernhardson) [21:21:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/next at eqiad: 15.65% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:21:21] (03CR) 10RLazarus: [C:03+1] mw-web: additional right-sizing tweaks for next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [21:21:33] ^ this was me - I have a patch to resize [21:22:04] tgr_: after this backport completes, could you please pause so I can tweak this [21:22:28] (03Merged) 10jenkins-bot: flink-app chart: Support per-chart logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124546 (owner: 10Ebernhardson) [21:22:29] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124551|Turn on Parsoid Read Views for 44 wiktionaries (T387505)]], [[gerrit:1124867|Invert Parsoid read view wiktionary configs]] (duration: 12m 23s) [21:22:32] T387505: Parsoid Read Views to Wiktionary deploy ~2025-03-03 - https://phabricator.wikimedia.org/T387505 [21:22:40] tgr_: please pause here [21:22:51] ack [21:22:55] (03CR) 10Scott French: [C:03+2] mw-web: additional right-sizing tweaks for next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [21:23:17] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [21:23:30] tgr_: thanks [21:24:09] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:24:17] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:24:28] (03Merged) 10jenkins-bot: mw-web: additional right-sizing tweaks for next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124872 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [21:25:42] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [21:25:54] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [21:26:03] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [21:26:11] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [21:26:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:26:34] arlolra: what exactly do these patches do in practice? 
[21:26:50] we're seeing a rather large bump in latency [21:28:29] oh, we sure are https://grafana.wikimedia.org/goto/TAIaXPpNg [21:28:54] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:29:03] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:30:27] I see a small bump in worker saturation at 20:18 associated with swfrench-wmf's deployment, and a bigger one at 21:17 associated with tgr_/arlolra's [21:31:03] the initial spike in latency didn't hang around, but it's stabilizing *much* higher than previously, and the worker saturation isn't going anywhere [21:31:14] if this wasn't expected, please strongly consider a rollback while investigating [21:31:15] FIRING: [3x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 0% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:31:48] tgr_, arlolra: (and if you're here digging, please say so -- if it gets worse and I don't hear from you, I may roll back) [21:32:51] I'm concerned that container network rx on mw-web is way up vs. before [21:33:11] cache text got invalidated I guess https://grafana.wikimedia.org/d/O9zAmeOWz/ats-cache-operations?orgId=1&from=now-3h&to=now&viewPanel=4 [21:33:38] there is a large bump of "cache_text fresh backend" [21:33:43] at a glance it should only affect a bunch of smaller wiktionaries [21:33:47] good catch, hashar! [21:34:04] so not a lot of traffic [21:35:08] for context the difference in php resource consumption is about 10% of the fleet -- we were using a little under 40% of workers and are now using a little under 50% [21:35:28] should I roll back? 
[21:35:36] even if it wasn't a lot of traffic in CDN terms, if we invalidated the cache for all of it, it's a lot of traffic in app layer terms [21:35:48] I am not familiar with the feature and I guess arlolra left already [21:36:01] roll back so [21:36:07] I am happy to assist :) [21:36:26] then we can see whether the latency is restored [21:36:29] but 10% of CPU usage for cache invalidation of what's probably less than 1% of our content would be surprising [21:36:32] ok [21:36:33] There's been a significant jump in job insertion https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s&from=now-3h&to=now [21:37:05] that would explain why jobrunners are running hot [21:37:31] eh, scap backport --revert can't handle stacked commits [21:37:32] tgr_: yes, let's roll back please -- this isn't immediately an emergency, so feel free to do so in a leisurely fashion [21:37:34] just a sec [21:37:47] if arlolra wants to collect any data first that's fine, maybe ping them out of IRC [21:38:04] (03PS1) 10Gergő Tisza: Revert "Invert Parsoid read view wiktionary configs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124874 [21:38:23] (03PS1) 10Gergő Tisza: Revert "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875 [21:38:31] (03CR) 10CI reject: [V:04-1] Revert "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875 (owner: 10Gergő Tisza) [21:38:42] (03PS2) 10Gergő Tisza: Revert "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875 [21:38:52] I have poked content-transformers in their thread on Slack [21:38:55] swfrench-wmf: they make parsoid the default wikitext parser [21:39:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124874 (owner: 10Gergő Tisza) [21:39:50] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875 (owner: 10Gergő Tisza) [21:39:59] arlolra: that's in theory limited to 44 wiktionaries? [21:40:13] yup [21:40:31] (03Merged) 10jenkins-bot: Revert "Invert Parsoid read view wiktionary configs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124874 (owner: 10Gergő Tisza) [21:40:37] (03Merged) 10jenkins-bot: Revert "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124875 (owner: 10Gergő Tisza) [21:41:02] swfrench-wmf: it's already enabled for all of wikivoyage and most other wiktionaries [21:41:03] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124874|Revert "Invert Parsoid read view wiktionary configs"]], [[gerrit:1124875|Revert "Turn on Parsoid Read Views for 44 wiktionaries"]] [21:41:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 6.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:41:31] hmm [21:41:39] the alarm on the canary resolved as part of the deploy? [21:42:18] hashar: latency is slowly trending down, so I think that one just slipped below the alert threshold [21:42:48] arlolra: got it, thanks. have other similarly sized enrollments in parsoid read views caused similar latency impact previously? 
[21:42:59] (03CR) 10Jdlrobson: [C:03+1] Enable Search AB test for en wiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [21:43:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:43:19] we've been deploying to a set of ~40 wikis the past few weeks [21:43:25] that might be an alerting bug, 6.25% idle is still over the-- yeah okay [21:44:01] * subbu catches up with backlog [21:44:02] !log tgr@deploy2002 tgr: Backport for [[gerrit:1124874|Revert "Invert Parsoid read view wiktionary configs"]], [[gerrit:1124875|Revert "Turn on Parsoid Read Views for 44 wiktionaries"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:44:06] and that alarm has a bit.ly link 🤭 [21:44:08] "bug" or some kind of transient state associated with the deploy, but either way they're very much consistently saturated [21:44:12] !log tgr@deploy2002 tgr: Continuing with sync [21:44:21] hashar: if you click on it, it will explain why :) [21:45:16] per arlo, we have been deploying these for the last 3 weeks ... and haven't had any alerts thus far, and these are much smaller wikis than the previous wikis we rolled out to. [21:45:24] rzl: I imagine if we have a cluster fuck issue, we surely want a link and a doc hosted outside of our domains/cluster etc :-] [21:45:53] swfrench-wmf: this is the first time we've seen any alert [21:45:58] subbu: yeah, for clarity -- not opposed to the content of the change, I just want to make sure we understand why this had such a big effect on the app layer [21:46:35] understood .. i am just thinking aloud here ... 
those recent wikis all have a few thousand pages at most. [21:46:44] rzl: +1, understanding why it's happening is important. [21:46:57] (03CR) 10Bernard Wang: Enable Search AB test for en wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [21:46:59] arlolra, wonder if the invert patch had something else we missed. [21:47:12] ^ this is what I'm wondering [21:47:14] Hi, so this patch is ready https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124510/2 [21:47:21] there were 2 config pages we deployed .. one was the rollout to 44 wiktionaries. [21:47:28] If there’s still time after in this window [21:47:34] the second one was to invert the config to simplify the config. [21:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:47:50] we could roll them out individually [21:48:05] ya. [21:48:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-jobrunner/canary at codfw: 3.125% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-jobrunner&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:49:21] alright, so if we see a second wave of cache invalidation, we may experience a second bump [21:49:28] yeah, was thinking the same [21:49:30] I wonder if maybe wikimedia-config does not handle three-level settings (default => false, wiktionary => true, => false) correctly? [21:49:50] we did that for wikivoyages though? 
[21:50:09] swfrench-wmf: the bad news is it'll be the same size as the first bump, so we're already committed -- but the good news is we know we have the resources to handle that [21:50:25] agreed, yeah [21:50:34] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124874|Revert "Invert Parsoid read view wiktionary configs"]], [[gerrit:1124875|Revert "Turn on Parsoid Read Views for 44 wiktionaries"]] (duration: 09m 30s) [21:50:53] is there actual cache invalidation involved? if the cache is merely split on parser type, the old parser entries should still be there in the parser cache [21:51:02] yes. [21:51:15] yes to the cache is split by parser type. [21:51:17] ah, that's good to know [21:51:19] oh, great [21:52:18] * subbu doesn't see anything obviously broken with the invert. [21:52:31] so, once all the other patches are deployed, can we re-try the first config patch? [21:53:22] assuming we have time and there is nothing else pending after this window. if not, we can try this again tomorrow. 
[21:53:25] we're still doing it with wikivoyage and zhwikivoyage false [21:53:31] ya [21:53:33] I'd like to get swfrench-wmf's resizing followup out too, if that's still outstanding, but otherwise no objection from me, as long as we're paying attention [21:53:49] not sure if this matters but there is a graph showing "Parser cache save reason" had a rise for "view" | https://grafana.wikimedia.org/d/a97c66ff-0e10-4d2a-b9e1-37b96b7b4d35/parser-cache-misses?orgId=1&from=now-3h&to=now&viewPanel=32 [21:53:58] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye [21:53:59] like I say this didn't actually break anything, it just swung the graph unexpectedly [21:54:19] rzl: it's applied - I did that somewhat urgently when I thought I was the source of the alert :) [21:54:23] jouncebot: next [21:54:23] In 0 hour(s) and 5 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2200) [21:54:26] okay cool, thanks [21:54:34] only outstanding in the other sense, then :D [21:54:37] so there's that, not sure if they plan to use it [21:54:38] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1008.eqiad.wmnet with OS bullseye [21:54:57] James_F: ^ [21:55:01] (03CR) 10Bking: [C:03+2] cloudelastic: migrate cloudelastic1008 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1124863 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [21:55:47] tgr_, so the revert is now live for the last 5 mins, right?
[21:56:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye [21:56:14] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10607629 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1181.eqiad.wmnet with OS bullseye [21:56:20] yeah, should be live [21:57:06] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye [21:59:23] then if a lot of jobs are still in the queue, that would take a while to process them [21:59:36] so, that parsercache panel that hashar shared is not showing any change in the # of saves because of view even after the revert .. it is still double what it was at the start of the hour. [21:59:38] ah, okay. [21:59:39] I could not find a graph showing the size of the pending queues though, only rates [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2200) [22:00:08] that is all fringe theory, cause really I have long forgot/lost contact with jobs/jobqueue/parser etc [22:01:04] what job is this? htmlCacheUpdate? [22:01:29] parserCachePreWarm apparently [22:01:52] hnowlan shared https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%20prometheus%2Fk8s&from=now-3h&to=now [22:02:01] (03PS1) 10Dzahn: aptrepo: replace http with https in downloads.linux.hpe.com URLs [puppet] - 10https://gerrit.wikimedia.org/r/1124877 (https://phabricator.wikimedia.org/T388042) [22:02:39] parsoidCachePrewarm went from 70 jobs / s to up to 226 jobs / s [22:03:42] if I zoom it out we had a similar behavior this morning around 7:20 [22:04:06] ya .. and it hasn't stopped now after the revert.
[22:04:43] (03CR) 10Dzahn: "curl http://downloads.linux.hpe.com/SDR/repo/mcp/" [puppet] - 10https://gerrit.wikimedia.org/r/1124877 (https://phabricator.wikimedia.org/T388042) (owner: 10Dzahn) [22:05:44] (03PS2) 10Dzahn: aptrepo: replace http with https in downloads.linux.hpe.com URLs [puppet] - 10https://gerrit.wikimedia.org/r/1124877 (https://phabricator.wikimedia.org/T388042) [22:06:49] Can we retry just the first config patch now? [22:07:02] looks like wikifunctions doesn't have anything to deploy now? [22:07:13] (03PS1) 10Ebernhardson: flink-app chart: Repair custom log configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124878 [22:07:37] (I see the same bump at 7:08 this morning for the cache_text fresh backend which I have pasted earlier https://grafana.wikimedia.org/d/O9zAmeOWz/ats-cache-operations?orgId=1&viewPanel=4 ) [22:07:40] so that looks similar [22:07:52] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10607677 (10Jclark-ctr) 05Open→03Resolved Received additional drives and replaced [22:08:27] these jobs are triggered on page view, right? 
[22:09:48] I added the first 10 of the 44 wiktionaries to the pageview tool and it says 24k views a day (so <100/min for all 44 unless there is a huge outlier) [22:10:14] (03CR) 10Ebernhardson: [C:03+2] flink-app chart: Repair custom log configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124878 (owner: 10Ebernhardson) [22:10:16] but job stats went up by like 200/sec [22:10:35] (03PS1) 10Daimona Eaytoy: Revert "Let sysops add/remove the event-organizer group by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738) [22:10:44] (03CR) 10CI reject: [V:04-1] Revert "Let sysops add/remove the event-organizer group by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738) (owner: 10Daimona Eaytoy) [22:10:46] (03PS1) 10Cwhite: grafana: add quotes around interpolated log variables [puppet] - 10https://gerrit.wikimedia.org/r/1124880 [22:11:30] rzl: hashar: ok to give it another try? [22:11:40] (03PS2) 10Daimona Eaytoy: Revert "Let sysops add/remove the event-organizer group by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738) [22:11:58] fine by me [22:12:05] (03Merged) 10jenkins-bot: flink-app chart: Repair custom log configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124878 (owner: 10Ebernhardson) [22:12:16] +1 I guess [22:12:23] but I will stop here, it is too late for me [22:12:36] unless you need someone to drive scap? 
[22:12:55] (03PS1) 10Gergő Tisza: Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124881 (https://phabricator.wikimedia.org/T387505) [22:13:03] no, I can do it [22:14:12] thanks tgr_ [22:14:39] great thanks [22:15:43] I ll check tomorrow morning when I run the train :) [22:15:57] (03PS1) 10Ebernhardson: flink-app: Update chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124882 [22:16:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124881 (https://phabricator.wikimedia.org/T387505) (owner: 10Gergő Tisza) [22:16:17] (03PS3) 10Daimona Eaytoy: Revert "Let sysops add/remove the event-organizer group by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738) [22:17:14] (03Merged) 10jenkins-bot: Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124881 (https://phabricator.wikimedia.org/T387505) (owner: 10Gergő Tisza) [22:17:45] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124881|Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" (T387505)]] [22:17:49] T387505: Parsoid Read Views to Wiktionary deploy ~2025-03-03 - https://phabricator.wikimedia.org/T387505 [22:18:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738) (owner: 10Daimona Eaytoy) [22:18:22] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/3fed3640d35b7e68de691c1a8e75c92260c0dc2c19c4eabc8af14bfa6f7bb315/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [22:18:26] (03CR) 10Ebernhardson: [C:03+2] flink-app: Update chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124882 (owner: 10Ebernhardson) [22:20:16] (03Merged) 10jenkins-bot: flink-app: Update chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124882 (owner: 10Ebernhardson) [22:20:45] !log tgr@deploy2002 tgr: Backport for [[gerrit:1124881|Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" (T387505)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:22:30] subbu: do you want to inspect something or can it go live? [22:23:26] it can go live .. i don't think canaries will reflect any change in latencies or jobs. [22:23:39] !log tgr@deploy2002 tgr: Continuing with sync [22:24:24] parsoidCachePrewarm seems to be running on all wikis btw [22:24:25] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:24:34] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:24:43] or at least logstash has a steady stream of trigger:parsoidCachePrewarm events on large Wikipedias [22:25:42] they are queued for Parsoid when a legacy parser view generates a fresh parse .. to ensure that Parsoid's HTML is ready for when a page might be opened in VE [22:26:56] https://phabricator.wikimedia.org/T327164 [22:28:09] so maybe there was a scraper with unfortunate timing, and it's not at all related to wiktionaries? [22:29:05] (03Abandoned) 10D3r1ck01: [SUL3] varnish: Split frontend cache on `sul3OptIn` cookie [puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [22:29:18] (03CR) 10D3r1ck01: "Ack!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1092323 (https://phabricator.wikimedia.org/T375788) (owner: 10D3r1ck01) [22:29:21] granted the timing matches very well [22:29:29] ya ... [22:29:51] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124881|Revert^2 "Turn on Parsoid Read Views for 44 wiktionaries" (T387505)]] (duration: 12m 06s) [22:29:55] T387505: Parsoid Read Views to Wiktionary deploy ~2025-03-03 - https://phabricator.wikimedia.org/T387505 [22:30:53] https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&var-dc=codfw%2520prometheus%252Fk8s&from=now-30d&to=now&viewPanel=18 shows a cycle with non-zero peaks around 19:20 [22:31:06] which is also when the wiktionary config changes went live. [22:31:07] scap says "21:17:24 Started sync-prod-k8s [22:31:24] and the spike starts at 17:30 [22:31:24] sorry 21:20 [22:32:19] so, one additional observation: one of the reasons the effect of this was "amplified" is that it seems the PHP 8.1 deployments of mediawiki took the brunt of the load from this [22:32:39] (03PS1) 10Ebernhardson: flink-app: Provide full log4j-console.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124884 [22:32:41] you can see a sizable (~ 20%) bump in RPS on them when the backports went out [22:32:49] which is not visible on the 7.4 deployments [22:33:00] (though both experience elevated latency) [22:33:27] so, looks like this retry went through fine so far? [22:34:29] the "only" difference between external traffic directed to the 8.1-based deployments vs. 
7.4 is the kinds of clients: these are all "real people using browsers" (e.g., accept cookies and run javascript) [22:35:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [22:36:13] 8.1 is real people? [22:36:28] yeah, no spike this time [22:37:12] subbu: as a simplification, yeah - in the sense that only clients presenting an enrollment cookie (which is granted by js that runs in-browser) are routed there [22:37:32] got it. [22:38:03] anyway, we can rule out wiktionaries config having been the source of the spike. The only thing left to rule out is the invert patch. should we try that tomorrow or now? [22:38:40] what's weird is, there was a spike in GET requests: https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&from=now-3h&to=now&viewPanel=62 [22:38:51] how can parsoid cause that? [22:39:01] jobs are POST requests, right? 
[22:39:12] (03PS2) 10Ebernhardson: flink-app: Provide full log4j-console.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124884 [22:39:28] tgr_: that's exactly the increase I was talking about, yeah [22:39:37] or maybe the method selector is just not working for that metric [22:40:10] ah, that's possible too [22:40:48] tgr_: importantly, as you point out, that's traffic to mw-web, not mw-jobrunner [22:40:56] so, the source should not be jobs [22:42:39] (03PS3) 10Ebernhardson: flink-app: Provide full log4j-console.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124884 [22:42:53] hm, right [22:43:04] there was a spike for both jobs and web [22:43:21] at around 500/sec [22:43:50] which seems ridiculously high for a set of wiktionaries that don't include the big European ones [22:44:00] maybe the job is making an API request? [22:44:31] the job spike is smaller .. but, expected because if the web spike causes a bunch of cache misses on wikis. [22:45:02] if it's a large wiki, I'd assume a major template got reparsed and then that triggered a bunch of recursive parses, but these wiktionaries are fairly small, right? [22:46:22] (03PS4) 10Ebernhardson: flink-app: Provide full log4j-console.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124884 [22:46:30] anyway, should we try the other patch? [22:46:34] yes, many of the bigger wikis went out in earlier deploys (enwikt is still not on parsoid). [22:46:41] works for me. arlolra ? [22:46:44] which is theoretically a noop [22:46:48] yes. 
[22:49:14] sure [22:49:38] (03CR) 10Ebernhardson: [C:03+2] flink-app: Provide full log4j-console.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124884 (owner: 10Ebernhardson) [22:50:10] (03PS1) 10Gergő Tisza: Revert^2 "Invert Parsoid read view wiktionary configs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124885 [22:50:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124885 (owner: 10Gergő Tisza) [22:50:47] all of this is good training for us on CTT :-) [22:51:20] (03Merged) 10jenkins-bot: Revert^2 "Invert Parsoid read view wiktionary configs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124885 (owner: 10Gergő Tisza) [22:51:24] (03Merged) 10jenkins-bot: flink-app: Provide full log4j-console.properties [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124884 (owner: 10Ebernhardson) [22:51:47] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124885|Revert^2 "Invert Parsoid read view wiktionary configs"]] [22:53:34] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:53:49] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:54:45] !log tgr@deploy2002 tgr: Backport for [[gerrit:1124885|Revert^2 "Invert Parsoid read view wiktionary configs"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:57:35] !log tgr@deploy2002 tgr: Continuing with sync [22:58:17] (03PS1) 10Ebernhardson: cirrus: Drop cloudelastic custom logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124886 [22:59:44] subbu: is it expected that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124885 is not a noop, a least according to the config diff check? 
[23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250305T2300) [23:00:08] (03CR) 10Ebernhardson: [C:03+2] cirrus: Drop cloudelastic custom logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124886 (owner: 10Ebernhardson) [23:00:12] for example, it seems to flip frwiktionary from wgParserMigrationEnableParsoidDiscussionTools: true to false [23:00:22] https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/3807/console [23:01:15] swfrench-wmf: that was overlooked in a previous revert we had made, it's fine [23:01:38] (03Merged) 10jenkins-bot: cirrus: Drop cloudelastic custom logConfiguration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124886 (owner: 10Ebernhardson) [23:01:47] but maybe it explains some things [23:02:00] arlolra: ah, got - so the "invert" patch is not itself a noop [23:02:56] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:03:04] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:03:16] it was intended to be a noop but we missed that that was changing [23:03:23] the change is fine though [23:04:01] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124885|Revert^2 "Invert Parsoid read view wiktionary configs"]] (duration: 12m 13s) [23:04:02] arlolra, but looks like there are a bunch of other wiktionaries that flipped to true ... so, there is more going on there. [23:04:11] so pageviews of cold frwiktionary talk pages? [23:04:16] in any case, no spike this time [23:04:36] to clarify, frwiktionary is just one example [23:05:38] swfrench-wmf: is it ok to continue with the other (non-parsoid) backports or does someone intend to investigate more? 
[23:06:38] (03CR) 10Jdlrobson: [C:03+1] Enable Search AB test for en wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [23:06:49] looks like they are all wiktionaries with < 100 pages .. [23:07:13] tgr_: no objections on my end - things continue to stabilize [23:07:23] rzl: any concerns? [23:07:32] I think this is good even if there are other non-noop changes. those other changes did surprise me though. [23:08:16] okay by me [23:08:41] bwang: still around? [23:08:43] the jobworkers are still hot but trending in the right direction [23:08:46] swfrench-wmf: sorry, surprising that we were working off incomplete data [23:09:03] *jobrunner workers [23:10:06] arlolra, https://aa.wiktionary.org/wiki/Special:AllPages is just Main Page .. so, an empty wiki but still lists 100 pages in Special:Statistics. [23:10:14] (03PS1) 10Ebernhardson: flink-app chart: Use ECS logging configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124887 [23:11:04] Same for a couple others I checked. So, I think there are a number of "empty" wiktionaries which all flipped to Parsoid Read Views on the invert of config. We didn't realize it, but looks fine. [23:12:01] I expect the same happened with Scott inverted the wikivoyage config. [23:12:10] *when [23:13:27] tgr_, swfrench-wmf rzl thanks so much for hanging around and helping us work through this and keeping a close eye for issues. [23:13:27] consider: (1) https://usercontent.irccloud-cdn.com/file/zhjArS9X/image.png [23:13:31] (2) https://en.wikipedia.org/wiki/The_Great_Wave_off_Kanagawa [23:13:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2208:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2208 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:13:52] subbu: of course!
thanks for digging into it when it didn't look as expected [23:13:55] (03CR) 10Krinkle: Profiler: emit both statsd and dogstatsd (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite) [23:14:02] subbu: these are all closed wikis [23:14:21] arlolra, aha .. that explains it. [23:14:32] yes, thank you tgr_ swfrench-wmf [23:14:35] rzl, that is funny (great wave off kanagawa). [23:15:00] thank you both for sticking around as well, and tgr_ for rolling forward-and-back :) [23:15:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124860 (https://phabricator.wikimedia.org/T375788) (owner: 10Gergő Tisza) [23:15:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124861 (https://phabricator.wikimedia.org/T375788) (owner: 10Gergő Tisza) [23:15:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124865 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [23:15:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [23:15:53] (03Merged) 10jenkins-bot: Clean up SUL3 config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124865 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [23:16:20] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: 
cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [23:16:34] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.019e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [23:16:35] (03Merged) 10jenkins-bot: Preserve usesul3 flag during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1124860 (https://phabricator.wikimedia.org/T375788) (owner: 10Gergő Tisza) [23:16:36] (03Merged) 10jenkins-bot: Preserve usesul3 flag during autologin [extensions/CentralAuth] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1124861 (https://phabricator.wikimedia.org/T375788) (owner: 10Gergő Tisza) [23:17:11] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124860|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124861|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124865|Clean up SUL3 config (T384007)]] [23:17:15] T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788 [23:17:16] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [23:17:25] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.eqiad.wmnet with OS bullseye [23:19:56] tgr_: bwang's out for the day - we're gonna deploy tomorrow so no worries on that one [23:20:02] thank you though!! 
[23:20:05] !log tgr@deploy2002 tgr: Backport for [[gerrit:1124860|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124861|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124865|Clean up SUL3 config (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:23:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2208:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2208 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:29:10] * subbu slowly backs away from the computer [23:29:43] !log tgr@deploy2002 tgr: Continuing with sync [23:36:04] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124860|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124861|Preserve usesul3 flag during autologin (T375788)]], [[gerrit:1124865|Clean up SUL3 config (T384007)]] (duration: 18m 53s) [23:36:08] T375788: Implement SUL3 central autologin - https://phabricator.wikimedia.org/T375788 [23:36:09] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [23:38:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124866 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [23:39:14] (03Merged) 10jenkins-bot: Roll out SUL3 signup to 1% of users on most group 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124866 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [23:39:41] !log jclark@cumin1002 START - Cookbook 
sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye [23:39:41] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124866|Roll out SUL3 signup to 1% of users on most group 1 wikis (T384007)]] [23:39:49] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10608104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1181.eqiad.wmnet with OS bullseye [23:41:23] 07Puppet, 06SRE, 06Web-Team: Certain mobile devices including XiaoMi are not being redirected to our mobile site - https://phabricator.wikimedia.org/T388032#10608114 (10Jdlrobson-WMF) > This hits both the android and mobile tokens in our regex, and is correctly routed to the mobile site. Yes that's correct,... [23:41:30] (03CR) 10Cwhite: Profiler: emit both statsd and dogstatsd (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite) [23:41:36] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10608111 (10Jdlrobson-WMF) > As part of my analysis at T214998#10551073, I went through much of the long ta... 
[23:42:20] (03PS1) 10Daimona Eaytoy: Drop $wmgCampaignEventsProgramsAndEventsDashboardEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124893 (https://phabricator.wikimedia.org/T387025) [23:42:41] !log tgr@deploy2002 tgr: Backport for [[gerrit:1124866|Roll out SUL3 signup to 1% of users on most group 1 wikis (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:42:44] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [23:43:39] (03PS2) 10Daimona Eaytoy: Drop $wmgCampaignEventsProgramsAndEventsDashboardEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124893 (https://phabricator.wikimedia.org/T387025) [23:44:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124893 (https://phabricator.wikimedia.org/T387025) (owner: 10Daimona Eaytoy) [23:47:37] (03PS1) 10Fabfur: Fix previous commit [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) [23:49:34] (03CR) 10Fabfur: [C:03+1] "seems a good idea to me, thanks for taking care of this!" [puppet] - 10https://gerrit.wikimedia.org/r/1124764 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff) [23:49:59] (03PS1) 10Bartosz Dziewoński: Remove unused $wgDiscussionToolsABTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124895 [23:50:42] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1100193 (https://phabricator.wikimedia.org/T370837) (owner: 10BCornwall) [23:51:11] (03PS7) 10Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) [23:51:13] (03PS1) 10Bartosz Dziewoński: Remove unused $wgOATHAuthMultipleDevicesMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124896 [23:51:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:51:59] (03PS8) 10Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) [23:54:00] (03CR) 10Cwhite: Profiler: emit both statsd and dogstatsd (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite) [23:55:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711 (owner: 10Bartosz Dziewoński) [23:55:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124895 (owner: 10Bartosz Dziewoński) [23:55:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124896 (owner: 10Bartosz Dziewoński)