[00:01:06] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [00:04:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10649820 (10phaultfinder) [00:10:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:51] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:14:04] (03CR) 10Btullis: [C:03+1] Prevent mysql passwords from being logged to stdout [dumps] - 10https://gerrit.wikimedia.org/r/1128918 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [00:15:42] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T388454#10649864 (10Jhancock.wm) 05Open→03Declined [00:16:19] RECOVERY - Disk space on maps1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops [00:17:13] !log restart logrotate on cp3080 [00:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: disk (sdb) failed in moss-be2002 - https://phabricator.wikimedia.org/T389236#10649869 (10Jhancock.wm) a:03Jhancock.wm [00:26:26] 06SRE: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#10649877 (10Novem_Linguae) [00:38:46] (03PS1) 10TrainBranchBot: Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1128993 [00:38:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1128993 (owner: 10TrainBranchBot) [00:45:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:51:09] (03PS1) 10BryanDavis: LabsServices: use appservers service name for parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128996 (https://phabricator.wikimedia.org/T389252) [00:52:31] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1128993 (owner: 10TrainBranchBot) [00:53:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128996 (https://phabricator.wikimedia.org/T389252) (owner: 10BryanDavis) [01:08:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1129001 [01:08:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1129001 (owner: 10TrainBranchBot) [01:29:04] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1129001 (owner: 10TrainBranchBot) [02:11:51] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:13:04] (03CR) 10Andrew Bogott: [C:03+2] Update codfw1dev deploy to openstack version 'dalmation' [puppet] - 
10https://gerrit.wikimedia.org/r/1128945 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:13:07] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps vms: remove definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 (owner: 10Andrew Bogott) [02:24:51] (03CR) 10Ahmon Dancy: "Heads up Alexandros: To be able to log into SpiderPig for testing after deployment, idp.wikimedia.org will need to be configured to allow" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [02:26:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: disk (sdb) failed in moss-be2002 - https://phabricator.wikimedia.org/T389236#10650126 (10Papaul) @MatthewVernon this server is out of warranty we will check to see if we have any spare disk onsite that we can use for replacement [02:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:34:53] (03PS9) 10Andrew Bogott: cloud-vps vms: remove definition for Buster and older distros [puppet] - 10https://gerrit.wikimedia.org/r/1128948 [02:34:53] (03PS1) 10Andrew Bogott: Revert "Update codfw1dev deploy to openstack version 'dalmation'" [puppet] - 10https://gerrit.wikimedia.org/r/1129004 [02:34:53] (03PS1) 10Andrew Bogott: Revert "glance-api.conf: remove commented mention of metadata_encryption_key" [puppet] - 10https://gerrit.wikimedia.org/r/1129005 [02:34:54] (03PS1) 10Andrew Bogott: Revert "nova.conf for dalmation: update pybasedir" [puppet] - 10https://gerrit.wikimedia.org/r/1129006 [02:34:55] (03PS1) 10Andrew Bogott: Revert "Openstack: add new files for openstack version 'dalmation'" [puppet] - 10https://gerrit.wikimedia.org/r/1129007 [02:34:57] (03PS1) 10Andrew Bogott: Openstack: add new files for 
openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) [02:35:02] (03PS1) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1129009 (https://phabricator.wikimedia.org/T381499) [02:35:06] (03PS1) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1129010 (https://phabricator.wikimedia.org/T381499) [02:35:55] (03PS2) 10Andrew Bogott: Revert "Update codfw1dev deploy to openstack version 'dalmation'" [puppet] - 10https://gerrit.wikimedia.org/r/1129004 [02:35:55] (03PS2) 10Andrew Bogott: Revert "glance-api.conf: remove commented mention of metadata_encryption_key" [puppet] - 10https://gerrit.wikimedia.org/r/1129005 [02:35:55] (03PS2) 10Andrew Bogott: Revert "nova.conf for dalmation: update pybasedir" [puppet] - 10https://gerrit.wikimedia.org/r/1129006 [02:35:55] (03PS2) 10Andrew Bogott: Revert "Openstack: add new files for openstack version 'dalmation'" [puppet] - 10https://gerrit.wikimedia.org/r/1129007 [02:35:56] (03PS2) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) [02:35:58] (03PS2) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1129009 (https://phabricator.wikimedia.org/T381499) [02:36:02] (03PS2) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1129010 (https://phabricator.wikimedia.org/T381499) [02:36:06] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:36:10] (03CR) 10CI reject: [V:04-1] nova.conf for dalmation: update pybasedir [puppet] - 
10https://gerrit.wikimedia.org/r/1129009 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:36:18] (03CR) 10CI reject: [V:04-1] glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1129010 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:38:08] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:39:07] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:39:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:40:59] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:41:03] (03PS3) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) [02:41:03] (03PS3) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1129009 (https://phabricator.wikimedia.org/T381499) [02:41:03] (03PS3) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1129010 (https://phabricator.wikimedia.org/T381499) [02:41:26] (03CR) 10Andrew Bogott: [C:03+2] Revert "Update codfw1dev deploy to openstack version 'dalmation'" [puppet] - 10https://gerrit.wikimedia.org/r/1129004 (owner: 10Andrew Bogott) [02:41:29] (03CR) 10Andrew Bogott: [C:03+2] Revert "glance-api.conf: remove commented mention of metadata_encryption_key" [puppet] - 10https://gerrit.wikimedia.org/r/1129005 (owner: 
10Andrew Bogott) [02:41:32] (03CR) 10Andrew Bogott: [C:03+2] Revert "nova.conf for dalmation: update pybasedir" [puppet] - 10https://gerrit.wikimedia.org/r/1129006 (owner: 10Andrew Bogott) [02:41:36] (03CR) 10Andrew Bogott: [C:03+2] Revert "Openstack: add new files for openstack version 'dalmation'" [puppet] - 10https://gerrit.wikimedia.org/r/1129007 (owner: 10Andrew Bogott) [02:41:44] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:42:49] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:42:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53657 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:42:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:45:02] (03PS4) 10Andrew Bogott: Openstack: add new files for openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) [02:45:02] (03PS4) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1129009 (https://phabricator.wikimedia.org/T381499) [02:45:03] (03PS4) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1129010 (https://phabricator.wikimedia.org/T381499) [02:45:33] (03CR) 10CI reject: [V:04-1] Openstack: add new files for openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:47:40] (03PS5) 10Andrew 
Bogott: Openstack: add new files for openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) [02:47:40] (03PS5) 10Andrew Bogott: nova.conf for dalmation: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1129009 (https://phabricator.wikimedia.org/T381499) [02:47:40] (03PS5) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1129010 (https://phabricator.wikimedia.org/T381499) [02:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:51:34] (03PS6) 10Andrew Bogott: nova.conf for dalmatian: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1129009 (https://phabricator.wikimedia.org/T381499) [02:51:34] (03PS6) 10Andrew Bogott: glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1129010 (https://phabricator.wikimedia.org/T381499) [02:51:34] (03PS1) 10Andrew Bogott: Update codfw1dev deploy to openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129013 (https://phabricator.wikimedia.org/T381499) [02:52:14] (03CR) 10Andrew Bogott: [C:03+2] Openstack: add new files for openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129008 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:52:30] (03CR) 10Andrew Bogott: [C:03+2] nova.conf for dalmatian: update pybasedir [puppet] - 10https://gerrit.wikimedia.org/r/1129009 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:52:39] (03CR) 10Andrew Bogott: [C:03+2] glance-api.conf: remove commented mention of metadata_encryption_key [puppet] - 10https://gerrit.wikimedia.org/r/1129010 
(https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [02:52:46] (03CR) 10Andrew Bogott: [C:03+2] Update codfw1dev deploy to openstack version 'dalmatian' [puppet] - 10https://gerrit.wikimedia.org/r/1129013 (https://phabricator.wikimedia.org/T381499) (owner: 10Andrew Bogott) [03:03:23] (03PS1) 10Andrew Bogott: Remove wmcs-live-migrate.py [puppet] - 10https://gerrit.wikimedia.org/r/1129015 [03:04:18] (03CR) 10Andrew Bogott: [C:03+2] Remove wmcs-live-migrate.py [puppet] - 10https://gerrit.wikimedia.org/r/1129015 (owner: 10Andrew Bogott) [03:10:29] PROBLEM - Disk space on an-druid1004 is CRITICAL: DISK CRITICAL - free space: /srv 97603 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1004&var-datasource=eqiad+prometheus/ops [03:19:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10650251 (10phaultfinder) [04:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [04:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10650286 (10phaultfinder) [04:48:45] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:48:55] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:52:45] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:52:55] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:59:29] Minor cxserver deployment.. [05:01:37] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-03-14-045617-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128066 (https://phabricator.wikimedia.org/T382294) (owner: 10KartikMistry) [05:03:03] (03Merged) 10jenkins-bot: Update cxserver to 2025-03-14-045617-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128066 (https://phabricator.wikimedia.org/T382294) (owner: 10KartikMistry) [05:04:25] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [05:04:49] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [05:06:19] PROBLEM - Disk space on maps1009 is CRITICAL: DISK CRITICAL - free space: / 2556 MB (3% inode=96%): /tmp 2556 MB (3% inode=96%): /var/tmp 2556 MB (3% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:13:06] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [05:13:36] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [05:14:57] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [05:15:31] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [05:16:51] !log Updated cxserver to 2025-03-14-045617-production (T382294) [05:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:55] T382294: Use openapi compliant examples in swagger spec - 
https://phabricator.wikimedia.org/T382294 [05:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10650373 (10phaultfinder) [05:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:45:03] PROBLEM - Host cloudnet2005-dev is DOWN: PING CRITICAL - Packet loss = 100% [05:45:45] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:45:55] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:46:25] RECOVERY - Host cloudnet2005-dev is UP: PING OK - Packet loss = 0%, RTA = 30.29 ms [05:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:49:21] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1248 crash - https://phabricator.wikimedia.org/T388837#10650376 (10Marostegui) No, you can proceed whenever it's more convenient for you Thanks [05:49:45] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:49:54] (03CR) 10Fabfur: [C:03+1] pontoon: update and clarify instructions [puppet] - 10https://gerrit.wikimedia.org/r/1128857 (owner: 10Filippo Giunchedi) [05:49:55] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:56:42] RESOLVED: JobUnavailable: Reduced availability for job 
sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:59:07] PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100% [05:59:47] RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 30.33 ms [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T0600) [06:09:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10650379 (10phaultfinder) [06:37:09] !log Upgrading cp4042 to Varnish 7 (T378737) [06:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:13] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [06:52:48] !log Upgrading cp4045 to Varnish 7 (T378737) [06:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:52] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [07:03:57] (03PS4) 10Filippo Giunchedi: pontoon: update and clarify instructions [puppet] - 10https://gerrit.wikimedia.org/r/1128857 [07:11:23] (03CR) 10Arnaudb: [C:03+2] nftables: add a newline at the end of GERRIT_ABUSERS lines [puppet] - 10https://gerrit.wikimedia.org/r/1128338 (https://phabricator.wikimedia.org/T388783) (owner: 10Arnaudb) [07:12:33] (03PS2) 10Slyngshede: data.yaml: Offboarding jebe [puppet] - 10https://gerrit.wikimedia.org/r/1128783 [07:14:26] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10650455 (10fgiunchedi) >>! In T374711#10649030, @jhathaway wrote: > @fgiunchedi should we consider this issue resolved, since the arming step for keyhold... 
[07:14:32] (03CR) 10Ayounsi: [C:03+2] Sandbox vlan, allow return http(s) monitoring traffic [homer/public] - 10https://gerrit.wikimedia.org/r/1128401 (https://phabricator.wikimedia.org/T388419) (owner: 10Ayounsi) [07:15:05] (03Merged) 10jenkins-bot: Sandbox vlan, allow return http(s) monitoring traffic [homer/public] - 10https://gerrit.wikimedia.org/r/1128401 (https://phabricator.wikimedia.org/T388419) (owner: 10Ayounsi) [07:15:47] PROBLEM - ganeti-noded running on ganeti2040 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [07:16:47] RECOVERY - ganeti-noded running on ganeti2040 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [07:19:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10650462 (10phaultfinder) [07:20:13] (03PS1) 10Filippo Giunchedi: logstash: move filter_truncate before indexing/output [puppet] - 10https://gerrit.wikimedia.org/r/1129128 (https://phabricator.wikimedia.org/T389072) [07:21:59] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10650469 (10fgiunchedi) [07:26:41] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding jebe [puppet] - 10https://gerrit.wikimedia.org/r/1128783 (owner: 10Slyngshede) [07:34:08] (03CR) 10Hashar: "Hi Jesse and Valentin, my guess is you respectively know about Hiera and Varnish `abuse_networks`. 
I need the blocks defined in differen" [puppet] - 10https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) (owner: 10Hashar) [07:39:56] (03CR) 10Slyngshede: [C:03+2] P:firewall remove connection tracking monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1128405 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:40:28] (03PS1) 10Filippo Giunchedi: hieradata: move prometheus k8s instances off prometheus2006 [puppet] - 10https://gerrit.wikimedia.org/r/1129173 (https://phabricator.wikimedia.org/T383232) [07:41:08] (03CR) 10Filippo Giunchedi: "To be merged early next week" [puppet] - 10https://gerrit.wikimedia.org/r/1129173 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [07:43:31] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1128892 (https://phabricator.wikimedia.org/T389210) (owner: 10Vgutierrez) [07:51:21] !log rebalance ganeti eqiad/B following reimages T382507 [07:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:25] T382507: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507 [07:53:59] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jennifer Ebe out of all services on: 949 hosts [07:54:54] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jennifer Ebe out of all services on: 1294 hosts [08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [08:07:00] I will do the set of unused-config patches I wanted to deploy yesterday [08:07:31] hashar: I have a security patch I'd like to sync [08:08:16] can I do that when you're finished? [08:10:22] (CR) Filippo Giunchedi: [C:+2] prometheus: disable 'accelerator' cadvisor metric [puppet] - https://gerrit.wikimedia.org/r/1128319 (https://phabricator.wikimedia.org/T388632) (owner: Filippo Giunchedi) [08:10:27] kostajh: I think we can do them all at the same time [08:10:37] all my patches are theoretically noops [08:10:42] hashar: would you be able to do the sync? (I can also do it if you want) [08:10:51] they remove some $wg variables that no longer exist in extensions [08:11:04] hashar: https://phabricator.wikimedia.org/T389235 is the task [08:11:04] what is your change?
[08:11:16] and the patch is in https://phabricator.wikimedia.org/T389235#10648209 [08:11:21] ohhh [08:12:05] kostajh: go for it :) [08:12:31] else I gotta read the doc about security patches cause I haven't done it in a while [08:12:43] you and me both :) [08:12:59] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Security_patches [08:12:59] ahah [08:13:07] looks like there's now a script [08:13:22] so, OK, I will start now [08:13:34] we can pair over a video call if you want [08:13:41] sure [08:14:06] sent you a link on slack [08:18:24] (03PS1) 10Ayounsi: sandbox1-b3-magru: add v4 and v6 includes [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) [08:18:52] (03CR) 10CI reject: [V:04-1] sandbox1-b3-magru: add v4 and v6 includes [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [08:18:54] 06SRE, 06Infrastructure-Foundations: /etc/wikimedia/logout.d/50-systemdlogoutd sometimes fails to terminate user session on stat hosts - https://phabricator.wikimedia.org/T389324 (10MoritzMuehlenhoff) 03NEW [08:20:54] !log Upgrading cp4046 to Varnish 7 (T378737) [08:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:58] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [08:21:18] (03CR) 10Ayounsi: [C:04-1] "To be merged only when the first IP is created, as this is required to generate the file to be included." 
[dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [08:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10650566 (10phaultfinder) [08:27:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:30:06] (03PS1) 10DCausse: cirrus: explicitly route search traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) [08:30:08] (03PS1) 10DCausse: cirrus: explicitly route search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) [08:30:09] (03PS1) 10DCausse: cirrus: switch search traffic back to multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129183 (https://phabricator.wikimedia.org/T388610) [08:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:37:51] !log kharlan Deployed security patch for T389235 [08:38:16] (03PS1) 10Elukey: sre.discovery.datacenter: remove kartotherian from EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1129184 [08:41:04] (03CR) 10MVernon: [C:03+2] swift: remove ms-be2075 from rings [puppet] - 10https://gerrit.wikimedia.org/r/1128907 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [08:41:43] (03PS2) 10DCausse: cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) [08:41:43] (03PS2) 
10DCausse: cirrus-streaming-updater: produce to v1 update streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124485 (https://phabricator.wikimedia.org/T375821) [08:41:43] (03PS2) 10DCausse: cirrus-streaming-updater: stop consuming from legacy streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821) [08:42:56] (03CR) 10DCausse: cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [08:43:52] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [08:46:19] RECOVERY - Disk space on maps1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops [08:47:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:57] (03CR) 10Elukey: [C:03+2] role::ml_k8s::staging::master: move to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128462 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:48:05] (03Abandoned) 10Filippo Giunchedi: Revert^2 "P:firewall absent check_conntrack script." 
[puppet] - 10https://gerrit.wikimedia.org/r/1126057 (owner: 10Slyngshede) [08:48:53] (03PS2) 10Elukey: role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128463 (https://phabricator.wikimedia.org/T387854) [08:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:49:53] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-staging-ctrl2001.codfw.wmnet with OS bookworm [08:51:16] !log kharlan Deployed security patch for T389235 [08:51:29] hashar: done [08:53:36] (03CR) 10Elukey: [C:03+2] service: set kartotherian and kartotherian-ssl to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128344 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [08:53:48] (03CR) 10Alexandros Kosiaris: [C:04-1] create a namespace for codesearch on k8s-aux cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [08:54:05] (03CR) 10Slyngshede: [C:03+2] Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [08:54:22] (03CR) 10Alexandros Kosiaris: [C:03+1] "+1, feel free to merge this" [puppet] - 10https://gerrit.wikimedia.org/r/1126170 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [08:54:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129184 (owner: 10Elukey) [08:54:31] (03PS2) 10Elukey: service: set kartotherian and kartotherian-ssl to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128345 (https://phabricator.wikimedia.org/T389042) [08:55:33] kostajh: congratulations! 
[08:55:49] 07Puppet, 06Data-Persistence, 10database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#10650693 (10jcrespo) 05Open→03Resolved a:03jcrespo This has not happened since, the rate of backup errors are very low. It... [08:55:50] I'll skip deploying the series of patches to remove unused config flags since the train is starting in 5 minutes [08:56:18] I guess I will do them next week :) [08:56:54] (03Merged) 10jenkins-bot: Update unittests to handle BituLDAP update [software/bitu] - 10https://gerrit.wikimedia.org/r/1105018 (owner: 10Slyngshede) [08:57:00] (03CR) 10Brouberol: [C:03+2] Fix settings deserializing by adjusing the indices [dumps] - 10https://gerrit.wikimedia.org/r/1128897 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [08:57:03] (03CR) 10Brouberol: [C:03+2] Prevent mysql passwords from being logged to stdout [dumps] - 10https://gerrit.wikimedia.org/r/1128918 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [08:57:23] (03Merged) 10jenkins-bot: Fix settings deserializing by adjusing the indices [dumps] - 10https://gerrit.wikimedia.org/r/1128897 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [08:57:25] (03Merged) 10jenkins-bot: Prevent mysql passwords from being logged to stdout [dumps] - 10https://gerrit.wikimedia.org/r/1128918 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [08:57:29] (03CR) 10Brouberol: [C:03+1] Remove docker related referrences on dse-k8s worker and master [puppet] - 10https://gerrit.wikimedia.org/r/1119106 (https://phabricator.wikimedia.org/T377875) (owner: 10Stevemunene) [08:58:19] (03Abandoned) 10Brouberol: global_config: add external services for opensearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/1122900 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [08:58:55] (03CR) 10Elukey: [C:03+2] sre.discovery.datacenter: remove kartotherian from
EXCLUDED_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1129184 (owner: 10Elukey) [08:58:57] hashar: thanks, and sorry to have taken over the window from you [08:59:31] (03PS1) 10Brouberol: airflow-main: increase the scheduler resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129190 (https://phabricator.wikimedia.org/T386282) [08:59:32] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1126517 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [08:59:33] (03PS1) 10Brouberol: airflow-test-k8s/main: allow egress to cassandra-analytics-query-service-storage-a-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) [09:00:05] jnuche and jeena: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T0900). [09:00:10] !log remove kartotherian.discovery.wmnet:{80,443} ports from LVS config (some extra noise may be registered) [09:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:28] good morning, I'll rollout the train in a few minutes [09:02:18] (03PS2) 10Elukey: service, conftool-data: final removal for unused Kartotherian configs [puppet] - 10https://gerrit.wikimedia.org/r/1128346 (https://phabricator.wikimedia.org/T389042) [09:02:23] (03PS2) 10Elukey: maps: remove Kartotherian from bare metal nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128348 (https://phabricator.wikimedia.org/T389042) [09:02:37] (03PS1) 10Ayounsi: asw1-b3-magru: Add sandbox vlan filter [homer/public] - 10https://gerrit.wikimedia.org/r/1129192 (https://phabricator.wikimedia.org/T385560) [09:03:12] (03Abandoned) 10DCausse: airflow: enable show_trigger_form_if_no_params [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114327 (https://phabricator.wikimedia.org/T384805) (owner: 10DCausse) [09:03:15] (03CR) 
10Alexandros Kosiaris: [C:03+1] add ingress service aliases for codesearch on k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [09:03:38] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1128346 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [09:03:39] (03CR) 10Elukey: maps: remove Kartotherian from bare metal nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1128348 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [09:03:45] (03CR) 10Vgutierrez: [C:03+1] cache,haproxy: use parametrized tmpfiles cert dir [puppet] - 10https://gerrit.wikimedia.org/r/1126517 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [09:03:46] (03CR) 10Ayounsi: [C:04-1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [09:04:18] (03CR) 10CI reject: [V:04-1] sandbox1-b3-magru: add v4 and v6 includes [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [09:04:29] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129193 (https://phabricator.wikimedia.org/T386216) [09:04:30] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129193 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [09:05:05] (03CR) 10Alexandros Kosiaris: [C:03+1] "Adding Keith per the last comment in https://gerrit.wikimedia.org/r/c/operations/dns/+/1126182 since these appear to be related and which " [dns] - 10https://gerrit.wikimedia.org/r/1126180 (https://phabricator.wikimedia.org/T345894) (owner: 10Dzahn) [09:05:30] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129193 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [09:06:42] !log 
installing pdns-recursor security updates on DoH hosts [09:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1128799 (owner: 10Slyngshede) [09:08:33] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: host reimage [09:10:47] (03PS1) 10Elukey: services: update Kartotherian's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129194 [09:11:43] (03CR) 10Jgiannelos: [C:03+1] services: update Kartotherian's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129194 (owner: 10Elukey) [09:12:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-ctrl2001.codfw.wmnet with reason: host reimage [09:13:01] !log dcausse@deploy2002 Started deploy [airflow-dags/search@e55954c]: publish search artifacts [09:13:40] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@e55954c]: publish search artifacts (duration: 00m 38s) [09:15:19] kostajh: no worries, my patches are noop and definitely not urgent :] [09:15:36] I will probably end up deploy them out of a normal window [09:15:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10650741 (10phaultfinder) [09:16:09] 10SRE-tools, 10Spicerack: Allow to discover/test in more isolation spicerack features - https://phabricator.wikimedia.org/T389329 (10Volans) 03NEW p:05Triage→03Medium [09:18:58] (03CR) 10Cathal Mooney: [C:03+1] asw1-b3-magru: Add sandbox vlan filter [homer/public] - 10https://gerrit.wikimedia.org/r/1129192 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [09:19:51] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.21 refs T386216 [09:19:55] T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216 
[09:25:06] I'm seeing a spike of new errors, gonna roll back [09:25:34] (03CR) 10Ayounsi: [C:03+2] asw1-b3-magru: Add sandbox vlan filter [homer/public] - 10https://gerrit.wikimedia.org/r/1129192 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [09:25:44] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129196 (https://phabricator.wikimedia.org/T386216) [09:25:46] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129196 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [09:26:13] (03Merged) 10jenkins-bot: asw1-b3-magru: Add sandbox vlan filter [homer/public] - 10https://gerrit.wikimedia.org/r/1129192 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [09:26:39] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129196 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [09:29:46] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129197 [09:29:46] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129198 [09:30:22] !log Stop MariaDB on db1248 T388837 [09:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:26] T388837: db1248 crash - https://phabricator.wikimedia.org/T388837 [09:30:53] !log Upgrading cp4048 to Varnish 7 (T378737) [09:30:55] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging-ctrl2001.codfw.wmnet with OS bookworm [09:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:56] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [09:30:57] (03PS1) 10Marostegui: db1248: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1129199 
(https://phabricator.wikimedia.org/T388837) [09:31:11] (03CR) 10Fabfur: "tnx!" [puppet] - 10https://gerrit.wikimedia.org/r/1126517 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [09:31:13] (03CR) 10Gmodena: [C:03+1] cirrus-streaming-updater: consume from new v1 & legacy rc0 streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124484 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [09:31:13] (03CR) 10Fabfur: [C:03+2] cache,haproxy: use parametrized tmpfiles cert dir [puppet] - 10https://gerrit.wikimedia.org/r/1126517 (https://phabricator.wikimedia.org/T387826) (owner: 10Fabfur) [09:31:25] (03CR) 10Marostegui: [C:03+2] db1248: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1129199 (https://phabricator.wikimedia.org/T388837) (owner: 10Marostegui) [09:31:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [09:31:35] fabfur: good to merge your changes? [09:31:43] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: update and clarify instructions [puppet] - 10https://gerrit.wikimedia.org/r/1128857 (owner: 10Filippo Giunchedi) [09:31:51] marostegui yep, I was going but you did first! [09:31:58] fabfur: merging! [09:32:02] tnx [09:32:08] fabfur: prego [09:32:16] :) [09:32:36] marostegui: 🤌🤌🤌 [09:32:55] (i'm in line to merge too) [09:32:58] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-staging-ctrl2002.codfw.wmnet with OS bookworm [09:32:58] godog: ciao bambino! [09:33:04] lol [09:33:18] :? [09:33:21] godog: My merge finished [09:33:30] \o/ thank you [09:33:46] 👍 [09:36:34] (03CR) 10Btullis: [C:03+1] airflow-main: increase the scheduler resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129190 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:36:56] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? 
- https://phabricator.wikimedia.org/T389333 (10MoritzMuehlenhoff) 03NEW [09:39:35] (03CR) 10Btullis: "Is there not an equivalent aqs1010-b, aqs1011-b etc ?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [09:40:16] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.21 refs T386216 [09:40:21] T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216 [09:47:39] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10650888 (10ayounsi) Port moved and still the same issue. I asked them (in French) if the patch got properly changed, and to call me on my mobile to discuss it more in details. [09:48:20] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging-ctrl2002.codfw.wmnet with reason: host reimage [09:48:25] !log add sandbox vlan on asw1-b3-magru - T385560 [09:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:28] T385560: Create RIPE Atlas anchors VMs - https://phabricator.wikimedia.org/T385560 [09:52:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging-ctrl2002.codfw.wmnet with reason: host reimage [09:53:01] (03CR) 10Elukey: [C:03+2] services: update Kartotherian's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129194 (owner: 10Elukey) [09:55:19] PROBLEM - Disk space on an-druid1003 is CRITICAL: DISK CRITICAL - free space: /srv 105579 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1003&var-datasource=eqiad+prometheus/ops [09:55:54] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [09:57:01] !log 
elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:00:17] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:02:44] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:04:42] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [10:04:57] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:05:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:05:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:05:22] (03CR) 10Slyngshede: [C:03+2] Always create a new connection to LDAP [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1128799 (owner: 10Slyngshede) [10:05:57] (03PS1) 10Jon Harald Søby: Revert^4 "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129204 [10:06:16] (03CR) 10Ayounsi: [C:04-1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [10:06:44] (03CR) 10CI reject: [V:04-1] sandbox1-b3-magru: add v4 and v6 includes [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [10:06:48] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [10:07:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:03] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - 
AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:10] (03Merged) 10jenkins-bot: Always create a new connection to LDAP [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1128799 (owner: 10Slyngshede) [10:09:03] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [10:09:17] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:10:37] !log Upgrading cp4049 to Varnish 7 (T378737) [10:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:41] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [10:11:04] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [10:11:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging-ctrl2002.codfw.wmnet with OS bookworm [10:13:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1111.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [10:13:09] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pickup new magru sandbox includes files - ayounsi@cumin1002" [10:13:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: pickup new magru sandbox includes files - ayounsi@cumin1002" [10:13:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:13:44] (03CR) 10Ayounsi: [C:04-1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [10:13:51] (03CR) 10Ayounsi: sandbox1-b3-magru: add v4 and v6 includes [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [10:18:41] !log trunk sandbox vlan 
to ganeti7001/3 - T385560 [10:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:45] T385560: Create RIPE Atlas anchors VMs - https://phabricator.wikimedia.org/T385560 [10:21:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10651049 (10elukey) @Jclark-ctr thanks a lot for reporting! From the logs I see two issues for the hosts: 1) RuntimeError: JunOS config co... [10:22:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10651062 (10elukey) I've ran the cookbook as `cookbook sre.hosts.provision elastic1111 --no-switch --no-users --uefi` and it worked nicely,... [10:29:15] (03CR) 10Brouberol: "There is, but it seems to be a different cluster, with different IPs. I can add it, though, no problem." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) (owner: 10Brouberol) [10:30:35] (03CR) 10Volans: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [10:31:18] (03PS2) 10Brouberol: airflow-test-k8s/main: allow egress to cassandra-analytics-query-service-storage-a-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129191 (https://phabricator.wikimedia.org/T386282) [10:32:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 456279416 and 29 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:33:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:34:10] (03CR) 10Ayounsi: [C:03+2] sandbox1-b3-magru: add v4 and v6 
includes [dns] - 10https://gerrit.wikimedia.org/r/1129180 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [10:35:54] !log ayounsi@dns1004 START - running authdns-update [10:38:06] !log ayounsi@dns1004 END - running authdns-update [10:44:42] 06SRE, 06serviceops, 10Wikidata, 10Wikidata Integration in Wikimedia projects, 10Wikimedia-Site-requests: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10651140 (10thiemowmde) The #wikidata_integration_in_wikimedia_projects team essentially does have two... [10:44:47] (03CR) 10Clément Goubert: [C:03+1] "LGTM, a little concerned about the noise in public channels, but that's a client-side worry. Thanks!" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 (owner: 10Volans) [10:48:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet [10:49:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet [10:55:46] !log Upgrading cp4050 to Varnish 7 (T378737) [10:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:51] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [10:56:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 654733304 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:59:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 62440 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:59:14] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet [10:59:16] hey folks! I'm wanting to run a maintenance script later in the day that's in an extension..... 
what's the right way to use `mwscript-k8s` to refer to `Flow/FlowMoveBoardsToSubpages`? [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1100) [11:01:22] (03PS2) 10Muehlenhoff: osm_replica: Fix Hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) [11:04:36] (03CR) 10Muehlenhoff: "check" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 (owner: 10Muehlenhoff) [11:06:48] Krinkle: can I pick on you for this as you wrote the wiki page? [11:06:56] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [11:07:10] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet [11:09:14] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129197 (owner: 10PipelineBot) [11:10:43] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129197 (owner: 10PipelineBot) [11:13:43] (03PS1) 10Gergő Tisza: wikitech: Remove $wgCookieDomain override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129215 (https://phabricator.wikimedia.org/T389318) [11:14:27] PROBLEM - Swift https frontend on ms-fe1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Swift [11:14:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [11:14:55] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Swift [11:14:57] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - 
https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:15:03] PROBLEM - Swift https frontend on ms-fe1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [11:15:03] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe1010.eqiad.wmnet, ms-fe1014.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:15:25] Emperor: ^^? [11:15:25] ^ Emperor known? [11:15:44] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [11:15:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:15:53] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Swift [11:15:55] RECOVERY - Swift https frontend on ms-fe1013 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.805 second response time https://wikitech.wikimedia.org/wiki/Swift [11:16:03] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:16:05] checking impact [11:16:18] spike of 5xx [11:16:27] RECOVERY - Swift https frontend on ms-fe1009 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Swift [11:16:42] I 
will check if going down, otherwise we should open an incident [11:17:34] not going down, I think upload has problems [11:17:38] I couldn't see a thumbnail and was giving "upstream error" (I assume overload) but it's back now [11:18:33] I cannot see thumbs at https://commons.wikimedia.org/wiki/Special:NewFiles [11:18:36] https://grafana.wikimedia.org/goto/V18HlQhNR?orgId=1 [11:18:51] swift is struggling big time [11:19:08] I am going to put a notice on status page for now [11:19:38] 503s are decreasing though [11:19:49] frontend i/o went up from 2ish GB to 6 GB [11:19:57] RESOLVED: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:20:01] waiting before opening a formal incident though [11:20:05] (03CR) 10CI reject: [V:04-1] Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 (owner: 10Muehlenhoff) [11:20:11] but going down again [11:20:30] and checking for user reports [11:20:36] thumbor suffering because of the backend being overloaded so I'm not surprised you're not seeing thumbs jynus [11:20:44] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [11:20:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:20:48] big spike on thumbor queues also [11:21:00] 
not seeing any ticket yet [11:21:03] Went from 1GiB IO write to basically 0 [11:21:21] (03PS1) 10Cathal Mooney: PuppetDB Import: Fix check to find child ints no longer in puppetdb [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1129216 (https://phabricator.wikimedia.org/T388770) [11:21:37] 741 errors/second [11:22:13] <300 already [11:22:47] ok, I was going to propose a rolling restart, but if it went down maybe it went through [11:23:00] yeah, seeing recoveries [11:23:57] is the spike because of traffic patterns? [11:26:32] sorry, was AFK [11:28:09] that is a weird spike [11:28:23] (03CR) 10Ayounsi: [C:03+1] "lgtm!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1129216 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [11:30:38] 06SRE, 06Data-Engineering, 10DPE-Mediawiki-Content, 10Dumps-Generation, 07Epic: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10651347 (10BTullis) Just to follow up on this, we have confirmed that there is a performance regression when using d... [11:30:59] (03PS1) 10Ladsgroup: Bump thumbnail steps ratio to 25% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129218 (https://phabricator.wikimedia.org/T360589) [11:32:06] looks to have self-resolved though [11:33:55] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:34:13] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:34:19] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:35:12] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:35:16] (03PS1) 10Ayounsi: network/data.yaml: add sandbox1-b3-magru [puppet] - 10https://gerrit.wikimedia.org/r/1129219 (https://phabricator.wikimedia.org/T385560) [11:36:35] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:37:03] !log switch ganeti master for magru01 to ganeti7001 [11:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:19] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:39:06] (03PS1) 10Kosta Harlan: JobExecutor: Activate wrapping span [extensions/EventBus] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129221 (https://phabricator.wikimedia.org/T389331) [11:39:40] jnuche: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/1129221 should unblock the train [11:39:59] PROBLEM - ganeti-wconfd running on ganeti7003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:40:12] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host atlas7001.wikimedia.org [11:40:14] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [11:40:39] (03PS1) 10Fabfur: WIP: testing for custom cert path for acmecerts and unified [puppet] - 10https://gerrit.wikimedia.org/r/1129223 [11:41:57] ^ 7003 is expected [11:43:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (owner: 10Fabfur) [11:44:22] jouncebot: now [11:44:22] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1100) [11:44:29] (03CR) 10Fabfur: [C:04-2] "Do not consider for merging now, it's just a WIP for test" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (owner: 10Fabfur) [11:44:54] kostajh: ty for creating the backport change [11:46:52] (03CR) 10Volans: [C:03+2] "thanks for the reviews" [software/pywmflib] - 
10https://gerrit.wikimedia.org/r/1125953 (owner: 10Volans) [11:46:58] (03CR) 10Volans: [C:03+2] tests: remove unnecessary vulture setting [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125954 (owner: 10Volans) [11:47:23] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:48:05] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas7001.wikimedia.org - ayounsi@cumin1002" [11:48:11] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas7001.wikimedia.org - ayounsi@cumin1002" [11:48:11] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:48:11] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache atlas7001.wikimedia.org on all recursors [11:48:13] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:48:14] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas7001.wikimedia.org on all recursors [11:48:48] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas7001.wikimedia.org - ayounsi@cumin1002" [11:48:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas7001.wikimedia.org - ayounsi@cumin1002" [11:48:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas7001.wikimedia.org [11:49:13] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:50:59] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:51:32] (03Merged) 10jenkins-bot: interactive: notify when waiting for input [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125953 (owner: 10Volans) [11:51:46] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:51:58] (03Merged) 10jenkins-bot: tests: remove unnecessary vulture setting [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1125954 (owner: 10Volans) [11:52:17] (03PS1) 10Btullis: Allow analytice-wmde-users limited journalctl access to their units [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) [11:52:43] !log installing gtk+3.0 bugfix updates from Bookworm point release [11:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:53] (03CR) 10CI reject: [V:04-1] Allow analytice-wmde-users limited journalctl access to their units [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) (owner: 10Btullis) [11:52:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7003.magru.wmnet [11:54:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7003.magru.wmnet [11:54:26] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:56:38] (03PS2) 10Btullis: Allow analytice-wmde-users limited journalctl access to their units [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) [11:57:11] (03PS1) 10Reedy: CommonSettings: Migrate CentralNotice to Virtual Domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) [11:57:12] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point 
update - https://phabricator.wikimedia.org/T379600#10651504 (10MoritzMuehlenhoff) [11:57:13] (03PS1) 10Reedy: CommonSettings.php: Remove old $wgCentralDBname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348) [11:57:22] (03CR) 10Reedy: [C:04-2] "definitely a not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129230 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [12:00:04] mvolz: Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1200). Please do the needful. [12:00:31] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti7003.magru.wmnet [12:00:48] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [12:01:11] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:05:22] jouncebot: nowandnext [12:05:22] For the next 0 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1200) [12:05:22] In 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1300) [12:06:10] (03CR) 10Cathal Mooney: [C:03+2] PuppetDB Import: Fix check to find child ints no longer in puppetdb [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1129216 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [12:07:00] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [12:07:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary 
[12:07:50] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:08:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129218 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [12:08:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7003.magru.wmnet [12:08:18] (03Merged) 10jenkins-bot: PuppetDB Import: Fix check to find child ints no longer in puppetdb [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1129216 (https://phabricator.wikimedia.org/T388770) (owner: 10Cathal Mooney) [12:08:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [12:08:26] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:08:35] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [12:08:48] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [12:08:51] (03Merged) 10jenkins-bot: Bump thumbnail steps ratio to 25% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129218 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [12:08:56] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:09:29] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:09:36] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:09:43] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:09:55] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1129218|Bump thumbnail steps ratio to 25% (T360589)]] [12:09:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) 
rolling restart_daemons on A:netbox [12:09:59] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [12:10:03] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:10:28] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:12:59] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1127571 (owner: 10PipelineBot) [12:13:03] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128381 (owner: 10PipelineBot) [12:13:09] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129198 (owner: 10PipelineBot) [12:14:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10651595 (10phaultfinder) [12:14:51] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1129218|Bump thumbnail steps ratio to 25% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:15:19] PROBLEM - Hadoop NodeManager on an-worker1141 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:18:31] (03PS2) 10Fabfur: WIP: testing for custom cert path for acmecerts and unified [puppet] - 10https://gerrit.wikimedia.org/r/1129223 [12:19:19] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:21:50] (03CR) 10Fabfur: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (owner: 10Fabfur) [12:23:08] (03CR) 10Alexandros Kosiaris: [C:03+2] profile::scap::spiderpig: New profile for setting up SpiderPig (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [12:23:31] !log installing openjdk 
17 security updates on puppet servers (the necessary restarts may cause a few interrupted puppet runs and will be splayed out) [12:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:25:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10651625 (10phaultfinder) [12:26:35] (03PS3) 10Fabfur: WIP: testing for custom cert path for acmecerts and unified [puppet] - 10https://gerrit.wikimedia.org/r/1129223 [12:26:44] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129218|Bump thumbnail steps ratio to 25% (T360589)]] (duration: 16m 49s) [12:26:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:26:48] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [12:28:43] jouncebot: now [12:28:43] For the next 0 hour(s) and 31 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1200) [12:29:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [12:30:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy2002 using scap backport" 
[extensions/EventBus] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129221 (https://phabricator.wikimedia.org/T389331) (owner: 10Kosta Harlan) [12:30:32] <_joe_> ^ puppetserver1003 is unresponsive [12:31:26] <_joe_> oh, just restarted I guess [12:31:48] <_joe_> moritzm: that's you I guess? [12:32:02] <_joe_> yeah [12:32:04] (03Merged) 10jenkins-bot: JobExecutor: Activate wrapping span [extensions/EventBus] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129221 (https://phabricator.wikimedia.org/T389331) (owner: 10Kosta Harlan) [12:32:38] !log jnuche@deploy2002 Started scap sync-world: Backport for [[gerrit:1129221|JobExecutor: Activate wrapping span (T389331)]] [12:32:42] T389331: Wikimedia\Assert\PreconditionException: Precondition failed: Cannot end a span that has not been started - https://phabricator.wikimedia.org/T389331 [12:35:54] yeah, the upgrade process after upgrading OpenJDK requires an immediate restart of puppetserver, I'm splaying these out a bit [12:36:10] doesn't help that it takes 100 seconds to restart either :-) [12:36:11] 06SRE, 06Data-Engineering, 10DPE-Mediawiki-Content, 10Dumps-Generation, 07Epic: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10651667 (10Marostegui) >>! In T368098#10651347, @BTullis wrote: > Just to follow up on this, we have confirmed that... 
[12:37:27] FIRING: [2x] SystemdUnitFailed: spiderpig-apiserver.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:38:19] RECOVERY - Hadoop NodeManager on an-worker1141 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:38:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [12:38:44] Deployment thumbor-main in thumbor at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor&var-deployment=thumbor-main - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplica [12:39:24] !log jnuche@deploy2002 kharlan, jnuche: Backport for [[gerrit:1129221|JobExecutor: Activate wrapping span (T389331)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:39:28] T389331: Wikimedia\Assert\PreconditionException: Precondition failed: Cannot end a span that has not been started - https://phabricator.wikimedia.org/T389331 [12:39:55] jnuche: if you need to do something in this window, I am done. [12:40:25] !log jnuche@deploy2002 kharlan, jnuche: Continuing with sync [12:40:55] (03CR) 10Reedy: [C:03+1] Switch the footer link to wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126548 (https://phabricator.wikimedia.org/T387573) (owner: 10Ladsgroup) [12:41:03] 10ops-eqiad, 06DC-Ops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10651677 (10ayounsi) Moving it over to DC-Ops for decom. [12:41:17] mvolz: ack, thx! 
sry, I saw the citoid window but thought it was safe to go ahead with the backport [12:42:00] oh no worries, i just saw you checked and wanted to make sure you knew it was fine :P [12:43:26] ack, ty :) [12:44:51] !log trunk the sandbox vlan to ganeti500X - T385560 [12:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:54] T385560: Create RIPE Atlas anchors VMs - https://phabricator.wikimedia.org/T385560 [12:45:34] (03PS1) 10Alexandros Kosiaris: Spiderpig: Declare the spiderpig group [puppet] - 10https://gerrit.wikimedia.org/r/1129239 (https://phabricator.wikimedia.org/T383945) [12:47:43] (03CR) 10CI reject: [V:04-1] Spiderpig: Declare the spiderpig group [puppet] - 10https://gerrit.wikimedia.org/r/1129239 (https://phabricator.wikimedia.org/T383945) (owner: 10Alexandros Kosiaris) [12:47:45] !log jnuche@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129221|JobExecutor: Activate wrapping span (T389331)]] (duration: 15m 06s) [12:47:49] T389331: Wikimedia\Assert\PreconditionException: Precondition failed: Cannot end a span that has not been started - https://phabricator.wikimedia.org/T389331 [12:48:48] (03CR) 10Lucas Werkmeister (WMDE): "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) (owner: 10Btullis) [12:50:22] train fix deployed, I'll rollout to group1 in the next couple mins [12:53:04] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129241 (https://phabricator.wikimedia.org/T386216) [12:53:06] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129241 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [12:53:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [12:56:27] not sure about these alerts ^ they're recovering before I can see them. possibly scale-up related, thumbor itself seems to be performing fine [12:56:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:59:45] (03Abandoned) 10Jaime Nuche: group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129241 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1300) [13:00:05] Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] present [13:00:13] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129244 (https://phabricator.wikimedia.org/T386216) [13:00:14] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129244 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [13:00:16] o/ [13:00:29] jouncebot: nowandnext [13:00:29] For the next 0 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1300) [13:00:29] In 0 hour(s) and 59 minute(s): codfw-> eqiad datacentre switchover: Mediawiki edition (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1400) [13:00:29] In 0 hour(s) and 59 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1400) [13:00:33] I can deploy (but I assume we wait for jnuche to finish rolling out group1 first) [13:00:42] be mindful that we have the switchover in 1h [13:00:46] ack [13:00:54] Lucas_WMDE: yeah, please hold on, will let you know [13:00:57] o/ [13:00:58] ok, thanks! [13:01:13] sorry for the spillover [13:01:13] I'll add a patch shortly [13:03:06] (03CR) 10Jaime Nuche: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129244 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [13:04:26] (03CR) 10Jaime Nuche: [V:03+2] group1 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129244 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [13:05:31] (03CR) 10Jaime Nuche: [V:03+2] "Zuul was failing to submit the gate jobs. 
Merged manually" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129244 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [13:05:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10651778 (10phaultfinder) [13:05:49] (03CR) 10Bartosz Dziewoński: [C:03+1] wikitech: Remove $wgCookieDomain override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129215 (https://phabricator.wikimedia.org/T389318) (owner: 10Gergő Tisza) [13:09:05] (03CR) 10Muehlenhoff: [C:03+1] "LGTM (sans the CI nitpicking)" [puppet] - 10https://gerrit.wikimedia.org/r/1129239 (https://phabricator.wikimedia.org/T383945) (owner: 10Alexandros Kosiaris) [13:10:50] jnuche: question, does rolling the train forward do a full image build or not? [13:12:00] !log installing sqlite3 security updates [13:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:24] claime: in most cases it doesn't, scap should detect when a full image build is needed (e.g. we are approaching the maximum number of container layers) and perform it then [13:12:54] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#10651808 (10ssingh) Thanks for filing this task! This is a good idea and we can do it under the work planned for the pdns-recursor 5.x upgrade, mentioned in T381608. [13:12:56] 06SRE, 10DNS, 06Traffic: Migrate PDNS recursor config to use /etc/powerdns/recursor.d ? - https://phabricator.wikimedia.org/T389333#10651810 (10ssingh) [13:12:57] 06SRE, 06Traffic: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#10651811 (10ssingh) [13:13:09] jnuche: ok but like, a change repo that is cloned in the Dockerfile would probably not get picked up? 
[13:13:17] a change to a repo* [13:14:23] hmm, it should, if the change generates a layer in the image, that delta will be picked up. It's just that the whole thing won't be rebuilt from scratch [13:15:02] hmmm [13:16:57] (03PS2) 10Tchanders: Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) [13:16:58] pinging dancy, he's our resident expert on that part of the code [13:18:09] (03CR) 10Tchanders: [C:04-2] "This should not be merged until abusefilter-helper and abusefilter-maintainer have been renamed to global-abusefilter-helper and global-ab" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [13:18:19] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.21 refs T386216 [13:18:23] T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216 [13:18:40] 06SRE, 10Thumbor: Thumbnail failures on some SVGs - https://phabricator.wikimedia.org/T389060#10651841 (10ssingh) p:05Triage→03Medium [13:19:22] 06SRE, 13Patch-For-Review: pywikipedia.org is using wildcard cert for production projects - https://phabricator.wikimedia.org/T388809#10651844 (10ssingh) a:03BCornwall [13:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10651845 (10phaultfinder) [13:20:10] train is on group1, give me a couple of minutes to verify things look healthy [13:21:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129215 (https://phabricator.wikimedia.org/T389318) (owner: 10Gergő Tisza) [13:23:42] 06SRE, 06Traffic: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - 
https://phabricator.wikimedia.org/T381608#10651857 (10MoritzMuehlenhoff) Just a note: Debian trixie will be released in June or July and as for past releases we'll most certainly have the base layer... [13:23:55] things look stable, done with the train for now [13:24:00] Lucas_WMDE: over to you, thanks for waiting [13:24:05] great, thanks! [13:24:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129204 (owner: 10Jon Harald Søby) [13:24:47] let’s hope we have time for both that and tgr_’s wikitech fix [13:25:31] (03PS1) 10Gergő Tisza: Allowlist Special:WikimediaDebug on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129255 [13:25:42] (03PS1) 10Gergő Tisza: Allowlist Special:WikimediaDebug on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129256 [13:25:57] (03Merged) 10jenkins-bot: Revert^4 "Add Portal namespace to kaawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129204 (owner: 10Jon Harald Søby) [13:26:22] (03PS1) 10Gergő Tisza: Fix SUL3 login cohort logic [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129257 (https://phabricator.wikimedia.org/T384215) [13:26:23] claime: maybe the missing context is that the main purpose of a full rebuild is to flatten image layers. 
But a normal rebuild will also include all changes to the repos [13:26:25] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1129204|Revert^4 "Add Portal namespace to kaawiki"]] [13:26:33] (03PS1) 10Gergő Tisza: Fix SUL3 login cohort logic [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129258 (https://phabricator.wikimedia.org/T384215) [13:27:13] 06SRE, 06Traffic: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) - https://phabricator.wikimedia.org/T381608#10651869 (10ssingh) >>! In T381608#10651857, @MoritzMuehlenhoff wrote: > Just a note: Debian trixie will be released in June or July and as for past releases... [13:27:25] jnuche: looks like something's not working properly then, but I'm passing message, brouberol or btullis will come see you about it [13:27:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129255 (owner: 10Gergő Tisza) [13:27:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129256 (owner: 10Gergő Tisza) [13:27:49] oh sorry,. 
I just asked in #-releng [13:28:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129257 (https://phabricator.wikimedia.org/T384215) (owner: 10Gergő Tisza) [13:28:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129258 (https://phabricator.wikimedia.org/T384215) (owner: 10Gergő Tisza) [13:28:52] Lucas_WMDE: so I have a big bundle of patches but they can all go in a single scap [13:29:02] we should still have time for that, right? [13:29:16] if so, I'll start merging the extension backports [13:29:28] tgr_: I think so, yeah [13:29:43] though the sync-testservers-k8s feels slightly slow right now… [13:29:45] let’s hope for the best [13:30:08] but yeah you can go ahead with +2ing the backports imho [13:30:09] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:30:09] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:30:19] (03CR) 10Gergő Tisza: [C:03+2] Allowlist Special:WikimediaDebug on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129255 (owner: 10Gergő Tisza) [13:30:25] (03CR) 10Gergő Tisza: [C:03+2] Allowlist Special:WikimediaDebug on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129256 (owner: 10Gergő Tisza) [13:30:30] (03CR) 10Gergő Tisza: [C:03+2] Fix SUL3 login cohort logic [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 
10https://gerrit.wikimedia.org/r/1129257 (https://phabricator.wikimedia.org/T384215) (owner: 10Gergő Tisza) [13:30:34] (03CR) 10Gergő Tisza: [C:03+2] Fix SUL3 login cohort logic [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129258 (https://phabricator.wikimedia.org/T384215) (owner: 10Gergő Tisza) [13:30:59] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53656 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:30:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:31:22] Jhs: I think you can already start testing the change on WikimediaDebug [13:31:38] jouncebot: next [13:31:38] In 0 hour(s) and 28 minute(s): codfw-> eqiad datacentre switchover: Mediawiki edition (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1400) [13:31:38] In 0 hour(s) and 28 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1400) [13:32:02] \o/ [13:32:05] Lucas_WMDE, not seeing any changes yet on mwdebug1001 [13:32:19] reminder to be wary of the switchover and please be clear by 1400 [13:32:36] Jhs: they should be live on k8s-mwdebug [13:32:43] Lucas_WMDE, yeah, now i see them [13:32:48] (and by now also on mwdebug1001 I think) [13:32:49] ok [13:33:01] !log lucaswerkmeister-wmde@deploy2002 jhsoby, lucaswerkmeister-wmde: Backport for [[gerrit:1129204|Revert^4 "Add Portal namespace to kaawiki"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:33:04] !log lucaswerkmeister-wmde@deploy2002 jhsoby, lucaswerkmeister-wmde: Continuing with sync [13:35:50] (03PS1) 10Muehlenhoff: Switch ganeti5004 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129261 [13:35:58] sync-prod-k8s at 33% atm [13:36:29] (03PS1) 10Muehlenhoff: Switch ganeti5005 to 
nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129262 [13:37:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: disk (sdb) failed in moss-be2002 - https://phabricator.wikimedia.org/T389236#10651912 (10Jhancock.wm) @MatthewVernon found a replacement. let us know if that cleared it up for you. [13:38:23] (03Merged) 10jenkins-bot: Allowlist Special:WikimediaDebug on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129255 (owner: 10Gergő Tisza) [13:38:32] (03PS1) 10Muehlenhoff: Switch ganeti5006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1129263 [13:38:45] (03Merged) 10jenkins-bot: Allowlist Special:WikimediaDebug on the shared domain [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129256 (owner: 10Gergő Tisza) [13:38:46] (03Merged) 10jenkins-bot: Fix SUL3 login cohort logic [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129257 (https://phabricator.wikimedia.org/T384215) (owner: 10Gergő Tisza) [13:38:59] (03PS1) 10Bking: Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) [13:39:25] (03CR) 10CI reject: [V:04-1] Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:39:35] 91% – moment of truth… [13:39:41] will it make it past 94%, place your bets ;) [13:40:07] (03CR) 10Fabfur: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (owner: 10Fabfur) [13:40:27] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129204|Revert^4 "Add Portal namespace to kaawiki"]] (duration: 14m 01s) [13:40:44] \o/ [13:40:51] {◕ ◡ ◕} [13:40:58] now running the maint scripts [13:41:19] !log lucaswerkmeister-wmde@deploy2002 /srv/mediawiki-staging (master $ u=) $ mwscript-k8s 
--comment=T388158 --follow -- namespaceDupes kaawiki --fix | tee ~/T388158 [13:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:22] T388158: Create Portal namespace on kaa.wikipedia - https://phabricator.wikimedia.org/T388158 [13:41:25] (03Merged) 10jenkins-bot: Fix SUL3 login cohort logic [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129258 (https://phabricator.wikimedia.org/T384215) (owner: 10Gergő Tisza) [13:41:43] (love to see mwscript-k8s being used btw) [13:41:48] :) [13:42:03] hm, the dry run worked fine, the real run is now “waiting for the container to start...”… [13:42:19] also it's now got a --php_version switch that allows you to select 7.4 or 8.1 [13:42:23] so that's 14 mins for the scap? narrow but one more will fit [13:42:30] no [13:42:34] seriously [13:42:40] claime: nice [13:42:48] tgr_: yup, 14m01s according to scap [13:42:54] next time I'm cancelling the backports window before the switchover [13:42:56] do not run another sync-world please [13:43:18] then I guess those patches need to be reverted :/ [13:43:20] okay, I'll revert the patches then [13:43:22] (several merged already afaict) [13:43:25] ok maint script finished [13:43:44] but we really need to discuss how backports are handled, this is dysfunctional [13:43:51] This happens twice a year [13:44:34] we should schedule the backport windows differently then [13:44:36] and is fairly mission critical. We'll be cancelling backport windows in future to avoid something similar happening [13:44:48] I was thinking earlier, it was certainly less stressful when the datacenter switches were no-deploy weeks [13:44:56] Lucas_WMDE: yes. 
[13:44:59] canceling those backport windows sounds reasonable to me [13:45:04] Lucas_WMDE, everything looks good for kaawiki now [13:45:04] We should just make them no-deploy days [13:45:07] instead of weeks [13:45:10] !log UTC afternoon backport+config window done [13:45:11] Jhs: \o/ [13:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:48] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:46:14] (but I’m also glad we finally got that kaawiki patch through) [13:46:29] and I hope the wikitech fixes can happen later today rather than next week [13:46:34] (03PS1) 10Gergő Tisza: Revert "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129265 [13:46:35] (03PS1) 10Gergő Tisza: Revert "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129266 [13:46:37] (03PS1) 10Gergő Tisza: Revert "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129267 [13:46:39] (03PS1) 10Gergő Tisza: Revert "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129268 [13:47:27] 06SRE: Unable to deploy config changes due to timeout - https://phabricator.wikimedia.org/T389203#10651944 (10Lucas_Werkmeister_WMDE) 05Open→03Resolved Yeah I think this got resolved one way or another. Let’s close. [13:47:31] (03CR) 10Herron: "Thanks Alex! 
Planning to pair up with traffic to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128937 this afternoon and b" [dns] - 10https://gerrit.wikimedia.org/r/1126180 (https://phabricator.wikimedia.org/T345894) (owner: 10Dzahn) [13:48:10] (03PS2) 10Bking: Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) [13:48:28] (03CR) 10Majavah: "This seems to have broken Puppet on the deployment servers due to an useradd error:" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [13:48:32] (03CR) 10CI reject: [V:04-1] Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:48:44] !log hnowlan@deploy2002 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover - T385155 [13:48:48] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [13:50:41] 10ops-codfw, 06SRE, 10SRE-swift-storage, 10Ceph, 06DC-Ops: disk (sdb) failed in moss-be2002 - https://phabricator.wikimedia.org/T389236#10651951 (10MatthewVernon) 05Open→03Resolved All looks good, thanks :) ` root@moss-be2001:/# ceph -s cluster: id: 59ea825c-2a67-11ef-9c1c-bc97e1bbace4... 
[13:50:55] (03PS1) 10Hnowlan: Revert "profile::scap::spiderpig: New profile for setting up SpiderPig" [puppet] - 10https://gerrit.wikimedia.org/r/1129269 [13:51:09] (03CR) 10Gergő Tisza: [C:03+2] Revert "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129265 (owner: 10Gergő Tisza) [13:51:15] (03CR) 10Gergő Tisza: [C:03+2] Revert "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129266 (owner: 10Gergő Tisza) [13:51:19] (03CR) 10Gergő Tisza: [C:03+2] Revert "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129267 (owner: 10Gergő Tisza) [13:51:25] (03CR) 10Gergő Tisza: [C:03+2] Revert "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129268 (owner: 10Gergő Tisza) [13:51:58] (03CR) 10Clément Goubert: [C:03+1] Revert "profile::scap::spiderpig: New profile for setting up SpiderPig" [puppet] - 10https://gerrit.wikimedia.org/r/1129269 (owner: 10Hnowlan) [13:52:09] (03PS2) 10Hnowlan: Revert "profile::scap::spiderpig: New profile for setting up SpiderPig" [puppet] - 10https://gerrit.wikimedia.org/r/1129269 [13:53:03] (03PS3) 10Bking: Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) [13:53:25] (03CR) 10CI reject: [V:04-1] Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:53:36] actually, I might need to run one more maintenance script for kaawiki [13:53:42] (cleanupTitles) [13:53:54] Lucas_WMDE: can it wait? 
[13:53:54] (03CR) 10CI reject: [V:04-1] Revert "profile::scap::spiderpig: New profile for setting up SpiderPig" [puppet] - 10https://gerrit.wikimedia.org/r/1129269 (owner: 10Hnowlan) [13:53:55] claime, hnowlan: is that okay? [13:54:07] depends on your point of view, I guess [13:54:17] the status quo without the script is that a handful of pages became unreadable [13:54:24] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Revert "profile::scap::spiderpig: New profile for setting up SpiderPig" [puppet] - 10https://gerrit.wikimedia.org/r/1129269 (owner: 10Hnowlan) [13:54:25] *inaccessible [13:54:30] ok how long should it take? [13:54:33] https://kaa.wikipedia.org/wiki/Talk:Portal:Feminizm?uselang=en is one, [13:54:41] the dry run finished in a minute or two [13:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10651978 (10phaultfinder) [13:54:47] I expect the non-dry run should run just as fast [13:54:49] (03PS4) 10Bking: Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) [13:54:50] hnowlan: ^ heads up [13:54:57] One maintenance script needed to finish up [13:55:18] (03PS1) 10Phuedx: ext-EventStreamConfig: Reduce product_metrics.web_base data collection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129270 [13:55:19] as long as you can run it immediately, go on [13:55:21] ok [13:55:24] !log lucaswerkmeister-wmde@deploy2002 /srv/mediawiki-staging (master $ u=) $ mwscript-k8s --follow --comment=T388158 -- cleanupTitles kaawiki [13:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:28] T388158: Create Portal namespace on kaa.wikipedia - https://phabricator.wikimedia.org/T388158 [13:55:31] done [13:55:40] let us know when it completes [13:55:52] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] "Bypassed CI manually and removed jenkins vote on purpose to allow the revert to proceed. 
It was complaining about line length anyway." [puppet] - 10https://gerrit.wikimedia.org/r/1129269 (owner: 10Hnowlan) [13:55:59] it already finished [13:56:09] the non-dry run went even faster – i guess the db was warmed up? ^^ [13:56:17] phew [13:56:19] thanks [13:56:35] (03CR) 10Alexandros Kosiaris: [C:03+2] "Had to revert to avoid messing with the DC switchover as it turns out it needed a followup" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [13:56:58] we will be starting imminently [13:57:18] oh ffs, now https://kaa.wikipedia.org/wiki/Portal_talq%C4%B1law%C4%B1:Feminizm is throwing a different exception [13:57:23] I think kaawiki will just have to live with that [13:57:26] until the switch is over [13:57:29] definitely too late now [13:57:33] for at least an hour or so yeah [13:57:52] I’ll go look for it in logstash in the meantime [13:57:59] Switchover coordination will be in -sre [13:58:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [13:58:46] (03PS5) 10Bking: Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) [13:58:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [13:59:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [13:59:09] brouberol: I'm assuming dumps won't risk much but heads-up as regards switchover [13:59:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:00:05] hnowlan and jasmine_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) codfw-> eqiad datacentre switchover: MediaWiki edition deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1400). 
[14:00:21] (03PS2) 10Hnowlan: geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1127069 (https://phabricator.wikimedia.org/T385155) [14:00:25] (03PS3) 10Hnowlan: wmnet: update CNAME record for maintenance host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127068 (https://phabricator.wikimedia.org/T385155) [14:00:28] (03Merged) 10jenkins-bot: Revert "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129265 (owner: 10Gergő Tisza) [14:00:29] (03Merged) 10jenkins-bot: Revert "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129266 (owner: 10Gergő Tisza) [14:00:29] (03PS3) 10Hnowlan: wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) [14:00:30] (03Merged) 10jenkins-bot: Revert "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129267 (owner: 10Gergő Tisza) [14:00:31] (03Merged) 10jenkins-bot: Revert "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129268 (owner: 10Gergő Tisza) [14:01:11] FYI the “incorrectly specified talk page” in logspam-watch is due to the kaawiki stuff, no need to worry about that during the switchover [14:02:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) (owner: 10Btullis) [14:03:24] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-disable-puppet for datacenter switchover from codfw to eqiad [14:03:27] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) for datacenter switchover from codfw to eqiad [14:03:32] !log hnowlan@cumin2002 START - Cookbook 
sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from codfw to eqiad [14:03:52] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-downtime-db-readonly-checks (exit_code=0) for datacenter switchover from codfw to eqiad [14:04:06] (03CR) 10DCausse: [C:03+1] Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:04:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10652017 (10phaultfinder) [14:06:05] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from codfw to eqiad [14:07:02] (03CR) 10Elukey: [C:03+1] osm_replica: Fix Hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/1128891 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:07:26] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:08:29] o/ [14:09:44] (03PS3) 10Btullis: Allow analytice-wmde-users limited journalctl access to their units [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) [14:11:56] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) for datacenter switchover from codfw to eqiad [14:12:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:12:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:12:53] (03CR) 10Ebernhardson: [C:03+1] "a bit more commit message might be nice, but the bug number is good enough" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [14:12:54] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from codfw to eqiad [14:13:06] !log 
hnowlan@cumin2002 END (FAIL) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=99) for datacenter switchover from codfw to eqiad [14:13:16] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10652054 (10jhathaway) >>! In T374711#10650455, @fgiunchedi wrote: > There's two parts to keyholder, `-proxy` and `-auth`. You are correct the latter requ... [14:14:02] brouberol: it's probably best if you pause deployments during the DC Switchover. Yours are probably technically not gonna have anything to do with it, but in case things turn sour, we all probably don't want to even remotely ponder whether it's related. [14:14:03] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from codfw to eqiad [14:14:18] akosiaris: understood! [14:14:31] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad [14:14:32] sorry about that [14:14:48] ack [14:14:59] (03CR) 10Herron: [C:03+1] logstash: move filter_truncate before indexing/output [puppet] - 10https://gerrit.wikimedia.org/r/1129128 (https://phabricator.wikimedia.org/T389072) (owner: 10Filippo Giunchedi) [14:15:18] (03CR) 10Lucas Werkmeister (WMDE): Allow analytice-wmde-users limited journalctl access to their units (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) (owner: 10Btullis) [14:15:31] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from codfw to eqiad [14:15:31] !log hnowlan@cumin2002 MediaWiki read-only period starts at: 2025-03-19 14:15:30.955779 [14:15:50] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) for datacenter switchover from codfw to eqiad [14:15:52] hnowlan@cumin2002: Failed to 
log message to wiki. Somebody should check the error logs. [14:15:55] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from codfw to eqiad [14:15:56] hnowlan@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [14:16:17] expected ^ [14:16:25] (03PS4) 10Btullis: Allow analytice-wmde-users limited journalctl access to their units [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) [14:16:30] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) for datacenter switchover from codfw to eqiad [14:16:32] hnowlan@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [14:16:36] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from codfw to eqiad [14:16:37] hnowlan@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [14:17:27] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) for datacenter switchover from codfw to eqiad [14:17:28] hnowlan@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [14:17:31] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from codfw to eqiad [14:17:32] hnowlan@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [14:17:35] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad [14:17:36] hnowlan@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [14:17:43] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from codfw to eqiad [14:17:44] hnowlan@cumin2002: Failed to log message to wiki. Somebody should check the error logs. 
[14:17:55] !log hnowlan@cumin2002 MediaWiki read-only period ends at: 2025-03-19 14:17:55.451583 [14:17:57] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) for datacenter switchover from codfw to eqiad [14:18:07] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from codfw to eqiad [14:18:08] !log root@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: sync [14:18:36] !log root@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: sync [14:18:39] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.08-restart-mw-jobrunner (exit_code=0) for datacenter switchover from codfw to eqiad [14:19:08] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from codfw to eqiad [14:19:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:19:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10652115 (10phaultfinder) [14:19:45] (03CR) 10Bearloga: "I think the two other contextual attributes that we should collect in the web base stream are:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129270 (owner: 10Phuedx) [14:21:32] !log root@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [14:21:36] !log root@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [14:21:38] !log root@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:21:43] !log root@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:21:48] !log hnowlan@cumin2002 END 
(PASS) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=0) for datacenter switchover from codfw to eqiad [14:22:23] (03CR) 10Bearloga: "I'm also on the fence about performer_session_id and performer_active_browsing_session_token" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129270 (owner: 10Phuedx) [14:22:57] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from codfw to eqiad [14:23:38] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-restore-ttl (exit_code=0) for datacenter switchover from codfw to eqiad [14:24:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:25:16] (03CR) 10Hnowlan: [C:03+2] wmnet: update CNAME records for DB masters to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127067 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:25:46] (03PS1) 10Brouberol: mediawiki-dumps-legacy: add missing values files in the helmfile values list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129279 (https://phabricator.wikimedia.org/T388378) [14:25:47] (03PS1) 10Brouberol: mediawiki-dumps-legacy: add missing network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129280 (https://phabricator.wikimedia.org/T388378) [14:25:48] FIRING: [2x] PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:25:49] (03PS1) 10Brouberol: mediwiki-dumps-legacy: add missing configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129281 (https://phabricator.wikimedia.org/T388378) [14:25:53] (03PS5) 10Btullis: Allow analytice-wmde-users limited journalctl access to their 
units [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) [14:25:58] !log hnowlan@dns1004 START - running authdns-update [14:25:59] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2089.codfw.wmnet with OS bullseye [14:26:17] (03CR) 10Btullis: Allow analytice-wmde-users limited journalctl access to their units (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) (owner: 10Btullis) [14:26:34] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10652163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2089.codfw.wmn... [14:28:12] !log hnowlan@dns1004 END - running authdns-update [14:28:55] (03CR) 10Btullis: [C:03+1] mediwiki-dumps-legacy: add missing configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129281 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:29:02] (03PS4) 10Muehlenhoff: Rebuild against latest package versions in bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1117896 [14:29:58] !log hnowlan@cumin2002 START - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from codfw to eqiad [14:30:21] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "thanks \o/" [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) (owner: 10Btullis) [14:32:32] (03CR) 10Btullis: "I don't see etcd in this list, which is the only one we know that we need, at the moment." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1129280 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:33:00] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 24, active_shards: 44, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 1, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.61702127659575 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:33:18] RECOVERY - OpenSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, discovered_master: True, active_primary_shards: 24, active_shards: 44, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 1, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 93.61702127659575 https://wikitech.wikimedia.org/wiki/Search%23Administration [14:35:11] (03Abandoned) 10Brouberol: mediawiki-dumps-legacy: add missing network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129280 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:35:40] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:35:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:35:51] (03PS1) 10Bking: opensearch: fix logic for creating sudachi symlink [puppet] - 10https://gerrit.wikimedia.org/r/1129284 (https://phabricator.wikimedia.org/T386868) [14:36:08] (03PS1) 10Gkyziridis: inference-services: edit-check GPU version deployment on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) [14:36:14] (03CR) 10Btullis: "I'm not sure that this is better than just adding the single value we need." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129279 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:37:44] (03CR) 10Brouberol: "I'm not sure, but where are the `global.yaml` values coming from in the first place? Should we trim these values as well?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129279 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:37:55] (03CR) 10DCausse: [C:03+1] opensearch: fix logic for creating sudachi symlink [puppet] - 10https://gerrit.wikimedia.org/r/1129284 (https://phabricator.wikimedia.org/T386868) (owner: 10Bking) [14:39:53] (03CR) 10Bking: [C:03+2] opensearch: fix logic for creating sudachi symlink [puppet] - 10https://gerrit.wikimedia.org/r/1129284 (https://phabricator.wikimedia.org/T386868) (owner: 10Bking) [14:40:37] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.switchdc.mediawiki.09-run-puppet-on-db-masters (exit_code=0) for datacenter switchover from codfw to eqiad [14:41:24] !log hnowlan@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover - T385155 (duration: 52m 40s) [14:41:28] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [14:41:52] PROBLEM - Host lsw1-d3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:41:58] (03PS4) 10Hnowlan: wmnet: update CNAME record for maintenance host to eqiad 
[dns] - 10https://gerrit.wikimedia.org/r/1127068 (https://phabricator.wikimedia.org/T385155) [14:42:10] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:42:23] did we just lose d3-codfw? [14:44:01] (03CR) 10Hnowlan: [C:03+2] wmnet: update CNAME record for maintenance host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1127068 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:44:20] !log hnowlan@dns1004 START - running authdns-update [14:44:26] (03PS2) 10Brouberol: mediwiki-dumps-legacy: add missing configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129281 (https://phabricator.wikimedia.org/T388378) [14:44:26] (03PS1) 10Brouberol: mediawiki-legacy-dumps: only enable egress to etcd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 (https://phabricator.wikimedia.org/T388378) [14:44:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10652244 (10phaultfinder) [14:45:09] (03CR) 10Brouberol: "I've submitted https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1129287/1?usp=related-change to cleanup the networkpolicies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129279 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:45:25] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/1129288 (https://phabricator.wikimedia.org/T389367) [14:46:23] claime: I think lsw1-d3-codfw flapped a bit yesterday also [14:46:47] I cannot connect to lsw1-d3-codfw.mgmt.codfw.wmnet [14:46:51] (03PS1) 10Alexandros Kosiaris: Revert^2 "profile::scap::spiderpig: New profile for setting up SpiderPig" [puppet] - 10https://gerrit.wikimedia.org/r/1129289 [14:47:02] looks hard down [14:47:14] we lost the ps1 too afaics [14:47:27] lemme ping dcops [14:47:30] !log hnowlan@dns1004 END - running authdns-update [14:47:47] 10ops-drmrs, 06Infrastructure-Foundations, 
10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10652276 (10RobH) I saw your reply and was about to ping in IRC to thank you for discussing in French with them directly. My fear is there is a language barrier and perhaps... [14:48:19] claime: Papaul is working on it, all good [14:48:31] cool <3 [14:48:39] (03CR) 10Btullis: [C:03+2] Allow analytice-wmde-users limited journalctl access to their units [puppet] - 10https://gerrit.wikimedia.org/r/1129228 (https://phabricator.wikimedia.org/T387514) (owner: 10Btullis) [14:48:46] RECOVERY - Host ps1-d3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.11 ms [14:49:06] RECOVERY - Host lsw1-d3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.87 ms [14:49:16] (03PS3) 10Hnowlan: geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1127069 (https://phabricator.wikimedia.org/T385155) [14:49:40] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:50:03] try to call me [14:51:46] !log hnowlan@dns1004 START - running authdns-update [14:51:55] (03CR) 10Hnowlan: [C:03+2] geo-maps: update map default to list eqiad first [dns] - 10https://gerrit.wikimedia.org/r/1127069 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:52:08] (03CR) 10Btullis: mediawiki-legacy-dumps: only enable egress to etcd (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [14:52:41] (03CR) 10Cathal Mooney: "Overall LGTM. One comment in-line about the query range, and how fast we want to react to an increase in usage. 
In terms of the differen" [alerts] - 10https://gerrit.wikimedia.org/r/1128429 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [14:53:35] (03CR) 10Cathal Mooney: [C:03+1] network/data.yaml: add sandbox1-b3-magru [puppet] - 10https://gerrit.wikimedia.org/r/1129219 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [14:53:46] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:53:54] PROBLEM - Host lsw1-d3-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:54:11] (03PS1) 10Muehlenhoff: CAS: Add service definition for spiderpig [puppet] - 10https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) [14:55:06] !log hnowlan@dns1004 END - running authdns-update [14:55:27] (03CR) 10Hnowlan: [C:03+2] debug: reorder debug backends for eqiad switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127072 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:56:21] (03Merged) 10jenkins-bot: debug: reorder debug backends for eqiad switchover [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127072 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [14:56:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [14:56:33] jouncebot: nowandnext [14:56:33] For the next 1 hour(s) and 3 minute(s): codfw-> eqiad datacentre switchover: MediaWiki edition (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1400) [14:56:34] In 2 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1700) [14:56:38] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#10652441 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2075.codf... 
[14:56:49] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2075 [14:57:28] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [14:59:33] !log shutdown sessions to SGIX RS - T386987 [14:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:06] (03PS2) 10Brouberol: mediawiki-legacy-dumps: only enable egress to etcd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 [15:00:06] (03PS3) 10Brouberol: mediwiki-dumps-legacy: add missing configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129281 (https://phabricator.wikimedia.org/T388378) [15:00:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10652482 (10phaultfinder) [15:00:58] (03CR) 10Brouberol: mediawiki-legacy-dumps: only enable egress to etcd (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 (owner: 10Brouberol) [15:01:57] tgr_: I am seeing a bunch of submodule updates when I run scap backport - are those your rollbacks or should I not deploy? [15:02:35] hnowlan: the originals + rollbacks, in theory? [15:02:39] the diff should be empty [15:04:00] (03CR) 10Slyngshede: [C:03+1] "Looks good. Remember to add secret to private and private-dummy repos." [puppet] - 10https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) (owner: 10Muehlenhoff) [15:04:12] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2203 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1129293 (https://phabricator.wikimedia.org/T389373) [15:04:20] 07Puppet, 06SRE: puppet error at the end of the run on prometheus2008: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null - https://phabricator.wikimedia.org/T388629#10652513 (10MoritzMuehlenhoff) >>! In T388629#10648788, @jhathaway wrote: > Un... 
[15:04:38] I guess we should have done a git pull to leave things in a clean state [15:04:51] (03CR) 10Slyngshede: [C:03+1] "This is supposed to run the CAS protocol, correct?" [puppet] - 10https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) (owner: 10Muehlenhoff) [15:06:01] (03CR) 10Muehlenhoff: "Yes, the Spiderpig app uses a native CAS client, so we don't need to commit any additional secrets" [puppet] - 10https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) (owner: 10Muehlenhoff) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:52] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2075 - mvernon@cumin2002" [15:06:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2075 - mvernon@cumin2002" [15:06:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:06:57] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2075.codfw.wmnet 147.0.192.10.in-addr.arpa 7.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:07:01] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2075.codfw.wmnet 147.0.192.10.in-addr.arpa 7.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:07:02] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2075 [15:07:07] tgr_: it's a little hard to read tbh, I see 4 changes listed, all adjusting the subproject commit. 
no other changes in the diff [15:07:20] FIRING: CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=59&orgId=1&from=now-6M&to=now&var-search_cluster=cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateT [15:08:10] hnowlan: there were 4 patches to CentralAuth (2 to wmf.20 and 2 to wmf.21) and then reverts for those four [15:08:25] so there should be two empty submodule patches [15:08:44] each with 4 changes (half of which are reverts) [15:09:10] I think you can just go into the repo in a separate terminal window and inspect the patches? [15:09:18] the submodule patches I mean [15:09:26] ...or I can, just a sec [15:09:37] (03PS3) 10Brouberol: mediawiki-legacy-dumps: only enable egress to etcd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 [15:09:37] (03PS4) 10Brouberol: mediwiki-dumps-legacy: add missing configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129281 (https://phabricator.wikimedia.org/T388378) [15:10:20] hm apparently not [15:10:28] or is it deploy1002 now? 
[15:10:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10652550 (10phaultfinder) [15:11:25] not yet, that switches tomorrow [15:11:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2075 [15:11:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2075 [15:13:14] in that case no, the submodule patch is not there yet [15:13:24] (03CR) 10Btullis: [C:03+1] mediawiki-legacy-dumps: only enable egress to etcd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 (owner: 10Brouberol) [15:13:54] maybe you can just continue to the testserver sync, and at that point the changes will definitely be inspectable [15:14:14] (03PS4) 10Brouberol: mediawiki-legacy-dumps: only enable egress to etcd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 [15:14:14] (03PS5) 10Brouberol: mediwiki-dumps-legacy: add missing configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129281 (https://phabricator.wikimedia.org/T388378) [15:15:03] but unless backport during switchover is working differently, what scap deploys would match the tip of the git master, and there all non-deployed changes have been reverted [15:15:12] so I think it should be fine to proceed anyway [15:16:17] okay, grand [15:16:22] (03PS5) 10Brouberol: mediawiki-legacy-dumps: only enable egress to etcd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 [15:16:22] (03PS6) 10Brouberol: mediwiki-dumps-legacy: add missing configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129281 (https://phabricator.wikimedia.org/T388378) [15:16:22] (03PS1) 10Brouberol: mediawiki-legacy: disable many features enabled by default in global(-eqiad) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129295 [15:17:23] (03PS4) 10Fabfur: WIP: testing for custom cert path for acmecerts and unified [puppet] 
- 10https://gerrit.wikimedia.org/r/1129223 [15:17:24] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: add missing values files in the helmfile values list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129279 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:17:29] (03CR) 10Dzahn: [C:03+2] phabricator weekly changes email: Also exclude Security's Vuln-* tags [puppet] - 10https://gerrit.wikimedia.org/r/1128005 (https://phabricator.wikimedia.org/T387508) (owner: 10Aklapper) [15:17:41] (03CR) 10Btullis: [C:03+1] mediawiki-legacy-dumps: only enable egress to etcd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 (owner: 10Brouberol) [15:18:04] !log hnowlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1127072|debug: reorder debug backends for eqiad switchover (T385155)]] [15:18:09] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [15:18:31] (03CR) 10Btullis: [C:03+1] mediawiki-legacy: disable many features enabled by default in global(-eqiad) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129295 (owner: 10Brouberol) [15:19:46] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: add missing values files in the helmfile values list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129279 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:19:49] (03CR) 10Brouberol: [C:03+2] mediawiki-legacy-dumps: only enable egress to etcd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 (owner: 10Brouberol) [15:19:51] (03CR) 10Brouberol: [C:03+2] mediawiki-legacy: disable many features enabled by default in global(-eqiad) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129295 (owner: 10Brouberol) [15:19:52] (03CR) 10Fabfur: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (owner: 10Fabfur) [15:19:54] (03CR) 10Brouberol: [C:03+2] mediwiki-dumps-legacy: add missing configuration templates 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1129281 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:20:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10652598 (10phaultfinder) [15:21:14] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: add missing values files in the helmfile values list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129279 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:21:19] (03Merged) 10jenkins-bot: mediawiki-legacy-dumps: only enable egress to etcd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129287 (owner: 10Brouberol) [15:21:20] (03Merged) 10jenkins-bot: mediawiki-legacy: disable many features enabled by default in global(-eqiad) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129295 (owner: 10Brouberol) [15:22:11] (03Merged) 10jenkins-bot: mediwiki-dumps-legacy: add missing configuration templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129281 (https://phabricator.wikimedia.org/T388378) (owner: 10Brouberol) [15:22:21] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [15:22:42] (verified that both deploy branches are in the correct state, FWIW) [15:22:42] (03CR) 10Elukey: [C:03+2] role::ml_k8s::worker: move ml-serve2001 to containerd [puppet] - 10https://gerrit.wikimedia.org/r/1128463 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [15:23:33] !log hnowlan@deploy2002 hnowlan: Backport for [[gerrit:1127072|debug: reorder debug backends for eqiad switchover (T385155)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:23:33] thanks tgr_ [15:23:36] RECOVERY - Host ps1-d3-codfw is UP: PING OK - 
Packet loss = 0%, RTA = 30.93 ms [15:23:38] RECOVERY - Host lsw1-d3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.85 ms [15:23:40] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [15:24:00] sorry for the confusion [15:24:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 681497520 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:24:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:24:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:24:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:25:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:25:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 169504 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:25:27] !log hnowlan@deploy2002 Sync cancelled. 
[15:25:54] PROBLEM - Restbase root url on restbase1028 is CRITICAL: connect to address 10.64.0.208 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [15:26:13] (03PS1) 10Hnowlan: debug: fix config syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129296 (https://phabricator.wikimedia.org/T385155) [15:27:31] (03PS1) 10Hnowlan: Revert "debug: reorder debug backends for eqiad switchover" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129297 [15:27:38] (03CR) 10CI reject: [V:04-1] debug: fix config syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129296 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [15:29:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage [15:30:45] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10652639 (10phaultfinder) [15:30:46] 06SRE, 06serviceops, 10Wikidata, 10Wikidata Integration in Wikimedia projects, 10Wikimedia-Site-requests: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10652638 (10Ladsgroup) Hii, from DBA point of view, I ask that please hold off bumping the number in co... 
[15:31:06] (03CR) 10Giuseppe Lavagetto: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129296 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [15:31:14] (03CR) 10Giuseppe Lavagetto: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129296 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [15:31:33] !log hnowlan@deploy2002 Locking from deployment [ALL REPOSITORIES]: Switchover followup [15:31:40] (03PS2) 10Slyngshede: Release version 0.1.1 [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1115334 [15:32:20] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [15:33:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2075.codfw.wmnet with reason: host reimage [15:34:21] !log hnowlan@deploy2002 Unlocked for deployment [ALL REPOSITORIES]: Switchover followup (duration: 02m 47s) [15:34:21] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1129298 (https://phabricator.wikimedia.org/T389376) [15:34:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hnowlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129296 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [15:35:14] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve2001.codfw.wmnet with OS bookworm [15:35:21] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2209 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1129299 (https://phabricator.wikimedia.org/T389377) [15:35:28] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve2001 [15:35:58] (03Merged) 10jenkins-bot: 
debug: fix config syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129296 (https://phabricator.wikimedia.org/T385155) (owner: 10Hnowlan) [15:36:28] !log hnowlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1129296|debug: fix config syntax (T385155)]] [15:36:31] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [15:36:32] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:35] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2240 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1129301 (https://phabricator.wikimedia.org/T389378) [15:40:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10652736 (10phaultfinder) [15:40:41] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2001 - elukey@cumin1002" [15:40:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ml-serve2001 - elukey@cumin1002" [15:40:47] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:47] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache ml-serve2001.codfw.wmnet 21.0.192.10.in-addr.arpa 1.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:40:50] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ml-serve2001.codfw.wmnet 21.0.192.10.in-addr.arpa 1.2.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors 
[15:40:51] !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2001 [15:41:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2001 [15:41:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve2001 [15:41:32] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1115334 (owner: 10Slyngshede) [15:41:33] !log hnowlan@deploy2002 hnowlan: Backport for [[gerrit:1129296|debug: fix config syntax (T385155)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:41:36] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [15:46:04] !log hnowlan@deploy2002 hnowlan: Continuing with sync [15:47:14] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 615869120 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:48:12] FIRING: [5x] JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:48:14] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 23400 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:48:44] (03PS5) 10Fabfur: WIP: testing for custom cert path for acmecerts and unified [puppet] - 10https://gerrit.wikimedia.org/r/1129223 [15:50:29] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1129307 (https://phabricator.wikimedia.org/T389381) [15:50:53] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2229 to s6 master [puppet] - 
10https://gerrit.wikimedia.org/r/1129309 (https://phabricator.wikimedia.org/T389382) [15:51:19] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2218 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1129310 (https://phabricator.wikimedia.org/T389383) [15:52:59] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2075.codfw.wmnet with OS bullseye [15:53:14] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw row A and B. - https://phabricator.wikimedia.org/T354872#10652870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2075.codfw.wm... [15:53:40] !log hnowlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129296|debug: fix config syntax (T385155)]] (duration: 17m 11s) [15:53:43] T385155: 🧭 Northward Datacentre Switchover (March 2025) - https://phabricator.wikimedia.org/T385155 [15:54:08] (03CR) 10Fabfur: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129223 (owner: 10Fabfur) [15:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10652873 (10phaultfinder) [15:55:20] zip, btullis: all done, thank you [15:57:24] (03CR) 10MVernon: [C:03+2] swift: re-add ms-be2075 to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1128908 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [15:57:44] (03PS2) 10MVernon: swift: re-add ms-be2075 to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1128908 (https://phabricator.wikimedia.org/T354872) [15:58:54] (03CR) 10MVernon: [C:03+2] swift: re-add ms-be2075 to the rings [puppet] - 10https://gerrit.wikimedia.org/r/1128908 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [15:59:20] thank you [15:59:32] just in time for the big long meeting to end [15:59:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T388236#10652930 (10phaultfinder) [16:00:51] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10652948 (10Jhancock.wm) @MatthewVernon I ran into an issue with this install. The OS installed and passed the puppet certifica... [16:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [16:03:46] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10652957 (10MoritzMuehlenhoff) [16:04:57] well then [16:05:01] YOLO [16:07:10] welp https://phabricator.wikimedia.org/P74252 [16:08:17] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10652969 (10Papaul) @wiki_willy During the testing process of msw2 we noticed that et-0/1/0 which is a 100G interface doesn't support auto-negotiation so making it impossible to communicate... [16:12:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2089.codfw.wmnet with OS bullseye [16:13:06] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10652996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be2089.codfw.... 
[16:13:25] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage [16:13:25] !log mvernon@cumin2002 START - Cookbook sre.hosts.move-vlan for host ms-be2089 [16:14:30] !log mvernon@cumin2002 START - Cookbook sre.dns.netbox [16:16:52] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage [16:18:30] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2089 - mvernon@cumin2002" [16:18:35] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host ms-be2089 - mvernon@cumin2002" [16:18:36] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:18:36] !log mvernon@cumin2002 START - Cookbook sre.dns.wipe-cache ms-be2089.codfw.wmnet 15.48.192.10.in-addr.arpa 5.1.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:18:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ms-be2089.codfw.wmnet 15.48.192.10.in-addr.arpa 5.1.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:18:40] !log mvernon@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-be2089 [16:19:16] got it working – that's my dry runs done [16:20:11] (03PS1) 10Hokwelum: Fix error evaluating function `unit` [skins/Modern] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129315 (https://phabricator.wikimedia.org/T389384) [16:21:59] Flow is readonly on four of these wikis, so my script would be ineffectual. Proceeding to move Flow pages for officewiki only... 
[16:23:12] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:24:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be2089 [16:24:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ms-be2089 [16:24:40] (03PS2) 10Alexandros Kosiaris: Spiderpig: Declare the spiderpig group [puppet] - 10https://gerrit.wikimedia.org/r/1129239 (https://phabricator.wikimedia.org/T383945) [16:24:53] (03CR) 10CI reject: [V:04-1] Spiderpig: Declare the spiderpig group [puppet] - 10https://gerrit.wikimedia.org/r/1129239 (https://phabricator.wikimedia.org/T383945) (owner: 10Alexandros Kosiaris) [16:25:01] and that had zero effect; https://phabricator.wikimedia.org/P74258 [16:25:08] (03PS3) 10Alexandros Kosiaris: Spiderpig: Declare the spiderpig group [puppet] - 10https://gerrit.wikimedia.org/r/1129239 (https://phabricator.wikimedia.org/T383945) [16:25:17] I have a meeting; will stop here. [16:27:27] (03CR) 10CI reject: [V:04-1] Spiderpig: Declare the spiderpig group [puppet] - 10https://gerrit.wikimedia.org/r/1129239 (https://phabricator.wikimedia.org/T383945) (owner: 10Alexandros Kosiaris) [16:27:38] (03CR) 10JHathaway: [C:03+1] "looks good, thanks for taking the time to explore the vmail approach." 
[puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:27:57] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:codfw and A:cp for 9.2.9-1wm1 [16:28:23] (03PS4) 10Alexandros Kosiaris: Spiderpig: Declare the spiderpig group [puppet] - 10https://gerrit.wikimedia.org/r/1129239 (https://phabricator.wikimedia.org/T383945) [16:28:33] (03CR) 10AikoChou: [C:03+1] inference-services: edit-check GPU version deployment on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [16:29:05] (03CR) 10JHathaway: [C:03+1] community_civicrm: Add profile::community_civicrm::mail [puppet] - 10https://gerrit.wikimedia.org/r/1128565 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [16:31:15] !log restart pybal on low-traffic eqiad/codfw to remove two old/unused kartotherian ports [16:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:30] (03CR) 10Elukey: [C:03+2] service: set kartotherian and kartotherian-ssl to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128345 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [16:31:31] jouncebot: now [16:31:31] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [16:31:40] jouncebot: nowandnext [16:31:40] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [16:31:40] In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1700) [16:31:59] zip, btullis: I got another train blocker fix I want to deploy, ok with you to start a backport now? 
[16:32:40] I'm done for now [16:33:13] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 188186816 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:34:04] 10ops-drmrs, 06Infrastructure-Foundations, 10netops: cr1-drmrs to asw1-b12-drmrs link down - https://phabricator.wikimedia.org/T389071#10653129 (10RobH) [16:35:13] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 187608 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:35:16] (03PS1) 10BCornwall: cdn: Add roll-upgrade-varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) [16:35:30] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2001.codfw.wmnet with OS bookworm [16:35:38] (03CR) 10Alexandros Kosiaris: [C:03+2] Revert^2 "profile::scap::spiderpig: New profile for setting up SpiderPig" [puppet] - 10https://gerrit.wikimedia.org/r/1129289 (owner: 10Alexandros Kosiaris) [16:35:42] (03CR) 10Alexandros Kosiaris: [C:03+2] Spiderpig: Declare the spiderpig group [puppet] - 10https://gerrit.wikimedia.org/r/1129239 (https://phabricator.wikimedia.org/T383945) (owner: 10Alexandros Kosiaris) [16:36:25] zip: thx [16:36:30] gonna backport a train fix in the next 5m if there are no objections [16:37:27] FIRING: [2x] SystemdUnitFailed: spiderpig-apiserver.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:38:47] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:39:15] this is me --^ [16:39:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T388236#10653180 (10phaultfinder) [16:40:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy2002 using scap backport" [skins/Modern] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129315 (https://phabricator.wikimedia.org/T389384) (owner: 10Hokwelum) [16:40:28] !log restart pybal on lvs201[3,4] and run ipvsadm --delete-service --tcp-service 10.2.1.13:{443,6533} [16:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:39] (03Merged) 10jenkins-bot: Fix error evaluating function `unit` [skins/Modern] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129315 (https://phabricator.wikimedia.org/T389384) (owner: 10Hokwelum) [16:42:11] !log jnuche@deploy2002 Started scap sync-world: Backport for [[gerrit:1129315|Fix error evaluating function `unit` (T389384)]] [16:42:14] T389384: Less_Exception_Compiler: error evaluating function `unit` The first argument to unit must be a number. Have you forgotten parenthesis? 
index: 153 in ext.echo.styles.badge.less on line 6, column 14| @size-icon: 14px;5| @font-s - https://phabricator.wikimedia.org/T389384 [16:42:46] (03CR) 10BCornwall: "Example dry-run output:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [16:42:53] (03CR) 10BCornwall: [V:03+1] cdn: Add roll-upgrade-varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [16:43:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2089.codfw.wmnet with reason: host reimage [16:43:26] !log restart pybal on lvs10[19,20] and run ipvsadm --delete-service --tcp-service 10.2.1.13:{443,6533} [16:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:45] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:44:39] RESOLVED: [2x] SystemdUnitFailed: spiderpig-apiserver.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:14] (03PS1) 10Ayounsi: Remove HE through SG.IX [homer/public] - 10https://gerrit.wikimedia.org/r/1129320 (https://phabricator.wikimedia.org/T386987) [16:45:21] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:45:30] (03PS2) 10Gkyziridis: ml-services: edit-check GPU version deployment on staging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) [16:46:26] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: edit-check GPU version deployment on staging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1129285 (https://phabricator.wikimedia.org/T386100) (owner: 10Gkyziridis) [16:46:30] (03PS2) 10BCornwall: cdn: Add roll-upgrade-varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) [16:46:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2089.codfw.wmnet with reason: host reimage [16:46:55] !log jnuche@deploy2002 hokwelum, jnuche: Backport for [[gerrit:1129315|Fix error evaluating function `unit` (T389384)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:46:56] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:47:04] !log jnuche@deploy2002 hokwelum, jnuche: Continuing with sync [16:48:46] (03PS3) 10Elukey: service, conftool-data: final removal for unused Kartotherian configs [puppet] - 10https://gerrit.wikimedia.org/r/1128346 (https://phabricator.wikimedia.org/T389042) [16:50:23] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10653237 (10cmooney) @Papaul those ports do support 40G QSFP+ modules. They're just a bit awkward to get going. First of all I deleted all the VC-ports on the system, although it seems you'... 
[16:50:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10653241 (10phaultfinder) [16:50:58] (03CR) 10Elukey: [C:03+2] service, conftool-data: final removal for unused Kartotherian configs [puppet] - 10https://gerrit.wikimedia.org/r/1128346 (https://phabricator.wikimedia.org/T389042) (owner: 10Elukey) [16:51:16] (03PS5) 10Bking: ES/rolling-operation: add a optional flag to ask for confirmation before running operation [cookbooks] - 10https://gerrit.wikimedia.org/r/1119058 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [16:51:52] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:53:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:53:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [16:53:52] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:54:43] !log restart pybal on lvs-low-traffic and secondary in eqiad/codfw [16:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:03] !log jnuche@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129315|Fix error evaluating function `unit` (T389384)]] (duration: 12m 52s) [16:55:07] T389384: Less_Exception_Compiler: error evaluating function `unit` The first argument to unit must be a number. Have you forgotten parenthesis? 
index: 153 in ext.echo.styles.badge.less on line 6, column 14| @size-icon: 14px;5| @font-s - https://phabricator.wikimedia.org/T389384 [16:55:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10653282 (10phaultfinder) [16:55:42] (03PS3) 10Elukey: maps: remove Kartotherian from bare metal nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128348 (https://phabricator.wikimedia.org/T389042) [16:56:04] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:56:17] (03CR) 10Bking: [C:03+2] Cirrus: Prepare production hosts for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1129264 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [16:56:42] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:58:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kartotherian-ssl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:58:49] ok what is this now [16:59:11] this is config-master2001 that didn't like me [16:59:15] (03Merged) 10jenkins-bot: ES/rolling-operation: add a optional flag to ask for confirmation before running operation [cookbooks] - 10https://gerrit.wikimedia.org/r/1119058 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [16:59:34] hmm, might be one of the validation errors we saw with the wdqs pools [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1700) [17:00:07] (03CR) 10Volans: "I did a quick pass, I'll leave the specific logic and the change in functionalities to traffic. The overall structure looks ok." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [17:00:09] sukhe: under /var/run/confd-template on config-master2001 there are some err files, I think due to the removal [17:00:15] yep we should rm -rf those [17:00:16] but checking [17:00:59] yeah, karthotherian-ssl ones [17:01:01] safe to remove [17:01:07] elukey: want to do the honours? :) [17:01:43] late for you, I can also do it [17:01:48] just don't want to step on your feet [17:01:48] nono doing it [17:01:52] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10653316 (10Papaul) [17:01:53] ok [17:03:41] RESOLVED: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_kartotherian-ssl.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:04:05] !log remove spurious kartotherian err files under config-master2001:/var/run/confd-template [17:04:08] sukhe: --^ [17:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:11] should be good! [17:04:50] yep, thanks! 
[17:04:53] enjoy your evening [17:05:48] (03PS2) 10DCausse: cirrus: explicitly route search traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) [17:05:48] (03PS2) 10DCausse: cirrus: explicitly route search traffic to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) [17:05:48] (03PS2) 10DCausse: cirrus: switch search traffic back to multi-DC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129183 (https://phabricator.wikimedia.org/T388610) [17:06:26] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [17:06:30] (03PS3) 10BCornwall: cdn: Add roll-upgrade-varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) [17:06:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [17:06:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2089.codfw.wmnet with OS bullseye [17:07:00] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10653337 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be2089.codfw.wmne... 
[17:09:26] (03PS2) 10Vgutierrez: liberica: Allow configuring UDP services [puppet] - 10https://gerrit.wikimedia.org/r/1128892 (https://phabricator.wikimedia.org/T389210) [17:09:26] (03PS1) 10Vgutierrez: wmflib,liberica: Add support for DNS healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1129326 (https://phabricator.wikimedia.org/T389211) [17:10:11] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Recommission testhost2001.codfw.wmnet as ms-be2089.codfw.wmnet - https://phabricator.wikimedia.org/T388221#10653343 (10MatthewVernon) @Jhancock.wm OK, I think I've got it sorted now (and moved it to a new-style VLAN while I was at it)... [17:10:16] (03PS2) 10Vgutierrez: wmflib,liberica: Add support for DNS healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1129326 (https://phabricator.wikimedia.org/T389211) [17:11:16] (03PS1) 10Alexandros Kosiaris: spiderpig: Switch envoy servername to FQDN [puppet] - 10https://gerrit.wikimedia.org/r/1129327 (https://phabricator.wikimedia.org/T383945) [17:11:42] (03CR) 10CI reject: [V:04-1] spiderpig: Switch envoy servername to FQDN [puppet] - 10https://gerrit.wikimedia.org/r/1129327 (https://phabricator.wikimedia.org/T383945) (owner: 10Alexandros Kosiaris) [17:14:10] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129326 (https://phabricator.wikimedia.org/T389211) (owner: 10Vgutierrez) [17:21:38] (03PS1) 10Alexandros Kosiaris: spiderpig: Switch to listening on :: instead of 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/1129330 (https://phabricator.wikimedia.org/T383945) [17:22:16] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 98306864 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:22:31] (03PS2) 10Alexandros Kosiaris: spiderpig: Switch envoy servername to FQDN [puppet] - 10https://gerrit.wikimedia.org/r/1129327 
(https://phabricator.wikimedia.org/T383945) [17:22:31] (03PS2) 10Alexandros Kosiaris: spiderpig: Switch to listening on :: instead of 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/1129330 (https://phabricator.wikimedia.org/T383945) [17:23:16] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 101552 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:23:56] (03CR) 10CI reject: [V:04-1] spiderpig: Switch to listening on :: instead of 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/1129330 (https://phabricator.wikimedia.org/T383945) (owner: 10Alexandros Kosiaris) [17:24:10] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10653427 (10bking) a:05Jclark-ctr→03bking @Jclark-ctr Per IRC mention in #wikimedia-dcops: sorry to steal this one from you, but I ju... [17:25:54] (03CR) 10Alexandros Kosiaris: [C:03+2] spiderpig: Switch envoy servername to FQDN [puppet] - 10https://gerrit.wikimedia.org/r/1129327 (https://phabricator.wikimedia.org/T383945) (owner: 10Alexandros Kosiaris) [17:25:59] (03CR) 10Alexandros Kosiaris: [C:03+2] spiderpig: Switch to listening on :: instead of 0.0.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/1129330 (https://phabricator.wikimedia.org/T383945) (owner: 10Alexandros Kosiaris) [17:33:04] !log disable puppet on A:lvs-codfw to roll out CR 1128937 [17:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:23] (03CR) 10Ssingh: [C:03+1] aux-k8s codfw: enable worker ingress [puppet] - 10https://gerrit.wikimedia.org/r/1128937 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:34:31] (03CR) 10Ssingh: [C:03+2] aux-k8s codfw: enable worker ingress [puppet] - 10https://gerrit.wikimedia.org/r/1128937 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:34:40] PROBLEM - Hadoop 
NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:38:40] 10ops-codfw, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T389398 (10phaultfinder) 03NEW [17:39:27] (03CR) 10Alexandros Kosiaris: [C:03+1] CAS: Add service definition for spiderpig [puppet] - 10https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) (owner: 10Muehlenhoff) [17:40:39] 10ops-codfw, 06SRE, 06DC-Ops: codfw:expansion: Console/management wiring - https://phabricator.wikimedia.org/T382383#10653527 (10Papaul) [17:41:29] (03PS1) 10Bking: cirrus: create symlink from /etc/elasticsearch to /etc/opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) [17:41:53] (03CR) 10CI reject: [V:04-1] cirrus: create symlink from /etc/elasticsearch to /etc/opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:42:41] (03PS2) 10Bking: cirrus: create symlink from /etc/elasticsearch to /etc/opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) [17:42:43] (03CR) 10Dzahn: [C:03+1] "lgtm, though I expected at some point in the future the required groups need to be extended if it's meant for all deployers. 
Plenty of dep" [puppet] - 10https://gerrit.wikimedia.org/r/1129292 (https://phabricator.wikimedia.org/T383947) (owner: 10Muehlenhoff) [17:42:55] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:43:01] !log run agent on lvs2014 and restart pybal [17:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:26] !log run agent on lvs2014 and restart pybal [CR 1128937] [17:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:44] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker2002.codfw.wmnet [17:46:01] !log herron@puppetserver1001 conftool action : set/pooled=yes; selector: name=aux-k8s-worker2004.codfw.wmnet [17:49:40] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:49:59] !log run agent on lvs2013 and restart pybal [CR 1128937] [17:50:00] (03PS3) 10Bking: cirrus: create symlink from /etc/elasticsearch to /etc/opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) [17:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:23] (03CR) 10CI reject: [V:04-1] cirrus: create symlink from /etc/elasticsearch to /etc/opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:50:24] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:51:44] (03PS4) 10Bking: cirrus: create symlink from /etc/elasticsearch to /etc/opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1129334 
(https://phabricator.wikimedia.org/T388610) [17:52:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:52:15] !log sudo cumin 'A:lvs-codfw' 'run-puppet-agent --enable "rolling out CR 1128937"' [17:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:03] !log Import varnishkafka 1.1.0-5 into wikimedia-bullseye component/varnish-staging [17:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:21] !log Import varnishkafka 1.1.0-5 into wikimedia-bullseye component/varnish-staging (T389322) [17:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:24] T389322: varnishkafka needs to be linked against libvarnishapi3 - https://phabricator.wikimedia.org/T389322 [17:54:33] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10653567 (10herron) [17:55:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10653571 (10phaultfinder) [17:59:50] 06SRE, 06Infrastructure-Foundations, 07Kubernetes, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): aux-k8s-codfw cluster setup - https://phabricator.wikimedia.org/T381417#10653593 (10herron) 05Open→03Resolved a:03herron Thanks to @ssingh the k8s-ingress-aux.svc.codfw.wmnet LVS is aliv... [18:00:04] jnuche and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1800). 
[18:00:22] (03CR) 10Dzahn: [C:03+2] deployment_server/k8s: set kubeconfig files for codesearch [puppet] - 10https://gerrit.wikimedia.org/r/1126170 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:00:34] 06SRE, 06Infrastructure-Foundations, 07Epic, 07Kubernetes: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams - https://phabricator.wikimedia.org/T378742#10653600 (10herron) [18:02:35] (03CR) 10Herron: [C:03+1] add ingress service aliases for codesearch on k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:03:13] (03CR) 10Herron: [C:03+1] "this is done now, ready to go!" [dns] - 10https://gerrit.wikimedia.org/r/1126180 (https://phabricator.wikimedia.org/T345894) (owner: 10Dzahn) [18:05:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10653629 (10phaultfinder) [18:06:13] (03CR) 10BCornwall: cdn: Add roll-upgrade-varnish (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [18:06:49] (03CR) 10Dzahn: [C:03+2] "thanks and done!:) this did a few things on the deployment server. 
it edited /etc/profile.d/kube-conf.sh and created new cfssl certs for c" [puppet] - 10https://gerrit.wikimedia.org/r/1126170 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:09:03] (03PS4) 10BCornwall: cdn: Add roll-upgrade-varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) [18:09:06] (03CR) 10Dzahn: [C:03+2] add ingress service aliases for codesearch on k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:09:17] (03PS4) 10Dzahn: add ingress service aliases for codesearch on k8s-aux [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) [18:20:17] (03PS1) 10Bartosz Dziewoński: Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) [18:21:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, March 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/Echo] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129336 (https://phabricator.wikimedia.org/T388725) (owner: 10Bartosz Dziewoński) [18:21:22] (03CR) 10Ssingh: "Some questions but looks good otherwise!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [18:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10653888 (10phaultfinder) [18:27:52] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:codfw and A:cp for 9.2.9-1wm1 [18:28:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10653908 (10VRiley-WMF) Starting this process. 
Currently wiping previously installed RAID group on these units [18:28:35] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [18:28:38] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [18:29:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10653919 (10phaultfinder) [18:29:44] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [18:30:18] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [18:30:19] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [18:31:54] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1126177 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:33:13] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=cloudelastic1008.eqiad.wmnet [18:35:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1248 crash - https://phabricator.wikimedia.org/T388837#10653969 (10VRiley-WMF) 05Open→03In progress Proceeding with reseating NIC and the cables. 
Will commence a flea power drain as well [18:38:18] !log dzahn@dns1004 START - running authdns-update [18:40:35] !log dzahn@dns1004 END - running authdns-update [18:40:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10653982 (10phaultfinder) [18:44:13] (03PS1) 10Btullis: Upgrade snapshot hosts to PHP version 8.1 - except snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) [18:44:36] (03CR) 10CI reject: [V:04-1] Upgrade snapshot hosts to PHP version 8.1 - except snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) (owner: 10Btullis) [18:47:01] (03PS2) 10Btullis: Upgrade snapshot hosts to PHP version 8.1 - except snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) [18:47:24] (03CR) 10CI reject: [V:04-1] Upgrade snapshot hosts to PHP version 8.1 - except snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) (owner: 10Btullis) [18:47:46] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1248 crash - https://phabricator.wikimedia.org/T388837#10653993 (10VRiley-WMF) 05In progress→03Resolved This has been completed. Will monitor for a bit. 
Hopefully this should clear the issue [18:48:55] (03PS3) 10Btullis: Upgrade snapshot hosts to PHP version 8.1 - except snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) [18:53:11] (03CR) 10Dzahn: [C:03+2] add k8s ingress service aliases for jaeger in codfw [dns] - 10https://gerrit.wikimedia.org/r/1126180 (https://phabricator.wikimedia.org/T345894) (owner: 10Dzahn) [18:53:18] (03PS3) 10Dzahn: add k8s ingress service aliases for jaeger in codfw [dns] - 10https://gerrit.wikimedia.org/r/1126180 (https://phabricator.wikimedia.org/T345894) [18:55:05] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1126180 (https://phabricator.wikimedia.org/T345894) (owner: 10Dzahn) [18:55:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654015 (10phaultfinder) [18:57:11] !log dzahn@dns1004 START - running authdns-update [18:57:28] (03CR) 10Brouberol: [C:03+1] cirrus: create symlink from /etc/elasticsearch to /etc/opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:58:52] jouncebot: nowandnext [18:58:52] For the next 1 hour(s) and 1 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T1800) [18:58:52] In 1 hour(s) and 1 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T2000) [18:59:18] (03CR) 10Bking: [C:03+2] cirrus: create symlink from /etc/elasticsearch to /etc/opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1129334 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:59:24] !log dzahn@dns1004 END - running authdns-update [18:59:51] (03PS1) 10Jaime Nuche: Edit check: return early in debounced methods if surface is gone [extensions/VisualEditor] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129346 
(https://phabricator.wikimedia.org/T389394) [19:00:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654032 (10phaultfinder) [19:02:00] !log bblack@dns1005 START - running authdns-update [19:02:02] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on A:eqiad and A:cp for 9.2.9-1wm1 [19:02:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jnuche@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129346 (https://phabricator.wikimedia.org/T389394) (owner: 10Jaime Nuche) [19:02:31] (03PS4) 10Btullis: Upgrade snapshot hosts to PHP version 8.1 - except snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) [19:02:36] !log dzahn@dns1004 START - running authdns-update [19:13:51] (03Merged) 10jenkins-bot: Edit check: return early in debounced methods if surface is gone [extensions/VisualEditor] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129346 (https://phabricator.wikimedia.org/T389394) (owner: 10Jaime Nuche) [19:14:27] !log jnuche@deploy2002 Started scap sync-world: Backport for [[gerrit:1129346|Edit check: return early in debounced methods if surface is gone (T389394)]] [19:14:31] T389394: TypeError: Cannot read properties of null (reading 'getView'/'getContext') - https://phabricator.wikimedia.org/T389394 [19:19:26] !log jnuche@deploy2002 jnuche: Backport for [[gerrit:1129346|Edit check: return early in debounced methods if surface is gone (T389394)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:19:32] !log jnuche@deploy2002 jnuche: Continuing with sync [19:19:54] RECOVERY - Restbase root url on restbase1028 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/RESTBase [19:22:33] !log bking@cumin2002 START - Cookbook 
sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [19:22:37] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [19:22:52] (03PS1) 10Gergő Tisza: varnish: Fix X-Wikimedia-Debug cookie handling [puppet] - 10https://gerrit.wikimedia.org/r/1129349 [19:23:35] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [19:25:22] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 295 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1236, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 263, delayed_unassigned_shards [19:25:22] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.73154800783801 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:25:22] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 319 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 795, active_shards: 1273, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 319, delayed_unassigned_shards: [19:25:22] er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.96231155778895 https://wikitech.wikimedia.org/wiki/Search%23Administration 
[19:25:22] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 319 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 795, active_shards: 1273, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 319, delayed_unassigned_shards: [19:25:22] er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.96231155778895 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:25:34] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 319 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 795, active_shards: 1273, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 319, delayed_unassigned_shards: [19:25:34] er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.96231155778895 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:25:46] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 319 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 795, active_shards: 1273, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 319, delayed_unassigned_shards: [19:25:46] er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.96231155778895 
https://wikitech.wikimedia.org/wiki/Search%23Administration [19:25:46] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 294 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1237, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 262, delayed_unassigned_shards [19:25:46] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.79686479425212 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:25:46] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 294 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1237, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 262, delayed_unassigned_shards [19:25:46] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.79686479425212 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:25:58] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 294 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1237, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 262, delayed_unassigned_shards [19:25:58] ber_of_pending_tasks: 4, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 11127, active_shards_percent_as_number: 
80.79686479425212 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:26:11] ^^ expected, I'll try and suppress alerts [19:26:25] although I'm not sure if that will quiet down IRC [19:26:46] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 764, active_shards: 1317, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 193, delayed_unassigned_shards: 0, number_of_pending_t [19:26:46] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.02220770738079 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:26:46] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 764, active_shards: 1317, relocating_shards: 0, initializing_shards: 21, unassigned_shards: 193, delayed_unassigned_shards: 0, number_of_pending_t [19:26:46] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.02220770738079 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:26:58] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 764, active_shards: 1372, relocating_shards: 0, initializing_shards: 13, unassigned_shards: 146, delayed_unassigned_shards: 0, number_of_pending_t [19:26:58] number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 575, active_shards_percent_as_number: 89.61463096015676 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:27:01] !log jnuche@deploy2002 Finished scap sync-world: Backport for [[gerrit:1129346|Edit check: return early in debounced methods if surface is gone (T389394)]] (duration: 12m 34s) [19:27:05] T389394: TypeError: Cannot read properties of null (reading 'getView'/'getContext') - https://phabricator.wikimedia.org/T389394 [19:27:22] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 764, active_shards: 1440, relocating_shards: 0, initializing_shards: 19, unassigned_shards: 72, delayed_unassigned_shards: 0, number_of_pending_ta [19:27:22] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.05617243631613 https://wikitech.wikimedia.org/wiki/Search%23Administration [19:29:22] (03PS1) 10Gergő Tisza: Revert^2 "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129351 [19:29:35] (03PS1) 10Gergő Tisza: Revert^2 "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129352 [19:29:41] (03PS1) 10Gergő Tisza: Revert^2 "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129353 [19:29:50] (03PS1) 10Gergő Tisza: Revert^2 "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129354 [19:32:21] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - 
https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [19:34:15] (03PS1) 10Dzahn: admin: remove spiderpig from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1129356 [19:34:36] (03PS1) 10Ebernhardson: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129357 [19:35:58] (03CR) 10Dzahn: "puppet broken on appservers/snapshot hosts etc, the machines that have the deployment group but did not get the spiderpig user. This is be" [puppet] - 10https://gerrit.wikimedia.org/r/1129289 (owner: 10Alexandros Kosiaris) [19:35:59] (03CR) 10Ahmon Dancy: [C:03+1] "LGTM as a workaround. After this I'll need help with doing this The Right Way." [puppet] - 10https://gerrit.wikimedia.org/r/1129356 (owner: 10Dzahn) [19:37:28] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T389398#10654201 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [19:37:50] (03CR) 10Ebernhardson: [C:03+2] cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129357 (owner: 10Ebernhardson) [19:37:58] (03CR) 10RLazarus: [C:03+1] admin: remove spiderpig from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1129356 (owner: 10Dzahn) [19:38:16] (03CR) 10Dzahn: [C:03+2] admin: remove spiderpig from deployment group [puppet] - 10https://gerrit.wikimedia.org/r/1129356 (owner: 10Dzahn) [19:39:24] (03Merged) 10jenkins-bot: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129357 (owner: 10Ebernhardson) [19:39:42] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [19:42:36] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [19:42:42] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:44:55] !log jhancock@cumin2002 
START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2248 to codfw - jhancock@cumin2002" [19:45:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2248 to codfw - jhancock@cumin2002" [19:45:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:45:49] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2248 [19:45:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2248 [19:46:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2248.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:48:10] PROBLEM - Ensure traffic_manager is running for instance backend on cp5031 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:49:10] RECOVERY - Ensure traffic_manager is running for instance backend on cp5031 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:49:13] !log upgrading varnishkafka to 1.1.0-5 on A:cp-ulsfo and cp30[66,74] (T389322) [19:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:18] T389322: varnishkafka needs to be linked against libvarnishapi3 - https://phabricator.wikimedia.org/T389322 [19:54:48] (03PS2) 10Scott French: mw-on-k8s: align alerts with "pools" of capacity [alerts] - 10https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) [19:54:48] (03CR) 10Scott French: "Thanks in advance for the review! 
This seems like the best option for getting what we intend: alert on the aggregate behavior of a "pool" " [alerts] - 10https://gerrit.wikimedia.org/r/1129358 (https://phabricator.wikimedia.org/T389224) (owner: 10Scott French) [19:57:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2248.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [19:58:58] (03PS1) 10Bartosz Dziewoński: Add logging to help figure unserialization issues [extensions/Echo] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129362 (https://phabricator.wikimedia.org/T388725) [19:59:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/main at eqiad: 19.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:59:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wikikube-worker2248.codfw.wmnet with OS bookworm [19:59:29] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10654250 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wikikube-worker2248.codfw.wmnet with... [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T2000). [20:00:04] bd808 and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:14] o/ [20:00:39] o/ [20:01:07] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - eqiad - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [20:01:19] I can deploy in 5 min [20:01:33] tgr_: awesome [20:01:46] mine is just a wmf-config/LabsServices.php update [20:04:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/main at eqiad: 19.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:07:20] bd808: so that just needs to be merged, right? [20:07:30] anyway I can just batch it with the rest [20:07:52] tgr_: yeah, just merged [20:08:16] it will dirty diff on prod until pulled, but the sync should be a noop [20:08:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:08:52] well... 
technically that file is in the prod image and metal nodes I guess [20:09:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128996 (https://phabricator.wikimedia.org/T389252) (owner: 10BryanDavis) [20:09:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129215 (https://phabricator.wikimedia.org/T389318) (owner: 10Gergő Tisza) [20:09:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129351 (owner: 10Gergő Tisza) [20:09:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129352 (owner: 10Gergő Tisza) [20:09:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129353 (owner: 10Gergő Tisza) [20:09:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129354 (owner: 10Gergő Tisza) [20:10:01] (03Merged) 10jenkins-bot: LabsServices: use appservers service name for parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128996 (https://phabricator.wikimedia.org/T389252) (owner: 10BryanDavis) [20:10:04] (03Merged) 10jenkins-bot: wikitech: Remove $wgCookieDomain override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129215 (https://phabricator.wikimedia.org/T389318) (owner: 10Gergő Tisza) [20:10:15] backend response time is pretty high [20:11:00] (03Merged) 10jenkins-bot: Revert^2 "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129351 
(owner: 10Gergő Tisza) [20:11:01] (03Merged) 10jenkins-bot: Revert^2 "Allowlist Special:WikimediaDebug on the shared domain" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129352 (owner: 10Gergő Tisza) [20:11:03] o/, looking as well [20:11:18] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1086669976 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:13:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [20:14:16] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 44352 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:15:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2248.codfw.wmnet with reason: host reimage [20:15:53] (03Merged) 10jenkins-bot: Revert^2 "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1129353 (owner: 10Gergő Tisza) [20:16:22] (03Merged) 10jenkins-bot: Revert^2 "Fix SUL3 login cohort logic" [extensions/CentralAuth] (wmf/1.44.0-wmf.21) - 10https://gerrit.wikimedia.org/r/1129354 (owner: 10Gergő Tisza) [20:16:56] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1128996|LabsServices: use appservers service name for parsoid (T389252)]], [[gerrit:1129215|wikitech: Remove $wgCookieDomain override (T389318)]], [[gerrit:1129351|Revert^2 "Allowlist Special:WikimediaDebug on the shared domain"]], [[gerrit:1129352|Revert^2 "Allowlist Special:WikimediaDebug on the shared domain"]], [[gerrit:1129353|Revert^2 "Fix SUL3 login [20:16:56] cohort logic"]], [[gerrit:1129354|Revert^2 "Fix SUL3 login 
cohort logic"]] [20:17:01] T389252: deployment-restbase05.deployment-prep.eqiad1.wikimedia.cloud configured to talk to parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T389252 [20:17:01] T389318: Unable to login via SUL3 on wikitech.wikimedia.org ("Session hijacking" error) - https://phabricator.wikimedia.org/T389318 [20:17:15] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [20:17:18] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [20:17:45] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [20:18:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2248.codfw.wmnet with reason: host reimage [20:19:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:20:41] (03PS5) 10BCornwall: cdn: Add roll-upgrade-varnish [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) [20:20:50] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:20:54] (03CR) 10BCornwall: cdn: Add roll-upgrade-varnish (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [20:22:03] !log tgr@deploy2002 tgr, bd808: Backport for [[gerrit:1128996|LabsServices: use appservers service name for parsoid (T389252)]], [[gerrit:1129215|wikitech: Remove $wgCookieDomain override (T389318)]], [[gerrit:1129351|Revert^2 "Allowlist Special:WikimediaDebug on the shared domain"]], [[gerrit:1129352|Revert^2 "Allowlist Special:WikimediaDebug on the shared domain"]], 
[[gerrit:1129353|Revert^2 "Fix SUL3 login cohort logic [20:22:03] "]], [[gerrit:1129354|Revert^2 "Fix SUL3 login cohort logic"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:22:07] T389252: deployment-restbase05.deployment-prep.eqiad1.wikimedia.cloud configured to talk to parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T389252 [20:22:07] T389318: Unable to login via SUL3 on wikitech.wikimedia.org ("Session hijacking" error) - https://phabricator.wikimedia.org/T389318 [20:22:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654319 (10phaultfinder) [20:23:01] PROBLEM - OpenSearch health check for shards on 9600 on cloudelastic1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 319 threshold =0.15 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 1273, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 319, delayed_unassigned_shards: [20:23:01] er_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 79.96231155778895 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:23:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:24:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10654321 (10Eevans) @Jclark-ctr are these being handed off to #data-persistence ? Are they ready to go? 
[20:26:50] (03PS1) 10Reedy: nooop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129365 [20:26:50] (03CR) 10Reedy: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129365 (owner: 10Reedy) [20:34:39] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:34:39] !log tgr@deploy2002 tgr, bd808: Continuing with sync [20:34:51] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:35:06] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1115.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:35:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:35:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2248.codfw.wmnet with OS bookworm [20:35:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10654353 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wikikube-worker2248.codfw.wmnet with OS... 
[20:35:56] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1113.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:36:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:36:19] PROBLEM - Disk space on maps1009 is CRITICAL: DISK CRITICAL - free space: / 2625 MB (3% inode=96%): /tmp 2625 MB (3% inode=96%): /var/tmp 2625 MB (3% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops [20:36:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1113.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:36:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1114.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:36:56] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1113.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:37:04] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1113.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:37:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654368 (10phaultfinder) [20:38:39] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1113.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:39:17] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 587906704 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:39:25] 
10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 10RESTBase: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10654369 (10Jclark-ctr) @Eevans yes these are finished [20:40:17] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 64184 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:41:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:42:22] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1128996|LabsServices: use appservers service name for parsoid (T389252)]], [[gerrit:1129215|wikitech: Remove $wgCookieDomain override (T389318)]], [[gerrit:1129351|Revert^2 "Allowlist Special:WikimediaDebug on the shared domain"]], [[gerrit:1129352|Revert^2 "Allowlist Special:WikimediaDebug on the shared domain"]], [[gerrit:1129353|Revert^2 "Fix SUL3 logi [20:42:22] n cohort logic"]], [[gerrit:1129354|Revert^2 "Fix SUL3 login cohort logic"]] (duration: 25m 25s) [20:42:26] T389252: deployment-restbase05.deployment-prep.eqiad1.wikimedia.cloud configured to talk to parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T389252 [20:42:27] T389318: Unable to login via SUL3 on wikitech.wikimedia.org ("Session hijacking" error) - https://phabricator.wikimedia.org/T389318 [20:44:39] !log UTC late deploys done [20:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1115.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:47:05] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host elastic1112.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:47:28] (03PS1) 10Jforrester: 
wikifunctions: Update evaluators from 2025-03-11-234147 to 2025-03-19-125950 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129367 (https://phabricator.wikimedia.org/T314342) [20:47:37] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-03-11-234105 to 2025-03-19-203723 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129368 (https://phabricator.wikimedia.org/T314342) [20:48:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1114.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:49:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.277s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:49:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1111.eqiad.wmnet with OS bullseye [20:49:39] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:49:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10654421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1111.eqiad.wmnet with O... 
[20:50:01] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1012 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 1356, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 234, delayed_unassigned_shards: 0, number_of_pending_ta [20:50:01] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 4, active_shards_percent_as_number: 85.17587939698493 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:50:23] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 1420, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 172, delayed_unassigned_shards: 0, number_of_pending_ta [20:50:23] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 89.19597989949749 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:50:23] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 1420, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 172, delayed_unassigned_shards: 0, number_of_pending_ta [20:50:23] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 89.19597989949749 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:50:33] RECOVERY - OpenSearch health check for shards on 9600 on 
cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 1453, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 138, delayed_unassigned_shards: 0, number_of_pending_ta [20:50:33] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 82, active_shards_percent_as_number: 91.26884422110552 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:50:45] RECOVERY - OpenSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 795, active_shards: 1499, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 93, delayed_unassigned_shards: 0, number_of_pending_tas [20:50:45] umber_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.15829145728644 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:50:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1113.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:51:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10654439 (10Jclark-ctr) [20:53:05] (03PS9) 10Volans: netbox: refactor support for GraphQL queries [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [20:53:44] FIRING: [2x] KubernetesDeploymentUnavailableReplicas: Deployment thumbor-main in thumbor at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [20:54:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-int/main (k8s) 1.666s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:55:33] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [20:55:38] T389119: Upgrade wmf_opensearch_search_plugins .deb and restart opensearch - https://phabricator.wikimedia.org/T389119 [20:55:46] (03PS6) 10Bartosz Dziewoński: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) [20:55:55] (03CR) 10Bartosz Dziewoński: "You mean httpbb, right? 
Although I remember phpBB fondly ;)" [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: 10Bartosz Dziewoński) [20:56:03] (03PS7) 10Bartosz Dziewoński: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) [20:56:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654444 (10phaultfinder) [20:56:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: upgrade search plugins - bking@cumin2002 - T389119 [20:56:55] (03PS10) 10Volans: netbox: refactor support for GraphQL queries [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [20:57:09] (03PS1) 10Dzahn: vrts: add parameters for exim_deny_senders from private repo [puppet] - 10https://gerrit.wikimedia.org/r/1129369 (https://phabricator.wikimedia.org/T389356) [20:57:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/main (k8s) 1.542s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:58:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1112.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [20:58:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 302 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, 
active_shards: 1229, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 270, delayed_unassigned_shards [20:58:45] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.27433050293925 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:58:45] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 302 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1229, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 270, delayed_unassigned_shards [20:58:45] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.27433050293925 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:58:53] (03CR) 10Volans: "Arzhel, as agreed I've added the tests." 
[software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [20:58:57] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 302 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1229, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 270, delayed_unassigned_shards [20:58:57] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.27433050293925 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:59:20] (03CR) 10Volans: netbox: refactor support for GraphQL queries (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1094291 (owner: 10Ayounsi) [20:59:23] PROBLEM - OpenSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 301 threshold =0.15 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1230, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 269, delayed_unassigned_shards [20:59:23] ber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 80.33964728935337 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:59:52] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on A:eqiad and A:cp for 9.2.9-1wm1 [21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T2100) [21:00:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host elastic1112.eqiad.wmnet with 
OS bullseye [21:00:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install elastic1111-elastic1125 - https://phabricator.wikimedia.org/T384966#10654473 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host elastic1112.eqiad.wmnet with O... [21:00:41] inflatador: are the above opensearch alerts expected? [21:01:07] (03PS1) 10Bking: rolling-operation.py: Put back a reference to nodes.start_elasticsearch() [cookbooks] - 10https://gerrit.wikimedia.org/r/1129373 (https://phabricator.wikimedia.org/T383811) [21:01:52] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-03-11-234147 to 2025-03-19-125950 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129367 (https://phabricator.wikimedia.org/T314342) (owner: 10Jforrester) [21:02:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-web/main (k8s) 1.359s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:02:38] (03PS1) 10Dzahn: vrts: add profile::vrts::exim_deny_senders with fake value [labs/private] - 10https://gerrit.wikimedia.org/r/1129374 [21:02:55] PROBLEM - Restbase root url on restbase1028 is CRITICAL: connect to address 10.64.0.208 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [21:03:30] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-03-11-234147 to 2025-03-19-125950 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129367 (https://phabricator.wikimedia.org/T314342) (owner: 10Jforrester) [21:03:38] (03PS2) 10Dzahn: vrts: add profile::vrts::exim_deny_senders with fake value [labs/private] - 10https://gerrit.wikimedia.org/r/1129374 
(https://phabricator.wikimedia.org/T389079) [21:03:39] (03PS1) 10Volans: spicerack: convert some @property into methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1129375 (https://phabricator.wikimedia.org/T389329) [21:03:51] (03CR) 10Dzahn: [V:03+2 C:03+2] vrts: add profile::vrts::exim_deny_senders with fake value [labs/private] - 10https://gerrit.wikimedia.org/r/1129374 (https://phabricator.wikimedia.org/T389079) (owner: 10Dzahn) [21:04:20] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:04:48] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:05:12] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:05:57] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1011 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1302, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 197, delayed_unassigned_shards: 0, number_of_pending_t [21:05:57] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.04245591116917 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:06:10] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:06:12] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:06:23] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1306, relocating_shards: 0, initializing_shards: 32, 
unassigned_shards: 193, delayed_unassigned_shards: 0, number_of_pending_t [21:06:23] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.3037230568256 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:06:26] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-03-11-234105 to 2025-03-19-203723 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129368 (https://phabricator.wikimedia.org/T314342) (owner: 10Jforrester) [21:06:45] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1307, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 192, delayed_unassigned_shards: 0, number_of_pending_t [21:06:45] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.36903984323972 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:06:45] RECOVERY - OpenSearch health check for shards on 9200 on cloudelastic1007 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 4, number_of_data_nodes: 4, discovered_master: True, active_primary_shards: 764, active_shards: 1307, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 192, delayed_unassigned_shards: 0, number_of_pending_t [21:06:45] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.36903984323972 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:07:11] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:07:48] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-03-11-234105 to 
2025-03-19-203723 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1129368 (https://phabricator.wikimedia.org/T314342) (owner: 10Jforrester) [21:08:31] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:09:02] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:09:46] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:10:27] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:10:38] (03PS2) 10Dzahn: vrts: add parameters for exim_deny_senders from private repo [puppet] - 10https://gerrit.wikimedia.org/r/1129369 (https://phabricator.wikimedia.org/T389356) [21:10:40] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:11:21] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:11:35] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:11:48] (03PS2) 10Ayounsi: Remove v6 include for e8/f8 uplinks [dns] - 10https://gerrit.wikimedia.org/r/1091711 (https://phabricator.wikimedia.org/T380050) [21:12:27] (03CR) 10CI reject: [V:04-1] Remove v6 include for e8/f8 uplinks [dns] - 10https://gerrit.wikimedia.org/r/1091711 (https://phabricator.wikimedia.org/T380050) (owner: 10Ayounsi) [21:12:49] (03PS3) 10Dzahn: vrts: add parameters for exim_deny_senders from private repo [puppet] - 10https://gerrit.wikimedia.org/r/1129369 (https://phabricator.wikimedia.org/T389356) [21:14:48] (03CR) 10CI reject: [V:04-1] spicerack: convert some @property into methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/1129375 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans) [21:15:47] (03CR) 10Dzahn: [C:03+2] vrts: add parameters for exim_deny_senders from private repo [puppet] - 10https://gerrit.wikimedia.org/r/1129369 (https://phabricator.wikimedia.org/T389356) (owner: 10Dzahn) [21:17:15] (03CR) 10Volans: 
"recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1129375 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans) [21:17:21] !log eevans@cumin1002 START - Cookbook sre.dns.netbox [21:17:27] (03PS1) 10Scott French: hieradata: migrate mw-misc to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1128923 (https://phabricator.wikimedia.org/T383845) [21:17:29] (03PS1) 10Scott French: mw-misc: migrate to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128924 (https://phabricator.wikimedia.org/T383845) [21:17:54] (03PS2) 10Ryan Kemper: rolling-operation.py: Put back a reference to nodes.start_elasticsearch() [cookbooks] - 10https://gerrit.wikimedia.org/r/1129373 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [21:18:28] (03Abandoned) 10Ryan Kemper: sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - 10https://gerrit.wikimedia.org/r/1128536 (owner: 10Ryan Kemper) [21:18:37] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2249 to codfw - jhancock@cumin2002" [21:18:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2249 to codfw - jhancock@cumin2002" [21:18:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:19:51] !log eevans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:21:04] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2249 [21:22:54] (03CR) 10Hashar: "Sorry I have missed this one or I would have deployed it this morning!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128996 (https://phabricator.wikimedia.org/T389252) (owner: 10BryanDavis) [21:23:12] (03PS3) 10Ryan Kemper: rolling-operation.py: Put back a reference to nodes.start_elasticsearch() [cookbooks] - 10https://gerrit.wikimedia.org/r/1129373 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [21:23:19] (03PS4) 10Ryan Kemper: rolling-operation.py: Put back a reference to nodes.start_elasticsearch() [cookbooks] - 10https://gerrit.wikimedia.org/r/1129373 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [21:23:52] (03PS1) 10Ryan Kemper: sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - 10https://gerrit.wikimedia.org/r/1129376 [21:26:00] (03PS1) 10Eevans: restbase: commission restbase1043 (refresh for restbase1028) [puppet] - 10https://gerrit.wikimedia.org/r/1129377 (https://phabricator.wikimedia.org/T389423) [21:26:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2249 [21:26:41] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129377 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [21:31:02] (03CR) 10Dzahn: "there was a puppet error on the machines that have the deployment group but not the spiderpig user (remaining mw*, snapshot*, ..). 
Instead" [puppet] - 10https://gerrit.wikimedia.org/r/1129289 (owner: 10Alexandros Kosiaris) [21:32:13] (03CR) 10Dzahn: "moved the actual value to private repo with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129369" [puppet] - 10https://gerrit.wikimedia.org/r/1128888 (https://phabricator.wikimedia.org/T389079) (owner: 10Arnaudb) [21:33:47] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:35:00] (03PS2) 10Dzahn: create a namespace for codesearch on k8s-aux cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) [21:35:08] (03CR) 10Dzahn: create a namespace for codesearch on k8s-aux cluster (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [21:35:47] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:36:26] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Decommission E/F 8 Dell switches - https://phabricator.wikimedia.org/T380050#10654652 (10BCornwall) Re: https://gerrit.wikimedia.org/r/c/operations/dns/+/1091711/comments/5e6962e8_b88980ce - Do the IPs need to be deleted from netbox? [21:39:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654657 (10phaultfinder) [21:41:52] (03CR) 10BryanDavis: "No worries. I used the belt and suspenders approach of pinging you for review and scheduling for a backport window. 
:)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128996 (https://phabricator.wikimedia.org/T389252) (owner: 10BryanDavis) [21:49:48] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129343 (https://phabricator.wikimedia.org/T382484) (owner: 10Btullis) [21:50:32] (03CR) 10Ryan Kemper: [C:03+2] rolling-operation.py: Put back a reference to nodes.start_elasticsearch() [cookbooks] - 10https://gerrit.wikimedia.org/r/1129373 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [21:51:33] (03CR) 10Ryan Kemper: [C:03+2] sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - 10https://gerrit.wikimedia.org/r/1129376 (owner: 10Ryan Kemper) [21:52:47] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:54:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654729 (10phaultfinder) [21:55:47] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:57:48] (03PS1) 10Ahmon Dancy: profile::mediawiki::system_users: Create spiderpig user [puppet] - 10https://gerrit.wikimedia.org/r/1129389 [21:59:24] jouncebot: nowandnext [21:59:24] For the next 0 hour(s) and 0 minute(s): Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T2100) [21:59:25] In 0 hour(s) and 0 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T2200) [21:59:37] I need to roll back the train [21:59:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654747 (10phaultfinder) [21:59:57] (03Merged) 10jenkins-bot: sre.elasticsearch.rolling-operation: log correct operation type [cookbooks] - 10https://gerrit.wikimedia.org/r/1129376 (owner: 
10Ryan Kemper) [21:59:59] it should be quick [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250319T2200) [22:01:02] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129393 (https://phabricator.wikimedia.org/T386216) [22:01:04] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129393 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [22:02:02] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129393 (https://phabricator.wikimedia.org/T386216) (owner: 10TrainBranchBot) [22:02:55] (03PS2) 10Ahmon Dancy: profile::mediawiki::system_users: Create spiderpig user [puppet] - 10https://gerrit.wikimedia.org/r/1129389 [22:03:03] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [22:03:37] (03CR) 10Bking: [C:03+1] "Plus one, feel free to schedule at will" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [22:05:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [22:06:00] (03CR) 10Ryan Kemper: [C:03+1] cirrus: explicitly route search traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129181 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [22:10:37] (03CR) 10Ahmon Dancy: [C:04-1] "Needs work" [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [22:14:04] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.21 refs T386216 [22:14:07] 
T386216: 1.44.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T386216 [22:14:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654804 (10phaultfinder) [22:14:44] roll back complete [22:15:59] (03PS3) 10Ahmon Dancy: profile::mediawiki::system_users: Create spiderpig user [puppet] - 10https://gerrit.wikimedia.org/r/1129389 [22:15:59] (03PS1) 10Ahmon Dancy: Revert "admin: remove spiderpig from deployment group" [puppet] - 10https://gerrit.wikimedia.org/r/1129408 [22:18:20] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [22:18:23] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129408 (owner: 10Ahmon Dancy) [22:25:37] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [22:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654855 (10phaultfinder) [22:35:24] (03CR) 10Ahmon Dancy: "This should resolve the puppet problem that was reported in #wikimedia-sre." 
[puppet] - 10https://gerrit.wikimedia.org/r/1129389 (owner: 10Ahmon Dancy) [22:43:51] (03PS1) 10Superpes15: [eswiki] and [commonswiki]/[wikidatawiki] Throttle exemption for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) [22:44:33] (03CR) 10Reedy: [eswiki] and [commonswiki]/[wikidatawiki] Throttle exemption for Editathon (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) (owner: 10Superpes15) [22:45:06] (03CR) 10CI reject: [V:04-1] [eswiki] and [commonswiki]/[wikidatawiki] Throttle exemption for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) (owner: 10Superpes15) [22:45:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10654942 (10phaultfinder) [22:46:14] (03PS2) 10Superpes15: Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) [22:47:32] (03CR) 10CI reject: [V:04-1] Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) (owner: 10Superpes15) [22:47:48] (03CR) 10Superpes15: Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) (owner: 10Superpes15) [22:48:50] (03PS3) 10Superpes15: Throttle exemption for Editathon in Ciudad de Buenos Aires - 29 March 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129435 (https://phabricator.wikimedia.org/T389400) [22:57:43] (03CR) 10Krinkle: [C:03+1] "LGTM. 
The inner one should stay singular regsub() indeed, but the other two are for decoding chars which indeed should happen for all matc" [puppet] - 10https://gerrit.wikimedia.org/r/1129349 (owner: 10Gergő Tisza) [23:04:17] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1712022696 and 79 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:10:17] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:11:22] (03CR) 10Zabe: CommonSettings: Migrate CentralNotice to Virtual Domains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [23:11:51] (03CR) 10Reedy: CommonSettings: Migrate CentralNotice to Virtual Domains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [23:12:01] (03PS2) 10Reedy: CommonSettings: Migrate CentralNotice to Virtual Domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) [23:16:13] (03PS10) 10Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) [23:18:51] (03CR) 10Zabe: [C:03+1] CommonSettings: Migrate CentralNotice to Virtual Domains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1129229 (https://phabricator.wikimedia.org/T389348) (owner: 10Reedy) [23:29:17] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1725645576 and 79 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:32:20] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is 
fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [23:34:17] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 42456 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:34:39] PROBLEM - Hadoop NodeManager on an-worker1202 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:36:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite) [23:40:29] (03CR) 10Krinkle: [C:03+1] "As a result of this bug, Special:WikimediaDebug (https://wikitech.wikimedia.org/wiki/WikimediaDebug#Without_a_browser_extension) doesn't w" [puppet] - 10https://gerrit.wikimedia.org/r/1129349 (owner: 10Gergő Tisza) [23:40:41] (03PS2) 10Krinkle: varnish: Fix X-Wikimedia-Debug cookie handling [puppet] - 10https://gerrit.wikimedia.org/r/1129349 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [23:45:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10655071 (10phaultfinder) [23:48:39] RECOVERY - Hadoop NodeManager on an-worker1202 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:48:39] (03CR) 10Ssingh: [C:03+1] "Feel free to resolve the additional comment, so +1." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1129317 (https://phabricator.wikimedia.org/T389387) (owner: 10BCornwall) [23:59:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed