[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1147162 (owner: TrainBranchBot)
[00:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:29] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[00:08:35] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1147903
[00:08:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1147903 (owner: TrainBranchBot)
[00:20:37] (CR) Ejegg: [C:+2] Make BundleSizeTest cross-compatible with <=1.44 and >=1.45 [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147839 (https://phabricator.wikimedia.org/T394542) (owner: Jgleeson)
[00:20:56] (CR) Ejegg: [C:+2] Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147043 (owner: Greg Grossmeier)
[00:21:58] (Merged) jenkins-bot: Make BundleSizeTest cross-compatible with <=1.44 and >=1.45 [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147839 (https://phabricator.wikimedia.org/T394542) (owner: Jgleeson)
[00:22:20] (Merged) jenkins-bot: Merge branch 'master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1147043 (owner: Greg Grossmeier)
[00:25:38] (CR) Scott French: [C:+1] deployment_server: Use cli-image for mw-script [puppet] - https://gerrit.wikimedia.org/r/1147901
(https://phabricator.wikimedia.org/T378479) (owner: RLazarus)
[00:26:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1147903 (owner: TrainBranchBot)
[00:32:22] (CR) RLazarus: [C:+2] deployment_server: Use cli-image for mw-script [puppet] - https://gerrit.wikimedia.org/r/1147901 (https://phabricator.wikimedia.org/T378479) (owner: RLazarus)
[00:33:08] RECOVERY - OpenSearch health check for shards on 9200 on relforge1008 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 51, active_shards: 91, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 12, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight
[00:33:08] 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 85.04672897196261 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:33:18] RECOVERY - OpenSearch health check for shards on 9200 on relforge1009 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 51, active_shards: 93, relocating_shards: 0, initializing_shards: 4, unassigned_shards: 10, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight
[00:33:18] 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.91588785046729 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:33:18] RECOVERY - OpenSearch health check for shards on 9200 on relforge1010 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 51, active_shards: 93, relocating_shards: 0, initializing_shards:
4, unassigned_shards: 10, delayed_unassigned_shards: 0, number_of_pending_tasks: 1, number_of_in_flight
[00:33:18] 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 86.91588785046729 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:36:14] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:44:14] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/a6aa5fd3bedef0baf833256196910128f180a03c9d66ac406089c627a38ef2ae/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:48:29] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:54:42] !log rzl@deploy1003 Started scap sync-world: 1147901
[00:54:53] !log rzl@deploy1003 Stopping before sync operations
[01:04:14] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:08:10] (PS1) TrainBranchBot: Branch commit for wmf/1.45.0-wmf.2 [core] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1147906 (https://phabricator.wikimedia.org/T392172)
[01:08:11] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.45.0-wmf.2 [core] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1147906 (https://phabricator.wikimedia.org/T392172) (owner: TrainBranchBot)
[01:18:58] (Merged) jenkins-bot:
Branch commit for wmf/1.45.0-wmf.2 [core] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1147906 (https://phabricator.wikimedia.org/T392172) (owner: TrainBranchBot)
[01:26:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:30:16] PROBLEM - MariaDB Replica Lag: s4 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 570.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:30:46] PROBLEM - MariaDB Replica Lag: s4 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:31:22] PROBLEM - MariaDB Replica Lag: s4 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:31:26] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:34:52] PROBLEM - Restbase root url on restbase1041 is CRITICAL: connect to address 10.64.48.40 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[01:43:29] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:41] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T0200)
[02:08:29] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:43:29] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T0300)
[03:01:42] (PS1) TrainBranchBot: testwikis to 1.45.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1147915 (https://phabricator.wikimedia.org/T392172)
[03:01:43] (CR) TrainBranchBot: [C:+2] testwikis to 1.45.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1147915 (https://phabricator.wikimedia.org/T392172) (owner: TrainBranchBot)
[03:02:34] (Merged) jenkins-bot: testwikis to 1.45.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1147915 (https://phabricator.wikimedia.org/T392172) (owner: TrainBranchBot)
[03:02:54] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.2 refs T392172
[03:02:58] T392172: 1.45.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T392172
[03:28:29] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[03:28:29] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:32:06] (PS1) RLazarus: deployment_server: Pass mwscript.command in mwscript-k8s values [puppet] - https://gerrit.wikimedia.org/r/1147917 (https://phabricator.wikimedia.org/T378479)
[03:32:12] (PS1) RLazarus: mediawiki: Allow varying the entrypoint through mwscript values [deployment-charts] - https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479)
[03:39:37] (PS2) RLazarus: deployment_server: Pass mwscript.command in mwscript-k8s values [puppet] - https://gerrit.wikimedia.org/r/1147917 (https://phabricator.wikimedia.org/T378479)
[03:39:58] (PS3) RLazarus: deployment_server: Pass mwscript.command in mwscript-k8s values [puppet] - https://gerrit.wikimedia.org/r/1147917 (https://phabricator.wikimedia.org/T378479)
[03:53:45] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.2 refs T392172 (duration: 50m 50s)
[03:53:49] T392172: 1.45.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T392172
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T0400)
[04:01:41] !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.27 (duration: 01m 33s)
[04:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:29] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[04:23:52] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:24:14] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:24:22] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:24:34] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:24:52] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:25:10] FIRING: [2x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:25:14] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:25:22] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:25:34] RECOVERY - BFD status on cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:25:39] FIRING: [2x] CoreBGPDown: Core BGP
session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=drmrs&var-device=cr1-drmrs:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr2-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:30:10] RESOLVED: [4x] BFDdown: BFD session down between cr1-drmrs and 185.15.58.138 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:30:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-drmrs and cr2-eqiad (185.15.58.138) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[04:45:50] (PS1) Marostegui: db1183: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1147920 (https://phabricator.wikimedia.org/T394661)
[04:47:05] (CR) Marostegui: [C:+2] db1183: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1147920 (https://phabricator.wikimedia.org/T394661) (owner: Marostegui)
[04:47:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76318 and previous config saved to /var/cache/conftool/dbconfig/20250520-044744-root.json
[04:48:14] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (gerrit2003), Stale: 2 (cloudservices2005-dev, ...), Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[04:49:39] (PS1) Marostegui: db1155: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1147921 (https://phabricator.wikimedia.org/T394624)
[04:50:39] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[04:51:22] !log Stop
mariadb on db1155, wiki replicas will show lag on: s2, s4, s6 and s7 T394624
[04:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:25] T394624: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624
[04:51:40] (CR) Marostegui: [C:+2] db1155: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1147921 (https://phabricator.wikimedia.org/T394624) (owner: Marostegui)
[04:53:04] PROBLEM - MariaDB Replica IO: s2 on clouddb1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:04] PROBLEM - MariaDB Replica IO: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3316 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:06] PROBLEM - MariaDB Replica IO: s4 on clouddb1019 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3314 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:16] PROBLEM - MariaDB Replica IO: s7 on clouddb1018 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3317 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused)
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:22] ^ this is expected
[04:53:22] PROBLEM - MariaDB Replica IO: s4 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3314 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:30] PROBLEM - MariaDB Replica IO: s4 on clouddb1015 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3314 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1014.eqiad.wmnet with reason: Maintenance
[04:53:46] PROBLEM - MariaDB Replica IO: s6 on clouddb1019 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3316 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:46] PROBLEM - MariaDB Replica IO: s6 on clouddb1015 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3316 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:46] PROBLEM - MariaDB Replica IO: s2 on clouddb1018 is CRITICAL: CRITICAL slave_io_state
Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:53:46] PROBLEM - MariaDB Replica IO: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:04] PROBLEM - MariaDB Replica IO: s7 on an-redacteddb1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3317 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1015.eqiad.wmnet with reason: Maintenance
[04:54:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1019.eqiad.wmnet with reason: Maintenance
[04:54:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1018.eqiad.wmnet with reason: Maintenance
[04:54:38] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Maintenance
[04:56:21] ACKNOWLEDGEMENT - MariaDB Replica IO: s2 on db1155 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui gerrit.wikimedia.org
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:21] ACKNOWLEDGEMENT - MariaDB Replica IO: s4 on db1155 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:21] ACKNOWLEDGEMENT - MariaDB Replica IO: s6 on db1155 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:21] ACKNOWLEDGEMENT - MariaDB Replica IO: s7 on db1155 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:21] ACKNOWLEDGEMENT - MariaDB Replica Lag: s2 on db1155 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:21] ACKNOWLEDGEMENT - MariaDB Replica Lag: s4 on db1155 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:21] ACKNOWLEDGEMENT - MariaDB Replica Lag: s6 on db1155 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:22] ACKNOWLEDGEMENT - MariaDB Replica Lag: s7 on db1155 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:23] ACKNOWLEDGEMENT - MariaDB Replica SQL: s2 on db1155 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:24] ACKNOWLEDGEMENT - MariaDB Replica SQL: s4 on db1155 is
CRITICAL: CRITICAL slave_sql_state could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:24] ACKNOWLEDGEMENT - MariaDB Replica SQL: s6 on db1155 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:25] ACKNOWLEDGEMENT - MariaDB Replica SQL: s7 on db1155 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:56:25] ACKNOWLEDGEMENT - MariaDB read only s2 on db1155 is CRITICAL: Could not connect to localhost:3312 Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:56:25] ACKNOWLEDGEMENT - MariaDB read only s4 on db1155 is CRITICAL: Could not connect to localhost:3314 Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:56:25] ACKNOWLEDGEMENT - MariaDB read only s6 on db1155 is CRITICAL: Could not connect to localhost:3316 Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:56:26] ACKNOWLEDGEMENT - MariaDB read only s7 on db1155 is CRITICAL: Could not connect to localhost:3317 Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[04:56:27] ACKNOWLEDGEMENT - mysqld processes on db1155 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Marostegui gerrit.wikimedia.org https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:00:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2236 T394653', diff saved to https://phabricator.wikimedia.org/P76319 and previous config saved to
/var/cache/conftool/dbconfig/20250520-050017-marostegui.json
[05:00:22] T394653: Test MariaDB 10.11.13 - https://phabricator.wikimedia.org/T394653
[05:02:33] ops-eqiad, SRE, DBA, DC-Ops, Patch-For-Review: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10837779 (Marostegui) @VRiley-WMF db1155 is now off and ready for you to replace the memory whenever you want.
[05:02:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76320 and previous config saved to /var/cache/conftool/dbconfig/20250520-050250-root.json
[05:03:03] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2236.codfw.wmnet with reason: Maintenance
[05:03:47] !log Install 10.11.13 on db2236 T394653
[05:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76321 and previous config saved to /var/cache/conftool/dbconfig/20250520-050710-root.json
[05:08:29] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:14:05] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[05:15:03] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[05:17:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76322 and previous config saved to /var/cache/conftool/dbconfig/20250520-051756-root.json
[05:22:16] !log
marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76323 and previous config saved to /var/cache/conftool/dbconfig/20250520-052215-root.json
[05:25:14] (CR) Muehlenhoff: [C:+2] Enable profile::auto_restarts::service for karma [puppet] - https://gerrit.wikimedia.org/r/1145826 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[05:26:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:33:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76324 and previous config saved to /var/cache/conftool/dbconfig/20250520-053302-root.json
[05:37:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76325 and previous config saved to /var/cache/conftool/dbconfig/20250520-053720-root.json
[05:45:33] (CR) Marostegui: "What's pending here?"
[cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: Federico Ceratto)
[05:48:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76326 and previous config saved to /var/cache/conftool/dbconfig/20250520-054807-root.json
[05:48:41] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76327 and previous config saved to /var/cache/conftool/dbconfig/20250520-055225-root.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T0600)
[06:00:05] marostegui, Amir1, and federico3: Your horoscope predicts another Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T0600).
[06:03:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1183 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76328 and previous config saved to /var/cache/conftool/dbconfig/20250520-060313-root.json
[06:03:29] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:04:35] (CR) Brouberol: [C:+2] airflow: do not package the tls-termination service for devenvs [deployment-charts] - https://gerrit.wikimedia.org/r/1147777 (https://phabricator.wikimedia.org/T393999) (owner: Brouberol)
[06:05:24] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:06:27] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:07:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76329 and previous config saved to /var/cache/conftool/dbconfig/20250520-060731-root.json
[06:08:29] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391312 -
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:15:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391312 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:16:21] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim [06:21:25] fixed ↑ [06:21:27] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 07 Aug 2025 09:25:51 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim [06:22:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76330 and previous config saved to /var/cache/conftool/dbconfig/20250520-062237-root.json [06:24:08] 06SRE, 06Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists: Replace Exim on lists.wikimedia.org with Postfix - https://phabricator.wikimedia.org/T378021#10837839 (10ABran-WMF) [06:37:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76331 and previous config saved to /var/cache/conftool/dbconfig/20250520-063743-root.json [06:43:29] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - 
https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [06:43:53] RECOVERY - ElasticSearch setting check - 9200 on elastic1094 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [06:43:55] RECOVERY - ElasticSearch setting check - 9200 on elastic1100 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [06:44:26] (03PS1) 10Slyngshede: Netbox: Disable until upgrade [alerts] - 10https://gerrit.wikimedia.org/r/1148197 [06:46:03] (03CR) 10CI reject: [V:04-1] Netbox: Disable until upgrade [alerts] - 10https://gerrit.wikimedia.org/r/1148197 (owner: 10Slyngshede) [06:50:15] (03PS1) 10Brouberol: airflow: Monitor empty dag bags [alerts] - 10https://gerrit.wikimedia.org/r/1148198 (https://phabricator.wikimedia.org/T394459) [06:50:15] FIRING: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:52:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2236 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76332 and previous config saved to /var/cache/conftool/dbconfig/20250520-065249-root.json [06:54:34] (03PS1) 10Muehlenhoff: corto: Make the auto restart follow the ensure for Corto [puppet] - 10https://gerrit.wikimedia.org/r/1148199 [06:55:15] RESOLVED: ProbeDown: Service idp1004:443 has failed probes (http_idp_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/CAS-SSO#Alerting - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:57:32] (03PS1) 10Slyngshede: P:idp set theme on CAS 7.1 host [puppet] - 10https://gerrit.wikimedia.org/r/1148200 [06:57:59] (03CR) 10CI reject: [V:04-1] corto: Make the auto restart follow the ensure 
for Corto [puppet] - 10https://gerrit.wikimedia.org/r/1148199 (owner: 10Muehlenhoff) [06:58:52] (03CR) 10Elukey: [C:03+2] profile::prometheus: remove istio recording rules [puppet] - 10https://gerrit.wikimedia.org/r/1147803 (https://phabricator.wikimedia.org/T387350) (owner: 10Elukey) [06:58:58] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5600/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148200 (owner: 10Slyngshede) [07:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T0700). Please do the needful. [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:00:46] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5601/console" [puppet] - 10https://gerrit.wikimedia.org/r/1148200 (owner: 10Slyngshede) [07:02:14] (03PS2) 10Muehlenhoff: corto: Make the auto restart follow the ensure for Corto [puppet] - 10https://gerrit.wikimedia.org/r/1148199 [07:02:18] (03PS1) 10Brouberol: deployment_server: deploy the mediawiki-dumps-legacy scap target [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) [07:02:19] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp set theme on CAS 7.1 host [puppet] - 10https://gerrit.wikimedia.org/r/1148200 (owner: 10Slyngshede) [07:05:24] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:06:27] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:08:57] (03PS1) 10Marostegui: control-mariadb-10.11-bookworm: Bump to 10.11.13 [software] - 10https://gerrit.wikimedia.org/r/1148206 (https://phabricator.wikimedia.org/T394653) [07:09:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148199 (owner: 10Muehlenhoff) [07:10:18] (03CR) 10Marostegui: [C:03+2] control-mariadb-10.11-bookworm: Bump to 10.11.13 [software] - 10https://gerrit.wikimedia.org/r/1148206 (https://phabricator.wikimedia.org/T394653) (owner: 10Marostegui) [07:10:45] (03Merged) 10jenkins-bot: control-mariadb-10.11-bookworm: Bump to 10.11.13 [software] - 10https://gerrit.wikimedia.org/r/1148206 (https://phabricator.wikimedia.org/T394653) (owner: 10Marostegui) [07:15:10] (03PS1) 10Slyngshede: IDP: Switch IDP to CAS version 7.1 [dns] - 10https://gerrit.wikimedia.org/r/1148207 [07:15:59] 10ops-eqiad, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on backup1008 - https://phabricator.wikimedia.org/T394673#10837882 (10jcrespo) Thank you for the prompt action. 
[07:17:13] (03CR) 10Muehlenhoff: [C:03+1] "Setup on 1004 looks good" [dns] - 10https://gerrit.wikimedia.org/r/1148207 (owner: 10Slyngshede) [07:17:35] (03PS27) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [07:18:10] (03CR) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [07:22:52] (03CR) 10Slyngshede: [C:03+2] IDP: Switch IDP to CAS version 7.1 [dns] - 10https://gerrit.wikimedia.org/r/1148207 (owner: 10Slyngshede) [07:23:00] !log slyngshede@dns1004 START - running authdns-update [07:23:45] !log slyngshede@dns1004 END - running authdns-update [07:25:52] 06SRE, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10837896 (10jcrespo) 8 more alerts were received at the SRE board: {F60283925} I've downtimeed them. [07:25:58] !log disabling varnishkafka (webrequest) on A:cp (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1147783) (T393772) [07:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:02] T393772: Shutdown varnishkafka instances - https://phabricator.wikimedia.org/T393772 [07:27:27] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [07:28:29] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:29:21] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1805, relocating_shards: 1, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [07:29:21] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:31:50] (03CR) 10Fabfur: [C:03+2] hiera: disable vk (webrequest) on A:cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1147783 (https://phabricator.wikimedia.org/T393772) (owner: 10Fabfur) [07:32:12] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10837910 (10jcrespo) Hi, would it be possible to downtime this alert while it is being handled? 
[07:32:38] (03CR) 10Ayounsi: calico: Set veth_mtu to 1480 for staging-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [07:35:56] 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests: Grant Jenkins admin rights to Peter Hedenskog (QTE) - https://phabricator.wikimedia.org/T394749 (10hashar) 03NEW [07:36:26] 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, 07Jenkins: Grant Jenkins admin rights to Peter Hedenskog (QTE) - https://phabricator.wikimedia.org/T394749#10837923 (10hashar) [07:36:36] (03CR) 10Elukey: [C:03+2] kartotherian: simplify the readinessProble's path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1128432 (owner: 10Elukey) [07:36:53] (03CR) 10Jcrespo: [C:03+1] "I've forced a full backup rerun, we'll see if that helps. I would prefer further discussions to move to a ticket (gerrit is only ok, I thi" [puppet] - 10https://gerrit.wikimedia.org/r/1140507 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn) [07:43:29] (03PS1) 10Hashar: admin: Add phedenskog to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/1148264 (https://phabricator.wikimedia.org/T394749) [07:44:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [07:45:07] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [07:45:07] (03CR) 10Hashar: "+ @tcipriani@wikimedia.org for ciadmin management approval." [puppet] - 10https://gerrit.wikimedia.org/r/1148264 (https://phabricator.wikimedia.org/T394749) (owner: 10Hashar) [07:45:33] (03CR) 10Elukey: [C:03+2] "Of course I didn't bump the Chart's version.." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1128432 (owner: 10Elukey) [07:46:15] (03PS1) 10Elukey: charts: bump Kartotherian's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148265 [07:47:16] (03CR) 10Elukey: [V:03+2 C:03+2] charts: bump Kartotherian's version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148265 (owner: 10Elukey) [07:48:30] 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10LDAP-Access-Requests, 07Jenkins, 13Patch-For-Review: Grant Jenkins admin rights to Peter Hedenskog (QTE) - https://phabricator.wikimedia.org/T394749#10837946 (10hashar) + @thcipriani as the manager approving shell access / contint-admins (... [07:50:42] (03CR) 10Fabfur: [C:03+1] Remove unused varnishkafka configuration [alerts] - 10https://gerrit.wikimedia.org/r/1146516 (https://phabricator.wikimedia.org/T391810) (owner: 10Fabfur) [07:50:44] (03CR) 10Fabfur: [C:03+2] Remove unused varnishkafka configuration [alerts] - 10https://gerrit.wikimedia.org/r/1146516 (https://phabricator.wikimedia.org/T391810) (owner: 10Fabfur) [07:54:38] !log removing varnishkafka related alerts from prometheus (https://gerrit.wikimedia.org/r/c/operations/alerts/+/1146516) (T393772) [07:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:41] T393772: Shutdown varnishkafka webrequest instances - https://phabricator.wikimedia.org/T393772 [08:00:05] andre and jnuche: #bothumor I � Unicode. All rise for MediaWiki train - Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T0800). 
[08:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:29] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:04:15] (03CR) 10David Caro: "@raymond whenever you have tested report back here for re-review and merging" [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [08:11:12] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148266 (https://phabricator.wikimedia.org/T392172) [08:11:13] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148266 (https://phabricator.wikimedia.org/T392172) (owner: 10TrainBranchBot) [08:12:11] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148266 (https://phabricator.wikimedia.org/T392172) (owner: 10TrainBranchBot) [08:15:39] (03PS1) 10Volans: git::clone: fix support for different remote name [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [08:15:39] (03PS1) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [08:16:56] (03CR) 10CI reject: [V:04-1] homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [08:18:29] (03CR) 10CI reject: [V:04-1] git::clone: fix support for different remote name [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [08:25:11] (03PS2) 10Volans: 
git::clone: fix support for different remote name [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [08:25:11] (03PS2) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [08:25:25] (03CR) 10Tiziano Fogli: [C:03+2] ircecho: exit upon disconnection [puppet] - 10https://gerrit.wikimedia.org/r/1147766 (https://phabricator.wikimedia.org/T389937) (owner: 10Tiziano Fogli) [08:26:27] (03CR) 10CI reject: [V:04-1] homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [08:27:36] (03PS3) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [08:28:47] (03CR) 10CI reject: [V:04-1] homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [08:29:08] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [08:30:00] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1805, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [08:30:00] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [08:30:18] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.2 refs T392172 [08:30:22] T392172: 1.45.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T392172 [08:31:21] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [08:32:02] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [08:32:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [08:32:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [08:32:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [08:34:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [08:35:19] (03CR) 10Effie Mouzeli: [C:03+1] hieradata: switch mw-debug pinkunicorn to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1137498 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [08:35:19] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply 
[08:36:03] !log restart gnmic in esams - T388641 [08:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:07] T388641: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641 [08:36:44] 06SRE, 10LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10838024 (10SLyngshede-WMF) @MBinder_WMF Can you sign in at https://idp.wikimedia.org ? If so, can you try doing that, then go to https://idm.wikimedia.org/ I do see you're... [08:38:29] !log restart gnmic in eqsin - T388641 [08:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:04] (03PS1) 10Brouberol: mediawiki-dumps-legacy: deploy a sync toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148270 (https://phabricator.wikimedia.org/T393689) [08:40:20] (03CR) 10Effie Mouzeli: [C:03+1] mw-debug: switch mw-debug pinkunicorn to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137499 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [08:41:42] (03CR) 10Brouberol: [C:03+1] Add all WMF domains to the eventgate-analytics-external certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147737 (https://phabricator.wikimedia.org/T391411) (owner: 10Btullis) [08:43:35] (03PS1) 10Muehlenhoff: Add maps-replica2002 as Bookworm maps replica [puppet] - 10https://gerrit.wikimedia.org/r/1148271 (https://phabricator.wikimedia.org/T381565) [08:44:01] (03CR) 10Elukey: [C:03+1] Add maps-replica2002 as Bookworm maps replica [puppet] - 10https://gerrit.wikimedia.org/r/1148271 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:48:48] 06SRE, 10LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10838052 (10SLyngshede-WMF) It might be an issue with your email already being in the system using the new 
"maxbinderwmf" user. I've invalided that email address, let's see if... [08:49:31] !log restart gnmic in codfw - T388641 [08:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:35] T388641: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641 [08:49:46] 06SRE: Avoid using codfw expansion cage for non-IPIP LVS-fronted services - https://phabricator.wikimedia.org/T394286#10838055 (10cmooney) >>! In T394286#10834871, @Jhancock.wm wrote: > I think I'm gonna make a physical list and post it somewhere in the DH5. for my personal reference. I will otherwise forget thi... [08:50:51] (03PS1) 10Jgiannelos: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) [08:51:22] (03CR) 10Alexandros Kosiaris: [C:03+2] calico: Allow to override the MTU via values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145981 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [08:52:01] (03PS2) 10Jgiannelos: pcs: Block RB traffic for all domains [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145828 [08:52:01] (03PS2) 10Jgiannelos: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) [08:52:02] (03PS4) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [08:52:22] (03CR) 10Cathal Mooney: calico: Set veth_mtu to 1480 for staging-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [08:52:59] (03Merged) 10jenkins-bot: calico: Allow to override the MTU via values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145981 (https://phabricator.wikimedia.org/T352956) (owner: 
10Alexandros Kosiaris) [08:53:15] (03CR) 10CI reject: [V:04-1] homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [08:53:52] (03PS4) 10Kevin Bazira: Add vLLM image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) [08:54:00] (03PS3) 10Jgiannelos: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) [08:54:21] (03CR) 10Kevin Bazira: Add vLLM image (035 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [08:54:39] (03PS4) 10Jgiannelos: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) [08:55:33] (03PS3) 10Alexandros Kosiaris: calico: Set veth_mtu to 1480 for staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T342956) [08:56:21] (03PS4) 10Alexandros Kosiaris: calico: Set veth_mtu to 1480 for {,ml}-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T352956) [08:57:27] (03PS1) 10Jelto: gerrit: remove temp firewall rule for hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1148275 (https://phabricator.wikimedia.org/T382309) [08:58:19] (03PS5) 10Alexandros Kosiaris: calico: Set veth_mtu to 1480 for {,ml}-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T342956) [08:58:55] (03PS3) 10Majavah: varnish: Allow customising "contact noc@" error [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) [08:59:19] (03PS4) 10Majavah: varnish: Allow customising "contact noc@" error [puppet] - 
10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) [08:59:29] (03CR) 10Alexandros Kosiaris: calico: Set veth_mtu to 1480 for {,ml}-staging-codfw (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T342956) (owner: 10Alexandros Kosiaris) [09:00:50] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5602/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148275 (https://phabricator.wikimedia.org/T382309) (owner: 10Jelto) [09:01:35] (03CR) 10Elukey: "Great work Kevin! I added a couple of nits but the rest looks really good, I think that we should be ready to go. Lemme know how building/" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [09:01:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148271 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:02:02] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: sync [09:02:36] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [09:02:47] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10838092 (10MatthewVernon) [09:03:13] (03CR) 10Tiziano Fogli: [C:03+1] "Since Filippo is out of office, I'm adding my +1 on behalf of o11y." 
[puppet] - 10https://gerrit.wikimedia.org/r/1148199 (owner: 10Muehlenhoff) [09:03:23] (03PS1) 10Jcrespo: dbbackups: Setup backups for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) [09:04:14] (03PS5) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [09:05:15] (03CR) 10CI reject: [V:04-1] homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [09:05:41] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10838096 (10MatthewVernon) [09:08:46] (03PS5) 10Kevin Bazira: Add vLLM image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) [09:09:13] (03CR) 10Kevin Bazira: Add vLLM image (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [09:10:38] (03CR) 10Elukey: [C:03+1] "Let's wait for Tobias' final sign off, but we should be good for prime time." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [09:10:38] (03PS1) 10Slyngshede: P:ldap::client::ldaptui Add missing aux schemas [puppet] - 10https://gerrit.wikimedia.org/r/1148279 [09:11:54] (03CR) 10Btullis: [V:03+1 C:03+2] Add dse-k8s-worker10[10-11] to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1147887 (https://phabricator.wikimedia.org/T394647) (owner: 10Btullis) [09:12:12] (03PS2) 10Slyngshede: P:ldap::client::ldaptui Add missing aux schemas [puppet] - 10https://gerrit.wikimedia.org/r/1148279 (https://phabricator.wikimedia.org/T394341) [09:12:21] (03PS2) 10Jcrespo: dbbackups: Setup backups for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) [09:12:31] (03CR) 10Volans: "Let me know what do you think." [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [09:13:52] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync [09:13:58] (03PS3) 10Jcrespo: dbbackups: Setup backups for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) [09:14:22] (03PS1) 10MVernon: apus: add apus-be1004 to eqiad cluster as osd server [puppet] - 10https://gerrit.wikimedia.org/r/1148280 (https://phabricator.wikimedia.org/T392844) [09:18:08] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [09:18:22] (03PS1) 10Ayounsi: BFDdown: don't deploy in codfw [alerts] - 10https://gerrit.wikimedia.org/r/1148281 (https://phabricator.wikimedia.org/T364092) [09:18:45] (03CR) 10Jcrespo: [C:03+1] apus: add apus-be1004 to eqiad cluster as osd server [puppet] - 10https://gerrit.wikimedia.org/r/1148280 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon) [09:18:53] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T342956) (owner: 10Alexandros Kosiaris) [09:20:25] (03CR) 10MVernon: [C:03+2] apus: add apus-be1004 to eqiad cluster as osd server [puppet] - 10https://gerrit.wikimedia.org/r/1148280 (https://phabricator.wikimedia.org/T392844) (owner: 10MVernon) [09:23:18] (03CR) 10Tiziano Fogli: [C:03+1] BFDdown: don't deploy in codfw [alerts] - 10https://gerrit.wikimedia.org/r/1148281 (https://phabricator.wikimedia.org/T364092) (owner: 10Ayounsi) [09:23:29] RESOLVED: [2x] KubernetesCalicoDown: dse-k8s-worker1010.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:24:34] (03CR) 10Muehlenhoff: [C:03+2] corto: Make the auto restart follow the ensure for Corto [puppet] - 10https://gerrit.wikimedia.org/r/1148199 (owner: 10Muehlenhoff) [09:24:45] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: deploy a sync toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148270 (https://phabricator.wikimedia.org/T393689) (owner: 10Brouberol) [09:25:23] !log brouberol@cumin2002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1010.eqiad.wmnet [09:25:45] !log brouberol@cumin2002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1011.eqiad.wmnet [09:26:24] (03CR) 10Ayounsi: [C:03+2] BFDdown: don't deploy in codfw [alerts] - 10https://gerrit.wikimedia.org/r/1148281 (https://phabricator.wikimedia.org/T364092) (owner: 10Ayounsi) [09:26:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:27:26] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1145875 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:27:28] (03CR) 10Hnowlan: [C:03+2] rest-gateway: route reading lists API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143127 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [09:28:47] (03CR) 10Ayounsi: [C:03+2] Bump TransitPeering in/out Saturation to critical [alerts] - 10https://gerrit.wikimedia.org/r/1145875 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:28:56] (03Merged) 10jenkins-bot: BFDdown: don't deploy in codfw [alerts] - 10https://gerrit.wikimedia.org/r/1148281 (https://phabricator.wikimedia.org/T364092) (owner: 10Ayounsi) [09:28:56] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: Add ttlsecondsafterfinished to long interval jobs [puppet] - 10https://gerrit.wikimedia.org/r/1147778 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [09:28:57] (03CR) 10Alexandros Kosiaris: [C:03+2] calico: Set veth_mtu to 1480 for {,ml}-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T342956) (owner: 10Alexandros Kosiaris) [09:29:39] (03Merged) 10jenkins-bot: rest-gateway: route reading lists API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143127 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [09:29:46] (03CR) 10Hnowlan: [C:03+1] mw:maintenance: Fix newlines in kubernetes periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1147754 (owner: 10Clément Goubert) [09:30:12] (03Merged) 10jenkins-bot: Bump TransitPeering in/out Saturation to critical [alerts] - 10https://gerrit.wikimedia.org/r/1145875 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:30:15] (03CR) 10Majavah: "Note that after merging this the vm images need to be rebuilt" [puppet] - 10https://gerrit.wikimedia.org/r/1147166 (https://phabricator.wikimedia.org/T394438) (owner: 10Andrew Bogott) [09:30:34] !log brouberol@cumin2002 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1010.eqiad.wmnet [09:30:52] (03PS6) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) [09:31:21] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:31:28] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:31:32] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1011.eqiad.wmnet [09:31:43] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: Add ttlsecondsafterfinished to long interval jobs [puppet] - 10https://gerrit.wikimedia.org/r/1147778 (https://phabricator.wikimedia.org/T394423) (owner: 10Clément Goubert) [09:31:47] (03CR) 10Alexandros Kosiaris: [C:03+2] "Sigh, commit message was never updated and says 1480. I guess that will keep on living forever. 
But if anyone reads this, it's 1460, not 1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T342956) (owner: 10Alexandros Kosiaris) [09:31:49] (03CR) 10Clément Goubert: [C:03+2] mw:maintenance: Fix newlines in kubernetes periodic job [puppet] - 10https://gerrit.wikimedia.org/r/1147754 (owner: 10Clément Goubert) [09:32:04] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:32:17] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:32:25] (03PS2) 10Slyngshede: Netbox: Disable until upgrade [alerts] - 10https://gerrit.wikimedia.org/r/1148197 [09:33:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:33:49] (03CR) 10CI reject: [V:04-1] Netbox: Disable until upgrade [alerts] - 10https://gerrit.wikimedia.org/r/1148197 (owner: 10Slyngshede) [09:33:58] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [09:33:59] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:34:07] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:35:21] (03Merged) 10jenkins-bot: calico: Set veth_mtu to 1480 for {,ml}-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1145982 (https://phabricator.wikimedia.org/T342956) (owner: 10Alexandros Kosiaris) [09:37:51] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[09:38:19] (03PS1) 10Brouberol: Include the hostname in the phabricator message when rebooting a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1148284 (https://phabricator.wikimedia.org/T394647) [09:38:44] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [09:38:46] (03PS4) 10Jcrespo: dbbackups: Setup backups for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) [09:39:09] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:39:59] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [09:41:07] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [09:43:21] (03CR) 10Muehlenhoff: [C:03+2] Add maps-replica2002 as Bookworm maps replica [puppet] - 10https://gerrit.wikimedia.org/r/1148271 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:43:39] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10838224 (10ayounsi) [09:44:57] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync [09:44:59] (03PS1) 10Hnowlan: trafficserver: route testwiki reading lists APIs without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1148285 (https://phabricator.wikimedia.org/T384891) [09:45:00] (03CR) 10CI reject: [V:04-1] Include the hostname in the phabricator message when rebooting a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1148284 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [09:45:31] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/output/1148278/3935/" [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [09:46:14] Hi folks, I'm not familiar with the backporting process, and I am trying to help push out a UBN 
for central-notice, which powers our fundraising banners. Jdlrobson has advised me to submit the patch via https://schedule-deployment.toolforge.org/ to be released as a backport, but when I submit the patch number, I get "Only unmerged changes can be backported.". I'm not sure how to proceed. Do I need to create a new unmerged [09:46:14] patch from the approved and merged fix? Here's the patch chain we'd like to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/1147911/1. Any help would be greatly appreciated! [09:46:30] (03PS1) 10Hashar: wm-zuul-status: reset current checks [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148286 (https://phabricator.wikimedia.org/T394485) [09:47:57] jgleeson: iirc you can't backport the merge commit, you'd have to backport both the merged changes. cc hashar may be able to confirm [09:48:42] hmm [09:48:56] (03CR) 10Hashar: [C:03+2] wm-zuul-status: reset current checks [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148286 (https://phabricator.wikimedia.org/T394485) (owner: 10Hashar) [09:49:09] ah ok thanks claime, I'll try targeting the patch itself and see if that works [09:49:12] jgleeson: try cherry-picking that commit to the wmf/ release branch, i think that should schedule [09:49:27] (03Merged) 10jenkins-bot: wm-zuul-status: reset current checks [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148286 (https://phabricator.wikimedia.org/T394485) (owner: 10Hashar) [09:49:30] also why are we backporting localisation updates? 
[09:49:32] my guess is the schedule deployment tool is confused by CentralNotice using `wmf_deploy` as a branch [09:49:37] or has some limitation [09:50:23] since the change got merged, I guess we can deploy it manually [09:50:41] I am not sure whether the mediawiki/core wmf/ branch properly tracks CentralNotice wmf_deploy branch [09:51:11] !log hashar@deploy1003 Started deploy [gerrit/gerrit@2ecc180]: wm-zuul-status: reset current checks - T394485 [09:51:16] T394485: Incorrect "CI has completed checks" popup appears when navigating from a change with tests in progress to one with no tests in progress - https://phabricator.wikimedia.org/T394485 [09:51:20] hashar: no, CentralNotice has separate wmf/ branches that are branched from the wmf_deploy branch each week [09:51:22] !log hashar@deploy1003 Finished deploy [gerrit/gerrit@2ecc180]: wm-zuul-status: reset current checks - T394485 (duration: 00m 11s) [09:51:23] taavi: ah maybe this is a better merge branch to point to https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/1147043 (I was pointing to the latest on the main deploy branch) [09:51:27] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: deploy a sync toolbox pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148270 (https://phabricator.wikimedia.org/T393689) (owner: 10Brouberol) [09:51:43] jgleeson: so just to confirm, is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/1147839 the actual fix you're trying to ship out? 
[09:51:45] that repo is such an outlier [09:53:01] (03PS3) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) [09:53:07] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [09:53:11] taavi: that one you linked is a prerequisite patch added to get CI to pass, for the actual UBN fix which is merged with this patch (see how it sits on top of the test patch you link) https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/1147043 [09:53:49] we had to add the test fix after the UBN as CI was failing on the wmf/1.45.0-wmf.1 branch [09:54:03] and then Jdlrobson reordered them [09:54:09] ... so that has a commit message saying it's a merge commit but actually it's not a merge commit? [09:54:26] it is a merge commit [09:54:39] and the tooling should be able to deploy it [09:54:58] I just submitted 1147043 and I get the same [09:55:04] I'd suggest sending a change that is merging `wmf_deploy` into `wmf/1.45.0-wmf.2` [09:55:08] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [09:55:12] (03CR) 10Majavah: "@ejegg@ejegg.com: Please do not +2 any changes on MediaWiki wmf/* branches unless you are planning to deploy them yourself immediately aft" [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147839 (https://phabricator.wikimedia.org/T394542) (owner: 10Jgleeson) [09:55:16] and deploy that merge commit [09:55:28] I don't think the tooling will notice, and Gerrit would definitely be able to submit it [09:55:40] (03PS4) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) [09:55:46] (03CR) 10Brouberol: "Should we close this one?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1127916 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [09:56:01] jgleeson: problem is that these patches were +2'd in the wmf/* branch but never deployed, which should never ever happen [09:56:15] (03PS1) 10Ayounsi: InboundInterfaceErrors: disable pint [alerts] - 10https://gerrit.wikimedia.org/r/1148287 (https://phabricator.wikimedia.org/T388641) [09:56:19] jgleeson: which branch are you trying to backport these fixes to? 1.45.0-wmf.1 or 1.45.0-wmf.2? both? [09:57:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:58:03] taavi: lemme check [09:58:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:58:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:58:48] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync [09:58:49] taavi: .1 [09:58:50] https://gerrit.wikimedia.org/r/q/project:mediawiki/extensions/CentralNotice+branch:wmf/1.45.0-wmf.1 [09:59:23] and the changes were needed due to this issue https://phabricator.wikimedia.org/T394542 [09:59:58] hashar: what is the .2 increment for? 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1000) [10:00:09] https://versions.toolforge.org/ [10:00:10] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [10:00:13] (03CR) 10Marostegui: "x3 needs backups?" [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [10:00:16] last week we deployed 1.45.0-wmf.1 [10:00:39] this week we are deploying 1.45.0-wmf.2 which is currently only deployed in the group0 wikis [10:01:00] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 766, active_shards: 1805, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [10:01:00] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:01:34] jgleeson: ok, if we deploy that now, are you able to test the changes on a debug server? 
(https://wikitech.wikimedia.org/wiki/WikimediaDebug) [10:01:38] so production currently runs TWO branches in parallel [10:02:49] (03PS1) 10Elukey: admin_ng: bump kartotherian's resourcequota to unblock deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148290 [10:03:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [10:03:15] yes taavi I can download that debug extension and try it out [10:03:29] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:03:41] ah ok thanks hashar for explaining [10:04:10] ok, then let's do that now [10:04:23] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/1148287 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:04:42] (03CR) 10Brouberol: [C:03+1] admin_ng: bump kartotherian's resourcequota to unblock deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148290 (owner: 10Elukey) [10:04:58] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1147043|Merge branch 'master' into wmf_deploy]] [10:06:17] (03CR) 10Jcrespo: "Please talk to @Ladsgroup, he was the one to request those and that had to be setup by yesterday." [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [10:06:34] taavi: I'm guessing I won't be able to log into central notice admin on the debug servers with the same details I use for https://meta.wikimedia.org/. [10:06:48] I guess I can try! [10:07:03] I think I need to be logged in to preview banners (where the bug currently exists) [10:07:27] jgleeson: you will. 
basically how it works is that it pushes the new code to a set of separate servers (or kubernetes pods these days) that are otherwise running the exact same config as the normal traffic-serving ones [10:07:44] oh great! [10:07:57] so once scap does its magic in a few minutes (i'll ping you then), you'll be able to turn the extension on and use meta as usual but with the fixes applied [10:08:29] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:54] thanks so much taavi for all the help! [10:08:58] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [10:10:52] (03CR) 10Jcrespo: [C:03+1] Add mysql grants for cumin1003 [puppet] - 10https://gerrit.wikimedia.org/r/1145043 (https://phabricator.wikimedia.org/T393990) (owner: 10Muehlenhoff) [10:11:00] (03CR) 10Elukey: [C:03+2] admin_ng: bump kartotherian's resourcequota to unblock deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148290 (owner: 10Elukey) [10:11:38] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'. [10:11:45] !log taavi@deploy1003 gjg, taavi: Backport for [[gerrit:1147043|Merge branch 'master' into wmf_deploy]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:11:59] jgleeson: you can test now! so open meta, turn the extension on from the toolbar icon (you should see it say "k8s-mwdebug" in the dropdown), and do a hard refresh to get the new javascript [10:12:31] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'. 
[10:12:47] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for db1247.eqiad.wmnet [10:12:48] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1247.eqiad.wmnet [10:13:34] I'll do that now. thanks! [10:13:40] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'. [10:13:50] and let me know here once you're done testing [10:13:53] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [10:14:38] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync [10:17:37] jgleeson: how is it going? [10:17:52] taavi: it's working! I see banners [10:17:58] the fix looks good [10:18:16] !log taavi@deploy1003 gjg, taavi: Continuing with sync [10:18:27] thanks! continuing to roll out the fix everywhere [10:20:10] really appreciate all the help taavi, hashar and Dreamy_Jazz [10:20:18] pcoombe: ^ [10:20:56] thanks everyone! [10:22:24] (03PS1) 10Federico Ceratto: db1247.yaml: Enabling notifications after cloning [puppet] - 10https://gerrit.wikimedia.org/r/1148291 (https://phabricator.wikimedia.org/T393612) [10:22:24] (03CR) 10Federico Ceratto: "1-line change to enable notifications for db1247" [puppet] - 10https://gerrit.wikimedia.org/r/1148291 (https://phabricator.wikimedia.org/T393612) (owner: 10Federico Ceratto) [10:23:41] (03CR) 10Ladsgroup: [C:03+1] db1247.yaml: Enabling notifications after cloning [puppet] - 10https://gerrit.wikimedia.org/r/1148291 (https://phabricator.wikimedia.org/T393612) (owner: 10Federico Ceratto) [10:24:22] (03PS1) 10Clément Goubert: mediawiki: Fix ttlSecondsAfterFinished comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148293 [10:25:14] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [10:25:15] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1147043|Merge branch 'master' into wmf_deploy]] (duration: 20m 17s) [10:26:48] jgleeson: fix deployed everywhere! 
[10:28:34] ty! [10:28:55] (03CR) 10Clément Goubert: mediawiki: Allow varying the entrypoint through mwscript values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [10:29:36] (03CR) 10Clément Goubert: [C:03+1] hieradata: add usernames for mw-expermental [puppet] - 10https://gerrit.wikimedia.org/r/1147782 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:30:23] (03PS2) 10Effie Mouzeli: admin_ng: add mw-experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) [10:30:31] (03CR) 10Clément Goubert: [C:03+1] admin_ng: add mw-experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:33:15] (03PS1) 10MVernon: cephadm: handle storage servers with BOSS card [puppet] - 10https://gerrit.wikimedia.org/r/1148296 (https://phabricator.wikimedia.org/T391354) [10:33:54] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148296 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [10:35:54] (03PS1) 10FNegri: clouddb: upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148297 (https://phabricator.wikimedia.org/T394372) [10:35:59] (03CR) 10Elukey: "As discussed on IRC, this change seems to require a lot of capacity while building, something that our infra doesn't support ATM. 
We shoul" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [10:36:16] (03PS4) 10Vgutierrez: liberica: Add katran config settings [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) [10:36:25] (03CR) 10Vgutierrez: liberica: Add katran config settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [10:36:29] (03CR) 10FNegri: [C:04-1] "Preparing the patch, but not ready to merge it yet." [puppet] - 10https://gerrit.wikimedia.org/r/1148297 (https://phabricator.wikimedia.org/T394372) (owner: 10FNegri) [10:37:05] (03CR) 10CI reject: [V:04-1] admin_ng: add mw-experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:37:32] (03CR) 10Elukey: [C:03+1] git::clone: fix support for different remote name [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [10:37:55] jouncebot: nowandnext [10:37:55] For the next 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1000) [10:37:55] In 1 hour(s) and 22 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1200) [10:40:36] (03PS1) 10SimmeD: Allow itwiki bureaucrat to remove sysop permission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148299 [10:42:42] (03PS3) 10Effie Mouzeli: admin_ng: add mw-experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) [10:43:06] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. 
One thing of note: The homer class currently passes "remote_name" ('peer') to git::clone, so if the argument is now implemente" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [10:43:29] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [10:46:29] (03CR) 10Ladsgroup: "I do think x3 needs backups since running the re-generation script can take a really long time but as I said a couple times, it's not high" [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [10:47:22] (03CR) 10Btullis: [C:03+1] airflow: Monitor empty dag bags [alerts] - 10https://gerrit.wikimedia.org/r/1148198 (https://phabricator.wikimedia.org/T394459) (owner: 10Brouberol) [10:48:19] (03CR) 10Dr0ptp4kt: [C:03+1] ext-EventStreamConfig: Update product_metrics.web_base stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147796 (https://phabricator.wikimedia.org/T394457) (owner: 10Phuedx) [10:48:58] (03CR) 10CI reject: [V:04-1] admin_ng: add mw-experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:49:31] (03CR) 10Btullis: [C:03+2] Add all WMF domains to the eventgate-analytics-external certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147737 (https://phabricator.wikimedia.org/T391411) (owner: 10Btullis) [10:49:56] (03PS1) 10Effie Mouzeli: kubernetes::deployment_server: add new mw-experimental release [puppet] - 10https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994) [10:51:23] 
(03Merged) 10jenkins-bot: Add all WMF domains to the eventgate-analytics-external certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147737 (https://phabricator.wikimedia.org/T391411) (owner: 10Btullis) [10:55:57] (03PS4) 10Effie Mouzeli: admin_ng: add mw-experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) [10:57:12] !log btullis@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [10:57:15] !log btullis@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [10:59:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db2167 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76333 and previous config saved to /var/cache/conftool/dbconfig/20250520-105937-ladsgroup.json [10:59:41] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [11:01:30] (03PS2) 10SimmeD: Allow itwiki bureaucrat to remove sysop permission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148299 (https://phabricator.wikimedia.org/T394752) [11:01:47] (03CR) 10Vgutierrez: "ping?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh) [11:01:54] !log btullis@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync [11:02:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1258 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76334 and previous config saved to /var/cache/conftool/dbconfig/20250520-110214-ladsgroup.json [11:02:38] !log btullis@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync [11:02:49] !log btullis@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync [11:02:51] (03CR) 10Clément Goubert: [C:03+1] admin_ng: add mw-experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:03:09] !log btullis@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync [11:03:58] !log btullis@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [11:04:00] !log btullis@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [11:04:41] !log btullis@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync [11:04:47] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! Applies cleanly; took a quick look at the diff after merging and looks good." 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1143598 (owner: 10Pppery) [11:05:16] !log btullis@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync [11:05:24] jouncebot: nowandnext [11:05:25] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [11:05:25] In 0 hour(s) and 54 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1200) [11:05:41] (03CR) 10Ladsgroup: [C:03+2] OATHAuth: Mark checkuser and suppress as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [11:06:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [11:06:31] (03Merged) 10jenkins-bot: OATHAuth: Mark checkuser and suppress as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: 10SBassett) [11:06:38] (03CR) 10Vgutierrez: cdn: Fix "reason" variable reference (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 (owner: 10BCornwall) [11:06:54] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1133245|OATHAuth: Mark checkuser and suppress as requiring 2FA (T150898 T389727)]] [11:06:58] T150898: Force OATHAuth (2FA) for certain user groups in Wikimedia production - https://phabricator.wikimedia.org/T150898 [11:07:44] !log btullis@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [11:07:47] !log btullis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [11:08:15] !log btullis@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync [11:08:40] !log upgrading cassandra-dev to latest Java 11 
security updates [11:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:00] !log btullis@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync [11:09:19] (03CR) 10Kevin Bazira: "SGTM, sharing the grafana dashboard here that shows the resources required during the build process: https://grafana.wikimedia.org/goto/M3" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1146891 (https://phabricator.wikimedia.org/T385173) (owner: 10Kevin Bazira) [11:12:20] (03CR) 10Vgutierrez: trafficserver: Send /evt-103e/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [11:12:55] !log ladsgroup@deploy1003 ladsgroup, sbassett: Backport for [[gerrit:1133245|OATHAuth: Mark checkuser and suppress as requiring 2FA (T150898 T389727)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:13:00] T150898: Force OATHAuth (2FA) for certain user groups in Wikimedia production - https://phabricator.wikimedia.org/T150898 [11:13:41] !log ladsgroup@deploy1003 ladsgroup, sbassett: Continuing with sync [11:14:02] Amir1: ping me when you're done pls? [11:14:20] sure! 
[11:15:21] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1148284 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [11:15:35] (03CR) 10Federico Ceratto: [C:03+2] db1247.yaml: Enabling notifications after cloning [puppet] - 10https://gerrit.wikimedia.org/r/1148291 (https://phabricator.wikimedia.org/T393612) (owner: 10Federico Ceratto) [11:16:17] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:17:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [11:17:23] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5603/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [11:17:29] (03PS5) 10Jgiannelos: pcs-rb-sunset: Disable changeprop rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) [11:20:52] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133245|OATHAuth: Mark checkuser and suppress as requiring 2FA (T150898 T389727)]] (duration: 13m 57s) [11:20:59] T150898: Force OATHAuth (2FA) for certain user groups in Wikimedia production - https://phabricator.wikimedia.org/T150898 [11:21:09] (03CR) 10Effie Mouzeli: [C:03+1] mediawiki: Fix ttlSecondsAfterFinished comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148293 (owner: 10Clément Goubert) [11:21:44] claime: I'm done! 
[11:21:49] Amir1: cool thanks [11:22:35] !log installing openjdk-11 security updates [11:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:58] !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site codfw [reason: being cautious during maintenance on codfw CRs, T393552] [11:23:01] T393552: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552 [11:23:20] !log depool codfw in dns T393552 [11:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:08] (03PS1) 10Volans: setup.py: pin prospector for now [cookbooks] - 10https://gerrit.wikimedia.org/r/1148306 [11:24:11] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Fix ttlSecondsAfterFinished comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148293 (owner: 10Clément Goubert) [11:24:45] (03PS2) 10Brouberol: Include the hostname in the phabricator message when rebooting a host [cookbooks] - 10https://gerrit.wikimedia.org/r/1148284 (https://phabricator.wikimedia.org/T394647) [11:25:03] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1148284 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [11:25:23] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site codfw [reason: being cautious during maintenance on codfw CRs, T393552] [11:25:56] (03CR) 10Bunnypranav: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148299 (https://phabricator.wikimedia.org/T394752) (owner: 10SimmeD) [11:26:22] (03CR) 10Jgiannelos: [C:04-1] "Blocking until 1 week from last rollout has passed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148273 (https://phabricator.wikimedia.org/T264670) (owner: 10Jgiannelos) [11:26:51] (03Merged) 10jenkins-bot: mediawiki: Fix ttlSecondsAfterFinished comment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148293 (owner: 10Clément Goubert) [11:28:29] FIRING: [2x] 
SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:29:43] !log cgoubert@deploy1003 Started scap sync-world: mwscript-mwcron: Add some logging [11:30:01] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [11:32:16] !log cgoubert@deploy1003 Finished scap sync-world: mwscript-mwcron: Add some logging (duration: 02m 32s) [11:33:14] !log taavi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:33:29] !log taavi@cumin1002 START - Cookbook sre.dns.netbox [11:36:28] (03CR) 10Bunnypranav: [C:03+1] Allow itwiki bureaucrat to remove sysop permission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148299 (https://phabricator.wikimedia.org/T394752) (owner: 10SimmeD) [11:37:30] !log cmooney@cumin1003 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on 10 hosts with reason: upgrade cr1-codfw JunOS [11:37:42] !log taavi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate wiki replica VIPs for x3 - taavi@cumin1002" [11:37:48] !log taavi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: allocate wiki replica VIPs for x3 - taavi@cumin1002" [11:37:48] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:37:54] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:39:24] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:44] !log 
cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 10 hosts with reason: upgrade cr1-codfw JunOS [11:39:49] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10838631 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8c92db5f-18b6-481b-8642-01c1d92b5cb0) set by cmooney@cumin1003 for 2:00:00 on 10 host(s) and their servi... [11:41:27] !log drain transport circuits landing on cr1-codfw of traffic before router upgrade (T364092) [11:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:31] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [11:41:45] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [11:42:12] (03CR) 10Volans: [C:03+2] setup.py: pin prospector for now [cookbooks] - 10https://gerrit.wikimedia.org/r/1148306 (owner: 10Volans) [11:42:33] (03PS1) 10Majavah: hieradata: cloudlb: Start announcing x3 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/1148310 (https://phabricator.wikimedia.org/T390954) [11:42:34] (03PS1) 10Majavah: cloudlb: Support multiple wiki replica addresses per section [puppet] - 10https://gerrit.wikimedia.org/r/1148311 (https://phabricator.wikimedia.org/T390954) [11:42:35] (03PS1) 10Majavah: hieradata: cloudlb: Listen on s8 on the x3 VIP [puppet] - 10https://gerrit.wikimedia.org/r/1148312 (https://phabricator.wikimedia.org/T390954) [11:42:37] (03PS1) 10Majavah: hieradata: Update wiki replicas x3 DNS records to new VIP [puppet] - 10https://gerrit.wikimedia.org/r/1148313 (https://phabricator.wikimedia.org/T390954) [11:44:02] (03PS2) 10Majavah: openstack: wikireplica_dns: Point x3 records to new VIP [puppet] - 10https://gerrit.wikimedia.org/r/1148313 (https://phabricator.wikimedia.org/T390954) [11:44:55] (03CR) 10Volans: "Yes but that's already the case:" [puppet] - 
10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [11:45:32] (03CR) 10Volans: "I'll try to check other usages of remote_name" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [11:47:01] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [11:48:42] (03Merged) 10jenkins-bot: setup.py: pin prospector for now [cookbooks] - 10https://gerrit.wikimedia.org/r/1148306 (owner: 10Volans) [11:49:01] (03CR) 10Clément Goubert: kubernetes::deployment_server: add new mw-experimental release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:49:03] !log apply bgp "graceful shutdown" community on cr1-codfw ahead of JunOS upgrade (T364092) [11:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:09] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [11:50:19] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1148311 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [11:52:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqsin and cr1-codfw (208.80.153.192) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr2-eqsin:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr1-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:53:30] (03CR) 10Kgraessle: [C:03+1] ores-extension: enable ores extention for rrla without the UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [11:53:31] (03CR) 
10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5605/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148312 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [11:53:58] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [11:54:00] (03CR) 10Marostegui: "There is a big difference in being "had to be set up by yesterday" and "it is not high priority"." [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [11:54:05] (03CR) 10Marostegui: [C:03+1] dbbackups: Setup backups for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [11:56:26] jouncebot: nowandnext [11:56:26] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [11:56:26] In 0 hour(s) and 3 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1200) [11:57:34] (03PS1) 10Máté Szabó: DeduplicateStyles: Only transform possible style nodes [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148315 (https://phabricator.wikimedia.org/T394059) [11:57:39] (03CR) 10Ayounsi: [C:03+2] InboundInterfaceErrors: disable pint [alerts] - 10https://gerrit.wikimedia.org/r/1148287 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:59:10] (03CR) 10Joal: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1147026 (https://phabricator.wikimedia.org/T313877) (owner: 10Eevans) [11:59:54] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [11:59:55] claime: is it OK for me to deploy a patch or are you doing some relevant work atm? 
[12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1200) [12:01:30] (03Merged) 10jenkins-bot: InboundInterfaceErrors: disable pint [alerts] - 10https://gerrit.wikimedia.org/r/1148287 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:03:27] (03CR) 10Ladsgroup: [C:03+1] dbbackups: Setup backups for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [12:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:14] (03PS3) 10Slyngshede: Netbox: Disable until upgrade [alerts] - 10https://gerrit.wikimedia.org/r/1148197 [12:05:57] !log disable routing-engine sync / graceful-switchover on cr1-codfw ahead of JunOS upgrade on RE1 T364092 [12:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:01] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:07:35] (03CR) 10Slyngshede: "I couldn't find a good way to simply disable the check. We can get the files back out of Gerrit when ready." [alerts] - 10https://gerrit.wikimedia.org/r/1148197 (owner: 10Slyngshede) [12:08:13] !log jmm@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [12:08:21] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1247* gradually with 4 steps - Pooling in after cloning [12:09:24] mszabo: nah you're good [12:09:30] (03CR) 10Volans: "Maybe due to the rebase some conversion of the string to integer seems to have got lost. 
LGTM, just a suggestion to improve the matched me" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [12:11:03] (03CR) 10Muehlenhoff: [C:03+2] Add mysql grants for cumin1003 [puppet] - 10https://gerrit.wikimedia.org/r/1145043 (https://phabricator.wikimedia.org/T393990) (owner: 10Muehlenhoff) [12:12:11] !log creating existencelinks on all wikis (T394617) [12:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:15] T394617: Create existencelinks table in production - https://phabricator.wikimedia.org/T394617 [12:12:53] (03CR) 10Jelto: site: add zuul VMs with collab-insetup-role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147855 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [12:12:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148315 (https://phabricator.wikimedia.org/T394059) (owner: 10Máté Szabó) [12:13:00] thanks! [12:15:35] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 10Release-Engineering-Team (Radar): codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10838820 (10Jelto) Thanks @Dzahn for picking this up. I left a comment in [operations/puppet/+/1147855](https://gerri... 
[12:17:03] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [12:17:44] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1148315|DeduplicateStyles: Only transform possible style nodes (T394059)]] [12:17:48] (03CR) 10Ayounsi: New device additions for codfw expansion plus policy changes (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [12:17:48] T394059: Post-cache output transforms are expensive on large pages - https://phabricator.wikimedia.org/T394059 [12:18:08] (03Merged) 10jenkins-bot: DeduplicateStyles: Only transform possible style nodes [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148315 (https://phabricator.wikimedia.org/T394059) (owner: 10Máté Szabó) [12:18:12] (03CR) 10Vgutierrez: [C:03+2] trafficserver: Send /evt-103e/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1143483 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [12:19:15] (03CR) 10Ayounsi: New device additions for codfw expansion plus policy changes (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [12:19:24] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:05] (03PS2) 10Anzx: IP cap lift request for Leeds University 21 May [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148318 (https://phabricator.wikimedia.org/T394639) [12:21:52] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10838856 (10ayounsi) [12:21:55] (03CR) 10ScheduleDeploymentBot: "Scheduled 
for deployment in the [Tuesday, May 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148318 (https://phabricator.wikimedia.org/T394639) (owner: 10Anzx) [12:22:48] (03CR) 10Cathal Mooney: New device additions for codfw expansion plus policy changes (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [12:23:23] !log rebooting backup routing-engine RE1 on cr1-codfw to install JunOS upgrade (T364092) [12:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:28] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:24:02] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1148315|DeduplicateStyles: Only transform possible style nodes (T394059)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:24:08] T394059: Post-cache output transforms are expensive on large pages - https://phabricator.wikimedia.org/T394059 [12:24:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 15.61% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:26:22] gmmm [12:28:19] !log mszabo@deploy1003 mszabo: Continuing with sync [12:29:08] (03CR) 10Brouberol: [C:03+2] airflow: Monitor empty dag bags [alerts] - 10https://gerrit.wikimedia.org/r/1148198 (https://phabricator.wikimedia.org/T394459) (owner: 10Brouberol) [12:29:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 15.77% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:29:27] (03CR) 10Brouberol: [C:03+2] "Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1148284 (https://phabricator.wikimedia.org/T394647) (owner: 10Brouberol) [12:29:51] !log installing systemd bugfix updates from Bookworm point release [12:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:19] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10838876 (10Jelto) We’ve decided to start by switching `s3://gitlab-artifacts` to object storage. Since artifacts are currently not... [12:30:39] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10838888 (10Jelto) [12:30:47] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148310 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:31:57] (03CR) 10FNegri: [C:03+1] cloudlb: Support multiple wiki replica addresses per section [puppet] - 10https://gerrit.wikimedia.org/r/1148311 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:32:13] !log switching active routing-engine to RE1 on cr1-codfw (this will cause protocol adjacencies to flap) (T364092) [12:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:17] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:32:26] (03CR) 10Majavah: [C:03+2] hieradata: cloudlb: Start announcing x3 VIPs [puppet] - 10https://gerrit.wikimedia.org/r/1148310 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:32:43] (03CR) 10FNegri: [C:03+1] hieradata: cloudlb: Listen on s8 on the x3 VIP [puppet] - 10https://gerrit.wikimedia.org/r/1148312 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:34:44] (03CR) 10FNegri: openstack: wikireplica_dns: Point x3 records to new VIP (031 comment) [puppet] 
- 10https://gerrit.wikimedia.org/r/1148313 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:35:08] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 111, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:35:26] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148315|DeduplicateStyles: Only transform possible style nodes (T394059)]] (duration: 17m 42s) [12:35:26] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:35:31] T394059: Post-cache output transforms are expensive on large pages - https://phabricator.wikimedia.org/T394059 [12:35:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:2 (Core: cr1-codfw:xe-1/0/1:1 {#10695_12249-3}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:36:19] FIRING: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr1-codfw (10.192.254.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [12:37:15] (03CR) 10Majavah: openstack: wikireplica_dns: Point x3 records to new VIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148313 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:37:39] FIRING: [6x] CoreBGPDown: Core BGP session down between cr2-eqsin and cr1-codfw (208.80.153.192) - group Confed_codfw - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:37:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:38:27] (03CR) 10Majavah: [V:03+1 C:03+2] cloudlb: Support multiple wiki replica addresses per section [puppet] - 10https://gerrit.wikimedia.org/r/1148311 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:38:35] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: cloudlb: Listen on s8 on the x3 VIP [puppet] - 10https://gerrit.wikimedia.org/r/1148312 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:38:56] (03PS2) 10Majavah: hieradata: cloudlb: Listen on s8 on the x3 VIP [puppet] - 10https://gerrit.wikimedia.org/r/1148312 (https://phabricator.wikimedia.org/T390954) [12:39:04] (03PS1) 10Brouberol: mediawiki-dumps-legacy: allow role binding multiple service accounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148321 [12:39:24] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:40:01] (03CR) 10Btullis: [C:03+1] mediawiki-dumps-legacy: allow role binding multiple service accounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148321 (owner: 10Brouberol) [12:40:36] (03CR) 10Majavah: [C:03+2] hieradata: cloudlb: Listen on s8 on the x3 VIP [puppet] - 10https://gerrit.wikimedia.org/r/1148312 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:41:05] (03PS1) 10Abban Dunne: Add WMDE Fundraising banner event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148322 [12:41:08] RECOVERY - Router interfaces on cr2-codfw is OK: OK: 
host 208.80.153.193, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:41:53] 06SRE, 10Observability-Metrics: Rework the Pyrra list dashboard - https://phabricator.wikimedia.org/T394415#10838931 (10herron) Thanks, this makes a lot of sense. I've saved a version of the dashboard that pairs way down on headings. FWIW Cluster I think can stay since some SLOs use it e.g. haproxy. As-is t... [12:42:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr2-codfw and cr1-codfw (208.80.153.192) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:42:53] fceratto@cumin1002 pool (PID 3834084) is awaiting input [12:43:42] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 3856 [12:44:24] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:29] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 3856 [12:45:51] RESOLVED: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:2 (Core: cr1-codfw:xe-1/0/1:1 {#10695_12249-3}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:46:08] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: allow role binding multiple service accounts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148321 (owner: 10Brouberol) [12:46:19] FIRING: [4x] CloudCoreBGPDown: Cloud 
(WMCS) BGP session down between cloudsw1-b1-codfw and cr1-codfw (10.192.254.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [12:46:32] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 42 [12:47:39] FIRING: [8x] CoreBGPDown: Core BGP session down between cr2-codfw and cr1-codfw (208.80.153.192) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:47:40] jouncebot: nowandnext [12:47:40] For the next 0 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1200) [12:47:40] In 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1300) [12:48:06] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: sync [12:48:38] (03CR) 10Jcrespo: "I was asked yesterday by Amir to do this "now". I understood that as "max prio". I pushed back and said I would do it "first time I can"." 
[puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [12:48:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 42 [12:48:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [12:49:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [12:49:19] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [12:50:48] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:51:46] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [12:51:52] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:51:54] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 3856 [12:51:55] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'configure' for AS: 3856 [12:52:31] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on cr2-codfw with reason: upgrade cr1-codfw JunOS [12:52:37] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10838979 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f40f3f46-731d-46ef-9db5-647d735907d6) set by cmooney@cumin1003 for 3:00:00 on 1 host(s) and their servic... 
[12:53:01] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [12:54:28] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:55:07] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 22616 [12:55:48] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:56:19] RESOLVED: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr1-codfw (10.192.254.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [12:56:52] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:57:10] !log rebooting backup routing-engine RE0 on cr1-codfw to install JunOS upgrade (T364092) [12:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:14] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:58:40] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 22616 [12:59:05] !log update dns-root-data to 2024071801~deb12u1 on dns7001 [12:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:47] (03PS7) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1300). [13:00:04] HouseOfM, phuedx, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147139 (https://phabricator.wikimedia.org/T394603) (owner: 10Bunnypranav) [13:00:48] (03PS1) 10Alexandros Kosiaris: staging-codfw: Specify MTU of 1460 [puppet] - 10https://gerrit.wikimedia.org/r/1148326 (https://phabricator.wikimedia.org/T352956) [13:00:54] o/ [13:01:32] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on P{dns7001*} and A:dnsbox [13:02:32] (03PS1) 10Muehlenhoff: Add two more maps-replicas on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1148327 (https://phabricator.wikimedia.org/T381565) [13:03:16] Hi deployers, bit late on the scheduling, but it is just a namespace-config change, so hopefully someone can be around to deploy it. 
[13:03:21] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on P{dns7001*} and A:dnsbox [13:03:37] (03PS8) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) [13:03:48] (03CR) 10Elukey: icinga: skip downtimed services in wait_for_optimal if needed (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [13:04:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1144526 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [13:05:00] !log switching active routing-engine to RE0 on cr1-codfw (this will cause protocol adjacencies to flap) (T364092) [13:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:04] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [13:05:56] (03PS1) 10Brouberol: airflow: display an info message explaining how to destroyt a devenv [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148329 (https://phabricator.wikimedia.org/T393998) [13:08:28] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:08:32] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148326 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [13:08:49] (03PS1) 10MVernon: thanos: remove old frontends thanos-fe100[1-3] [puppet] - 
10https://gerrit.wikimedia.org/r/1148330 (https://phabricator.wikimedia.org/T391352) [13:09:19] FIRING: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr1-codfw (10.192.254.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [13:09:52] (03CR) 10Jcrespo: [C:03+1] "I validated that the logic implements exactly what's indented and double checked the syntax of both puppet and ruby templating." [puppet] - 10https://gerrit.wikimedia.org/r/1148296 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [13:10:22] fceratto@cumin1002 pool (PID 3834084) is awaiting input [13:11:28] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:11:29] (03CR) 10Jcrespo: [C:03+1] "intended, not indented. Sorry." [puppet] - 10https://gerrit.wikimedia.org/r/1148296 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [13:11:35] (03CR) 10MVernon: [C:03+2] cephadm: handle storage servers with BOSS card [puppet] - 10https://gerrit.wikimedia.org/r/1148296 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [13:12:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-codfw (208.80.153.192) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:13:05] !log Restarted release Jenkins on releases1003 [13:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:45] jouncebot: now [13:13:45] For the next 0 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1300) [13:14:14] agh look like nobody is around to deploy? 
[13:14:19] RESOLVED: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr1-codfw (10.192.254.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [13:15:02] !log re-enable graceful switchover on cr1-codfw (T364092) [13:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:05] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [13:15:29] I'll do them [13:15:36] anzx: bunnypranav: I will do your changes [13:15:39] hashar: ok [13:15:42] Thanks! [13:16:11] (03CR) 10Fabfur: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [13:16:20] (03PS2) 10Brouberol: airflow: display an info message explaining how to destroy a devenv [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148329 (https://phabricator.wikimedia.org/T393998) [13:16:40] (03CR) 10Jcrespo: [C:03+1] "This looks fine, but check my question on IRC about separating the master migration from the removal.
In the past we had problems with mon" [puppet] - 10https://gerrit.wikimedia.org/r/1148330 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [13:17:17] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1082 to cirrussearch1082 [13:17:24] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:17:30] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:17:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-codfw (208.80.153.192) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:18:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148318 (https://phabricator.wikimedia.org/T394639) (owner: 10Anzx) [13:18:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147139 (https://phabricator.wikimedia.org/T394603) (owner: 10Bunnypranav) [13:18:39] bunnypranav: and from what you get, changing the namespaces would need some command to be run? 
[13:18:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146985 (https://phabricator.wikimedia.org/T394505) (owner: 10ZhaoFJx) [13:19:12] hashar: namespacedupes.php [13:19:21] ah yeah good old one script :) [13:19:26] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:20:04] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10839156 (10Volans) p:05Triage→03Medium [13:20:16] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:20:27] Thanks anzx and hashar. This is second ever backport, so a bit new to this process. 
[13:20:34] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1083 to cirrussearch1083 [13:20:45] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1082 to cirrussearch1082 - bking@cumin2002" [13:20:48] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:20:50] we all once had a second backport :] [13:20:55] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5606/console" [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [13:21:29] (03Merged) 10jenkins-bot: IP cap lift request for Leeds University 21 May [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148318 (https://phabricator.wikimedia.org/T394639) (owner: 10Anzx) [13:21:32] (03Merged) 10jenkins-bot: core-Namespaces: Update Malay wiki (mswiki) namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147139 (https://phabricator.wikimedia.org/T394603) (owner: 10Bunnypranav) [13:21:35] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10839160 (10brouberol) a:05Jclark-ctr→03brouberol [13:21:36] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10839163 (10brouberol) [13:21:37] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10839164 
(10brouberol) 05Resolved→03In progress [13:21:44] (03CR) 10Muehlenhoff: [C:03+2] sshd: Remove dead template argument [puppet] - 10https://gerrit.wikimedia.org/r/1146968 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:21:50] (03CR) 10Vgutierrez: [V:03+1 C:03+2] liberica: Add katran config settings [puppet] - 10https://gerrit.wikimedia.org/r/1113961 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [13:21:50] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:21:54] federico3: ^ [13:21:55] !log hashar@deploy1003 Started scap sync-world: Backport for [[gerrit:1148318|IP cap lift request for Leeds University 21 May (T394639)]], [[gerrit:1147139|core-Namespaces: Update Malay wiki (mswiki) namespaces (T394603)]] [13:21:59] T394639: Temporary IP lift request for Leeds University Wednesday 21 May 1130-1630 UTC - https://phabricator.wikimedia.org/T394639 [13:22:00] T394603: Configure the namespaces on Malay Wikipedia - https://phabricator.wikimedia.org/T394603 [13:22:05] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1247* gradually with 4 steps - Pooling in after cloning [13:22:26] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:22:28] fixnd now [13:22:33] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:22:39] RESOLVED: [7x] CoreBGPDown: Core BGP session down between cr2-eqiad and cr1-codfw (208.80.153.192) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:22:44] thanks, federico3 [13:22:54] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1082 to cirrussearch1082 - bking@cumin2002" [13:22:55] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:22:55] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1082 on all recursors [13:22:58] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1082 on all recursors [13:22:59] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1082 [13:23:01] hashar: Quick question, how do we see if these config changes are in effect? [13:23:18] by browsing the wikis [13:23:19] I guess [13:23:26] Tried ctrl+shift+r, but doesn't seem to reflect [13:23:31] the change are still in the process of being deployed [13:23:41] Ah, understood. [13:23:43] (03CR) 10Elukey: [C:03+1] Add two more maps-replicas on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1148327 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:24:00] the system will freeze at some point, it is first deployed on a system that is used to validate/debug [13:24:14] https://wikitech.wikimedia.org/wiki/WikimediaDebug [13:24:34] it is a web browser extension which lets one enable the debug mode [13:25:06] under the hood, once the extension is enabled and turned on, it would add a `X-Wikimedia-Debug: 1` header to all requests [13:25:11] and they will be served by the debug hosts/pod [13:25:15] which have the new code already [13:25:23] that let us validate a change is behaving as expected [13:25:35] and if happy, the code is then deployed to the rest of the servers [13:25:48] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:25:54] yeah, I do have that installed, trying to figure out how it works. 
[13:26:21] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1083 to cirrussearch1083 - bking@cumin2002" [13:26:33] (03CR) 10Volans: [C:03+1] "Perfect! Thanks a lot!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [13:26:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:26:43] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1082 [13:26:50] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [13:26:53] when browsing our wikis, there should be a little icon next to the url bar [13:26:53] https://wikitech.wikimedia.org/wiki/WikimediaDebug#/media/File:WikimediaDebug_v2_on.png [13:26:56] (03CR) 10Ssingh: "Thanks for the work on this folks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [13:27:10] greyed out, and once clicked that gives an option to turn it on [13:27:23] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1082 to cirrussearch1082 [13:27:32] Once I turned it on, does the change happen immediately? 
[13:27:40] you then refresh the page [13:27:41] I mean, no need to refresh right [13:27:51] and requests from now on will have the magic `X-Wikimedia-Debug` header injected [13:27:57] !log hashar@deploy1003 hashar, bunnypranav, anzx: Backport for [[gerrit:1148318|IP cap lift request for Leeds University 21 May (T394639)]], [[gerrit:1147139|core-Namespaces: Update Malay wiki (mswiki) namespaces (T394603)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:27:58] but the change has not been deployed ot those hosts yet [13:28:04] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1083 to cirrussearch1083 - bking@cumin2002" [13:28:05] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:28:05] T394639: Temporary IP lift request for Leeds University Wednesday 21 May 1130-1630 UTC - https://phabricator.wikimedia.org/T394639 [13:28:05] T394603: Configure the namespaces on Malay Wikipedia - https://phabricator.wikimedia.org/T394603 [13:28:05] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1083 on all recursors [13:28:08] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1083 on all recursors [13:28:09] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1083 [13:28:37] hashar: nothing to check on throttle patch [13:28:47] anzx: +1 :) [13:29:03] (03PS1) 10Brouberol: kafka-jumbio: provision broker 1016 [puppet] - 10https://gerrit.wikimedia.org/r/1148333 (https://phabricator.wikimedia.org/T377874) [13:29:07] (03PS1) 10Brouberol: kafka-jumbio: provision broker 1017 [puppet] - 10https://gerrit.wikimedia.org/r/1148334 (https://phabricator.wikimedia.org/T377874) [13:29:11] (03PS1) 10Brouberol: kafka-jumbio: provision broker 1018 [puppet] - 
10https://gerrit.wikimedia.org/r/1148335 (https://phabricator.wikimedia.org/T377874) [13:29:12] hashar: Understood, now I can see it. Works perfectly, thanks for the deploy! [13:29:18] ah great [13:29:21] I will continue [13:29:33] and the detailed help you gave :D [13:29:37] 06SRE, 10MW-on-K8s, 06serviceops, 07Python3-Porting: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764#10839219 (10Jdforrester-WMF) >>! In T384764#10629639, @gerritbot wrote: > Change #1127075 had a related patch set uploaded (by Reedy; author: Reedy): > %%%[operation... [13:29:37] !log drain transport circuits landing on cr2-codfw of traffic before router upgrade (T364092) [13:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:41] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [13:29:51] and then eventually run: [13:29:51] mwscript-k8s --comment='core-Namespaces: Update Malay wiki (mswiki) namespaces - T394603' --follow -- namespaceDupes mswiki [13:30:04] fun thing [13:30:08] 06SRE, 10MW-on-K8s, 06serviceops, 07Python3-Porting: mwgrep cannot be used from a deployment host - https://phabricator.wikimedia.org/T384764#10839226 (10Reedy) 05Open→03Resolved a:03Reedy Nope, all good. 
[13:30:09] in spider pig I can press `Y` ;b [13:30:19] !log hashar@deploy1003 hashar, bunnypranav, anzx: Continuing with sync [13:30:21] (03PS2) 10Brouberol: kafka-jumbio: provision broker 1018 [puppet] - 10https://gerrit.wikimedia.org/r/1148335 (https://phabricator.wikimedia.org/T377874) [13:30:22] oh there is a button [13:30:54] (03CR) 10Ssingh: [V:03+1] "Pong: the reason I was mostly waiting was to see if there are any fresh concerns (after the review on Jan 7) and also because other stuff " [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh) [13:31:07] (03PS2) 10Brouberol: kafka-jumbo: provision broker 1016 [puppet] - 10https://gerrit.wikimedia.org/r/1148333 (https://phabricator.wikimedia.org/T377874) [13:31:07] (03PS2) 10Brouberol: kafka-jumbo: provision broker 1017 [puppet] - 10https://gerrit.wikimedia.org/r/1148334 (https://phabricator.wikimedia.org/T377874) [13:31:07] (03PS3) 10Brouberol: kafka-jumbo: provision broker 1018 [puppet] - 10https://gerrit.wikimedia.org/r/1148335 (https://phabricator.wikimedia.org/T377874) [13:32:06] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1083 [13:32:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1083 to cirrussearch1083 [13:33:13] hashar: resetAuthenticationThrottle.php https://wikitech.wikimedia.org/wiki/Increasing_account_creation_threshold needed to be run , as workshop happening tomorrow [13:33:51] ahh good to know, thank you! [13:34:10] hashar: Are you still deploying? 
[13:34:14] Sorry I'm late [13:34:14] yeah [13:34:16] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5607/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148335 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [13:34:22] I am waiting for both changes to have fully deployed [13:34:26] soon ™ [13:35:11] (03PS3) 10Brouberol: kafka-jumbo: provision broker 1016 [puppet] - 10https://gerrit.wikimedia.org/r/1148333 (https://phabricator.wikimedia.org/T377874) [13:35:11] (03PS3) 10Brouberol: kafka-jumbo: provision broker 1017 [puppet] - 10https://gerrit.wikimedia.org/r/1148334 (https://phabricator.wikimedia.org/T377874) [13:35:11] (03PS4) 10Brouberol: kafka-jumbo: provision broker 1018 [puppet] - 10https://gerrit.wikimedia.org/r/1148335 (https://phabricator.wikimedia.org/T377874) [13:36:15] 80% [13:37:23] !log hashar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148318|IP cap lift request for Leeds University 21 May (T394639)]], [[gerrit:1147139|core-Namespaces: Update Malay wiki (mswiki) namespaces (T394603)]] (duration: 15m 28s) [13:37:27] T394639: Temporary IP lift request for Leeds University Wednesday 21 May 1130-1630 UTC - https://phabricator.wikimedia.org/T394639 [13:37:28] T394603: Configure the namespaces on Malay Wikipedia - https://phabricator.wikimedia.org/T394603 [13:37:43] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1082.eqiad.wmnet with OS bullseye [13:37:43] (03PS2) 10Alexandros Kosiaris: staging-codfw: Specify MTU of 1460 [puppet] - 10https://gerrit.wikimedia.org/r/1148326 (https://phabricator.wikimedia.org/T352956) [13:37:45] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148326 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [13:37:47] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host 
cirrussearch1082 [13:37:47] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1082 [13:38:04] (03PS1) 10Slyngshede: Styling: Fix logo alignment on mobile [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1148336 [13:38:07] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 11 hosts with reason: replace cr2-codfw switch control boards and install new line card [13:38:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10839281 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5afc68ed-eba5-4a71-b833-f809ae58201b) set by cmooney@cumin1003 for 4:00:00 on 11... [13:38:24] (03PS1) 10Vgutierrez: liberica: Don't deploy ipip-multiqueue-optimizer with katran [puppet] - 10https://gerrit.wikimedia.org/r/1148337 (https://phabricator.wikimedia.org/T380450) [13:38:28] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1083.eqiad.wmnet with OS bullseye [13:38:31] I am running the maintenance scripts now [13:38:32] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1083 [13:38:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1083 [13:39:15] 0 links to fix, 0 were resolvable, 0 were deleted. 
[13:39:15] hmm [13:39:35] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5608/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148335 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [13:40:04] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148337 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [13:40:21] hashar: The old links should still work, because the old namespace name is still an alias to the new name. Does that change anything? [13:40:46] !log disabling bgp groups on cr2-codfw ahead of upgrade/line-card install (T364092) [13:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:50] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [13:41:13] bunnypranav: I have pasted the output on the task https://phabricator.wikimedia.org/T394603#10839297 [13:41:19] basically nothing needed to be fixed :] [13:41:38] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [13:41:43] Thanks! [13:41:46] hashar: thanks for deploying [13:42:30] hashar: I would like one more help from you, if you have +2 in extensions/CampaignEvents, could you merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CampaignEvents/+/1147142? [13:42:43] It's related to the same phab ticket and request [13:44:22] anzx: and I have reset the throttle (hopefully) [13:44:37] phuedx: patches deployed [13:44:46] phuedx: so you can backport your change [13:45:00] hashar: Thanks. I might try using SpiderPig [13:45:10] You should!
[13:45:16] (03CR) 10Alexandros Kosiaris: [C:03+2] staging-codfw: Specify MTU of 1460 [puppet] - 10https://gerrit.wikimedia.org/r/1148326 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [13:45:41] hashar: Starting [13:45:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by phuedx@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147796 (https://phabricator.wikimedia.org/T394457) (owner: 10Phuedx) [13:45:55] anzx: I know nothing about CampaignEvents , but you can specify on that change that the mediawiki-config one has already been deployed [13:46:19] phuedx: we can pair over a video if you want :) [13:46:19] FIRING: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr2-codfw (10.192.254.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [13:46:31] anzx: sorry wrong person [13:46:35] hashar: Ok thanks! [13:46:40] :) [13:46:59] :) [13:47:08] (03Merged) 10jenkins-bot: ext-EventStreamConfig: Update product_metrics.web_base stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147796 (https://phabricator.wikimedia.org/T394457) (owner: 10Phuedx) [13:47:30] !log phuedx@deploy1003 Started scap sync-world: Backport for [[gerrit:1147796|ext-EventStreamConfig: Update product_metrics.web_base stream (T394457)]] [13:47:33] (03CR) 10Hashar: "As a note: @bunnypranav.wiki@bunnyorg.in has mentioned CampaignEvents has a related change https://gerrit.wikimedia.org/r/c/mediawiki/exte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147139 (https://phabricator.wikimedia.org/T394603) (owner: 10Bunnypranav) [13:47:34] T394457: Add mediawiki_skin and mediawiki_database to base web stream's contextual attributes - https://phabricator.wikimedia.org/T394457 [13:47:36] (03PS1) 10Muehlenhoff: New structure for sshd_config starting with trixie (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 
(https://phabricator.wikimedia.org/T393762) [13:47:45] bunnypranav: thank you for having successfully configured a wiki! [13:48:09] FIRING: [4x] CoreBGPDown: Core BGP session down between ssw1-a8-codfw and cr2-codfw (10.192.254.6) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:48:19] Yay! [13:48:37] there is another change by `HouseOfM` [config] 1146628 (Deploy change) release CampaignEvents to cbk-zam wiki - task T393604 [13:48:38] T393604: Enable Extension:CampaignEvents on cbk-zam.wikipedia.org - https://phabricator.wikimedia.org/T393604 [13:48:48] hashar: Thanks! This is so easy though! [13:49:23] phuedx: yeah don't you feel spoiled by deployment? :b [13:49:37] click > click > click . Done! [13:50:01] hashar: Absolutely. I said on the announcement thread that I remember having to manually bump .gitmodules. This is The Future™ [13:50:11] I am not that far from writing a webdriver.io script to drive spider pig :) [13:50:14] (03CR) 10Btullis: [C:03+1] airflow: display an info message explaining how to destroy a devenv [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148329 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [13:50:15] (03CR) 10CI reject: [V:04-1] New structure for sshd_config starting with trixie (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1148338 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:50:23] (03CR) 10Volans: "AFAICT noone is using the `remote_name` parameter beside `modules/homer/manifests/init.pp`." [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [13:50:33] TBH, my first backport was to a private wiki, so the deployer (who also happens to have access to that wiki), checked it themselves. So I'd call this my first backport. Thanks again hashar for making this a reality! [13:51:17] bunnypranav: you are very welcome!! 
[13:52:09] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1082.eqiad.wmnet with reason: host reimage [13:52:17] (03PS1) 10Muehlenhoff: sshd: Remove dead template argument [puppet] - 10https://gerrit.wikimedia.org/r/1148342 (https://phabricator.wikimedia.org/T393762) [13:52:24] (03CR) 10Brouberol: [C:03+2] airflow: display an info message explaining how to destroy a devenv [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148329 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [13:52:48] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1083.eqiad.wmnet with reason: host reimage [13:53:25] !log phuedx@deploy1003 phuedx: Backport for [[gerrit:1147796|ext-EventStreamConfig: Update product_metrics.web_base stream (T394457)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:53:28] T394457: Add mediawiki_skin and mediawiki_database to base web stream's contextual attributes - https://phabricator.wikimedia.org/T394457 [13:54:35] Checked on k8s-mwdebug. 
The stream config is up to date [13:54:39] !log phuedx@deploy1003 phuedx: Continuing with sync [13:54:46] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1087 to cirrussearch1087 [13:54:58] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:55:07] (03PS1) 10Vgutierrez: hiera: Enable edge uniques on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) [13:55:25] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:55:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1082.eqiad.wmnet with reason: host reimage [13:55:58] (03CR) 10Muehlenhoff: [C:03+2] Add two more maps-replicas on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1148327 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:56:34] !log rebooting backup routing-engine RE1 on cr2-codfw to install JunOS upgrade (T364092) [13:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:38] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [13:57:22] (03CR) 10Elukey: [C:03+2] icinga: skip downtimed services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [13:58:01] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 10Release-Engineering-Team (Radar): codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10839331 (10MoritzMuehlenhoff) >>! In T393873#10838820, @Jelto wrote: > Thanks @Dzahn for picking this up. I left a c... 
[13:58:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148342 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [13:58:29] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:55] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1087 to cirrussearch1087 - bking@cumin2002" [13:59:10] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10839349 (10MatthewVernon) [13:59:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1083.eqiad.wmnet with reason: host reimage [13:59:51] (03CR) 10Dzahn: [C:03+2] "ah, right, good point! will change. no biggie, the VM needs to be reimaged still regardless" [puppet] - 10https://gerrit.wikimedia.org/r/1147855 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [14:00:06] the first patch in the list will not be deployed, confirmed by Michelle over Slack (Gerrit #1146628 - release CampaignEvents to cbk-zam wiki [14:01:40] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet,db1216.eqiad.wmnet with reason: Move s8 to s3 [14:01:45] !log phuedx@deploy1003 Finished scap sync-world: Backport for [[gerrit:1147796|ext-EventStreamConfig: Update product_metrics.web_base stream (T394457)]] (duration: 14m 14s) [14:01:48] T394457: Add mediawiki_skin and mediawiki_database to base web stream's contextual attributes - https://phabricator.wikimedia.org/T394457 [14:02:01] bking@cumin2002 rename (PID 2760430) is awaiting input [14:02:11] hashar: Done. 
That was so cool [14:02:24] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1087 to cirrussearch1087 - bking@cumin2002" [14:02:24] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:02:24] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1087 on all recursors [14:02:27] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1087 on all recursors [14:02:28] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1087 [14:03:29] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:03:35] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1087 [14:03:44] !log switching active routing-engine to RE1 on cr2-codfw (this will cause protocol adjacencies to flap) (T364092) [14:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:47] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [14:03:58] (03PS28) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [14:03:58] (03PS2) 10Vgutierrez: hiera: Enable edge uniques on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) [14:03:58] (03PS1) 10Vgutierrez: varnish: Deploy wmfuniq-vmod and related files CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1148346 (https://phabricator.wikimedia.org/T391411) [14:04:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename 
(exit_code=0) from elastic1087 to cirrussearch1087 [14:04:38] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148346 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:06:20] (03PS5) 10Jcrespo: dbbackups: Setup backups for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) [14:06:33] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for the Prometheus Bird exporter [puppet] - 10https://gerrit.wikimedia.org/r/1145802 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:06:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1087.eqiad.wmnet with OS bullseye [14:06:43] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1087 [14:06:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1087 [14:07:43] 07Puppet: validate systemd units - https://phabricator.wikimedia.org/T392629#10839410 (10MoritzMuehlenhoff) >>! In T392629#10835971, @jhathaway wrote: > since the validate cmd runs prior to writing the file to its destination. Right, I forgot about that. 
[14:07:52] (03Merged) 10jenkins-bot: icinga: skip downtimed services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [14:08:29] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:24] FIRING: [6x] CoreBGPDown: Core BGP session down between cr3-ulsfo and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:09:25] (03PS2) 10Hnowlan: mw::maintenance: migrate all refreshLinkRecommendations jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) [14:10:16] (03PS1) 10Muehlenhoff: Enable the remaining two maps nodes as replicas [puppet] - 10https://gerrit.wikimedia.org/r/1148351 (https://phabricator.wikimedia.org/T381565) [14:10:55] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10839416 (10MoritzMuehlenhoff) [14:11:03] (03CR) 10Ssingh: [C:03+1] varnish: Deploy wmfuniq-vmod and related files CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1148346 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:11:08] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10839417 (10MoritzMuehlenhoff) [14:11:20] (03CR) 10Jcrespo: [C:03+2] dbbackups: Setup backups for x3 [puppet] - 10https://gerrit.wikimedia.org/r/1148278 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [14:11:51] phuedx: \o/ :) [14:12:07] 06SRE, 10LDAP-Access-Requests: Grant Access to for - 
https://phabricator.wikimedia.org/T394784 (10Mmta) 03NEW [14:12:14] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10839431 (10RobH) Sorry, I meant to update this task with that info sooner! Serial-ATA_Firmware_VJPKG_WN64_DL7C_A00.EXE The DL7C version. [14:13:09] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:13:34] (03CR) 10Hnowlan: "This should be unblocked now." [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [14:13:52] (03CR) 10Vgutierrez: [C:03+2] varnish: Deploy wmfuniq-vmod and related files CDN wide [puppet] - 10https://gerrit.wikimedia.org/r/1148346 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:14:01] (03CR) 10Btullis: [C:03+1] kafka-jumbo: provision broker 1016 [puppet] - 10https://gerrit.wikimedia.org/r/1148333 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [14:14:10] 06SRE, 10LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10839436 (10MBinder_WMF) I was able to login now, thanks! I think we can proceed with keeping "mbinder" and cleaning up "maxbinderWMF". I appreciate your patience. 
:) [14:14:24] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:14:25] (03CR) 10Btullis: [C:03+1] kafka-jumbo: provision broker 1017 [puppet] - 10https://gerrit.wikimedia.org/r/1148334 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [14:15:03] (03CR) 10Btullis: [C:03+1] kafka-jumbo: provision broker 1018 [puppet] - 10https://gerrit.wikimedia.org/r/1148335 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [14:15:43] (03CR) 10Brouberol: [C:03+2] kafka-jumbo: provision broker 1016 [puppet] - 10https://gerrit.wikimedia.org/r/1148333 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [14:15:46] (03CR) 10Brouberol: [C:03+2] kafka-jumbo: provision broker 1017 [puppet] - 10https://gerrit.wikimedia.org/r/1148334 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [14:15:49] (03CR) 10Brouberol: [V:03+1 C:03+2] kafka-jumbo: provision broker 1018 [puppet] - 10https://gerrit.wikimedia.org/r/1148335 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [14:18:09] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [14:18:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1082.eqiad.wmnet with OS bullseye [14:21:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1083.eqiad.wmnet with OS bullseye [14:21:20] (03CR) 10Elukey: [C:03+1] Enable the remaining two maps nodes as replicas [puppet] - 10https://gerrit.wikimedia.org/r/1148351 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:21:26] !log bking@cumin2002 START 
- Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1087.eqiad.wmnet with reason: host reimage [14:25:03] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1087.eqiad.wmnet with reason: host reimage [14:25:39] (03PS1) 10Vgutierrez: varnish: Fix edge uniques secret deployment [puppet] - 10https://gerrit.wikimedia.org/r/1148355 (https://phabricator.wikimedia.org/T391411) [14:26:34] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148355 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:30:11] (03CR) 10Ssingh: [C:03+1] varnish: Fix edge uniques secret deployment [puppet] - 10https://gerrit.wikimedia.org/r/1148355 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:30:36] (03CR) 10Vgutierrez: [C:03+2] varnish: Fix edge uniques secret deployment [puppet] - 10https://gerrit.wikimedia.org/r/1148355 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:31:00] PROBLEM - Kafka Broker Server #page on kafka-jumbo1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:31:00] (03PS2) 10Slyngshede: Styling: Fix logo alignment on mobile [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1148336 [14:31:12] checking [14:31:18] was there any maintenance ongoing? [14:31:31] acked [14:31:39] jynusm jhathaway ^ [14:31:40] jynus: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/1398dec4f8428b8e4b553a6e6c1ef9337e88e089 [14:31:42] o/ [14:31:58] is it production? 
[14:32:00] (03CR) 10Dzahn: [C:03+2] site: add zuul VMs with collab-insetup-role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1147855 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [14:32:06] nono those are new afaics [14:32:13] Balthazar is adding them [14:32:18] ah, so WIP [14:32:24] cc brouberol --^ [14:32:29] ok, let me double check no prod impact then [14:33:16] yes, sorry, I was working on adding them to the cluster [14:33:28] not receiving real messages then? [14:33:32] (yet) [14:33:43] and then I realized that their disk layout was completely borked, and I stopped kafka to prevent it from storing data on a 400GB disk [14:34:04] hmm is not in production but kafka-jumbo1018 is already present on haproxykafka server list BTW [14:34:06] I see disk usage decreasing a lot on jumbo-eqiad [14:34:11] (03PS1) 10FNegri: wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T351637) [14:34:18] I'll mute the alerts for these hosts, and we'll have to reimage the hosts to fix the disk layout [14:34:21] optionally we could add a "profile::monitoring::notifications_enabled: false" to those hosts until they are fully done [14:34:23] we did just stop sending legacy varnishkafka data recently [14:34:26] https://www.irccloud.com/pastebin/ct68MRfh/ [14:34:32] bblack: thanks [14:34:32] that could be causing decrease on jumbo [14:34:41] (03PS1) 10Dzahn: site: switch zuul3 VMs to use ferm insetup role, not nftables [puppet] - 10https://gerrit.wikimedia.org/r/1148359 (https://phabricator.wikimedia.org/T393873) [14:34:54] I lack a lot of recent context as I was out, so thanks to everybody providing it [14:35:11] (03PS1) 10Muehlenhoff: Add a variant of the test role which is Kerberos-enabled [puppet] - 10https://gerrit.wikimedia.org/r/1148360 [14:35:11] the disk decrease might just be the webrequest topics being phased out by their own retention [14:35:17] ok, what bblack said 
[14:35:18] PROBLEM - Kafka Broker Server #page on kafka-jumbo1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:35:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1147881 (https://phabricator.wikimedia.org/T391248) (owner: 10Scardenasmolinar) [14:35:32] all in all, I'll mute ^ these ^ for 1016 -> 1018 [14:35:32] is 16 the same? [14:35:37] I see, thanks [14:35:40] 1016, 1017, 1018 [14:35:49] ok, so seeing no impact, let's deescalate [14:36:00] * herron wonders what happens if you run a 1 line sleep script named java [14:36:04] vgutierrez: did you see something relevant from the opposite? [14:36:10] or something to improve? [14:36:20] (03CR) 10CI reject: [V:04-1] Add a variant of the test role which is Kerberos-enabled [puppet] - 10https://gerrit.wikimedia.org/r/1148360 (owner: 10Muehlenhoff) [14:36:46] for context on what I said earlier: we were redundantly sending webrequest streams from haproxy (new) and varnish (old). We finally stopped sending the old one ~today, which causes a big dropoff in total data being sent through kafka. 
You can see the dropoff of the old topic here: [14:36:51] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=2025-05-20T06:03:23.775Z&to=2025-05-20T12:03:23.775Z&timezone=utc&var-datasource=000000006&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=$__all&var-topic=webrequest_text&viewPanel=panel-34 [14:37:16] yeah, no worries [14:37:16] 06SRE: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788 (10nisrael) 03NEW [14:37:19] (03CR) 10Dzahn: [C:03+2] gerrit: remove temp firewall rule for hackathon [puppet] - 10https://gerrit.wikimedia.org/r/1148275 (https://phabricator.wikimedia.org/T382309) (owner: 10Jelto) [14:37:24] silence put in place. Sorry about the heartjolt [14:37:25] I was asking about the "present on haproxykafka" [14:37:32] (03CR) 10Dzahn: [C:03+2] "thanks for creating this, yep!" [puppet] - 10https://gerrit.wikimedia.org/r/1148275 (https://phabricator.wikimedia.org/T382309) (owner: 10Jelto) [14:37:41] if that was problematic or shouldn't be there yet, etc [14:37:45] thanks brouberol [14:38:54] I'll also revert that stack https://gerrit.wikimedia.org/r/c/operations/puppet/+/1148335 to remove these 3 brokers from haproxykafka [14:39:04] ok, thanks [14:39:09] (03PS2) 10Dzahn: site: switch zuul3 VMs to use ferm insetup role, not nftables [puppet] - 10https://gerrit.wikimedia.org/r/1148359 (https://phabricator.wikimedia.org/T393873) [14:39:33] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:39:44] ^muting these as well^ [14:39:57] (03CR) 10Dzahn: [C:03+2] "Notice: /Stage[main]/Nftables/File[/etc/nftables/input/11_accept-hackathon-istanbul.nft]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/1148275 (https://phabricator.wikimedia.org/T382309) (owner: 10Jelto) [14:39:58] !log switching active routing-engine to RE0 on cr2-codfw 
(this will cause protocol adjacencies to flap) (T364092) [14:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:02] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [14:41:04] unrelated, but I also saw some alerts about systemd on ms-fe1009 CC Emperor [14:41:40] (03CR) 10Dzahn: [C:03+2] "ok, I did both. and thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1140507 (https://phabricator.wikimedia.org/T393034) (owner: 10Dzahn) [14:42:00] it recovered again pretty quickly (by the time I actually got to checking, was in a meeting) [14:42:19] no worries, thanks [14:43:05] (03PS1) 10Brouberol: Revert "kafka-jumbo: provision broker 1018" [puppet] - 10https://gerrit.wikimedia.org/r/1148361 [14:43:07] (03PS1) 10Brouberol: Revert "kafka-jumbo: provision broker 1017" [puppet] - 10https://gerrit.wikimedia.org/r/1148362 [14:43:10] (03PS1) 10Brouberol: Revert "kafka-jumbo: provision broker 1016" [puppet] - 10https://gerrit.wikimedia.org/r/1148363 [14:43:12] I see some spikes on nf_conntrack on older kafka-jumbo hosts FYI https://grafana.wikimedia.org/goto/gvnB_Q-HR?orgId=1 [14:43:29] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:43:51] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:44:16] (03CR) 10CI reject: [V:04-1] Revert "kafka-jumbo: provision broker 1018" [puppet] - 10https://gerrit.wikimedia.org/r/1148361 (owner: 
10Brouberol) [14:44:23] (03CR) 10CI reject: [V:04-1] Revert "kafka-jumbo: provision broker 1017" [puppet] - 10https://gerrit.wikimedia.org/r/1148362 (owner: 10Brouberol) [14:44:32] (03PS2) 10Muehlenhoff: Add a variant of the test role which is Kerberos-enabled [puppet] - 10https://gerrit.wikimedia.org/r/1148360 [14:44:34] ACKNOWLEDGEMENT - Kafka Broker Server #page on kafka-jumbo1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties daniel_zahn work in progress https://phabricator.wikimedia.org/T377874 https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:44:34] ACKNOWLEDGEMENT - Kafka broker TLS certificate validity on kafka-jumbo1016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn work in progress https://phabricator.wikimedia.org/T377874 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:44:35] ACKNOWLEDGEMENT - Kafka Broker Server #page on kafka-jumbo1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties daniel_zahn work in progress https://phabricator.wikimedia.org/T377874 https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:44:35] ACKNOWLEDGEMENT - Kafka broker TLS certificate validity on kafka-jumbo1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn work in progress https://phabricator.wikimedia.org/T377874 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:44:36] (03CR) 10CI reject: [V:04-1] Revert "kafka-jumbo: provision broker 1016" [puppet] - 10https://gerrit.wikimedia.org/r/1148363 (owner: 10Brouberol) [14:45:04] (03PS2) 10Brouberol: Revert "kafka-jumbo: provision broker 1016" [puppet] - 10https://gerrit.wikimedia.org/r/1148363 [14:45:07] herron: it would only work if it also matches the full args ^ [14:45:15] (03PS2) 10Brouberol: Revert "kafka-jumbo: provision broker 
1017" [puppet] - 10https://gerrit.wikimedia.org/r/1148362 [14:45:23] (03PS2) 10Brouberol: Revert "kafka-jumbo: provision broker 1018" [puppet] - 10https://gerrit.wikimedia.org/r/1148361 [14:45:49] urandom: I'm going to revert these nodes back to insetup mode [14:46:34] sorry, wrong nick :/ I meant mutante [14:47:01] (03CR) 10Brouberol: [C:03+2] Revert "kafka-jumbo: provision broker 1018" [puppet] - 10https://gerrit.wikimedia.org/r/1148361 (owner: 10Brouberol) [14:47:03] brouberol: if you want. but you don't have to do that just to turn off the alerting [14:47:04] (03CR) 10Brouberol: [C:03+2] Revert "kafka-jumbo: provision broker 1017" [puppet] - 10https://gerrit.wikimedia.org/r/1148362 (owner: 10Brouberol) [14:47:07] (03CR) 10Brouberol: [C:03+2] Revert "kafka-jumbo: provision broker 1016" [puppet] - 10https://gerrit.wikimedia.org/r/1148363 (owner: 10Brouberol) [14:47:42] I've put silences in place, but we have to fix the parted config (yay) so there's no point keeping these nodes in that state anyway [14:47:50] there is going to be one more for 1017 in a moment [14:47:58] brouberol: gotcha! alright, thanks [14:48:08] PROBLEM - Kafka Broker Server #page on kafka-jumbo1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:48:30] acking that, too [14:48:36] ACKNOWLEDGEMENT - Kafka Broker Server #page on kafka-jumbo1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties daniel_zahn https://phabricator.wikimedia.org/T377874 https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:48:36] !log installing expat security updates [14:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:04] brouberol: just the other day I saw filippo made this new tool to test .. 
partman config btw [14:49:24] (03PS3) 10Muehlenhoff: Add a variant of the test role which is Kerberos-enabled [puppet] - 10https://gerrit.wikimedia.org/r/1148360 [14:50:04] (03PS1) 10Btullis: mediawiki-dumps-legacy: Bump dumps toolbox image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148365 (https://phabricator.wikimedia.org/T394389) [14:50:06] yep, I have this open in firefox. I think today is the day I get to test it [14:50:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1087.eqiad.wmnet with OS bullseye [14:51:07] ftr, reverting to insetup role might not actually remove monitoring.. but we can also use the downtime cookbook and set it to 3 weeks or whatever [14:51:23] (03PS3) 10Brouberol: Revert "kafka-jumbo: provision broker 1017" [puppet] - 10https://gerrit.wikimedia.org/r/1148362 [14:51:29] 06SRE, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10839588 (10bking) 05Open→03In progress a:03bking [14:51:46] brouberol: sounds like nice timing [14:51:49] !log klausman@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Enable Java security updates - klausman@cumin1002 [14:52:14] !log shutting down backup RE1 on cr2-codfw (T393552) [14:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:18] T393552: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552 [14:52:25] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.4.1 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148367 [14:52:26] (03CR) 10Dzahn: [C:03+2] site: switch zuul3 VMs to use ferm insetup role, not nftables [puppet] - 10https://gerrit.wikimedia.org/r/1148359 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [14:52:46] (03CR) 10Brouberol: [C:03+2] Revert "kafka-jumbo: provision 
broker 1017" [puppet] - 10https://gerrit.wikimedia.org/r/1148362 (owner: 10Brouberol) [14:53:04] (03PS3) 10Brouberol: Revert "kafka-jumbo: provision broker 1016" [puppet] - 10https://gerrit.wikimedia.org/r/1148363 [14:53:06] brouberol: feel free to type "multiple" and merge mine as well [14:53:26] does not affect production machines [14:53:30] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:53:39] !log shutting down control board 1 on cr2-codfw (T393552) [14:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:42] ack [14:54:00] (03PS3) 10Scott French: mw-(debug|web): switch train-dev to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137500 (https://phabricator.wikimedia.org/T391057) [14:54:00] (03CR) 10Scott French: "Unless you have any objections, I'd like to merge this and [0] at some point today, in order to get train-dev off of using 7.4 images (in " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137500 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [14:54:04] (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: Bump dumps toolbox image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148365 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [14:54:08] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 65 hosts with reason: eqiad is depooled, noisy alerts [14:54:14] 06SRE, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10839593 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b139a4cb-57ae-4d79-8869-06e835f82525) set by bki... 
[14:54:55] (03CR) 10Brouberol: [C:03+2] Revert "kafka-jumbo: provision broker 1016" [puppet] - 10https://gerrit.wikimedia.org/r/1148363 (owner: 10Brouberol) [14:55:46] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: Bump dumps toolbox image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148365 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [14:56:39] all reverts went through. I'll run sre.hosts.downtime on these nodes [14:56:43] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:56:47] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [14:58:00] (03PS6) 10JHathaway: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [14:58:13] !log brouberol@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 21 days, 0:00:00 on kafka-jumbo[1016-1018].eqiad.wmnet with reason: Parted config is broken causing the hosts to have no data disk [14:58:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10839616 (10elukey) Tried to dump the scp config via spicerack shell, and under `BIOS.Setup.1-1` I see `'BiosNvmeDriver': 'DellQualifiedDrives',`. So ideally we could check i... [14:58:22] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10839617 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8c2873fc-7051-429d-9a4b-eaa4b241ec27) set by br... 
[14:58:40] 06SRE, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10839619 (10bking) Thanks @jcrespo . I think the reimage cookbook must be removing downtimes. The help says it doesn't, but I... [14:58:52] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1148336 (owner: 10Slyngshede) [14:59:22] (03CR) 10JHathaway: homer: make private repo support multiple peers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [14:59:39] (03PS29) 10Vgutierrez: varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) [14:59:39] (03PS3) 10Vgutierrez: hiera: Enable edge uniques on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) [14:59:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [14:59:51] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:00:05] jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1500). [15:00:11] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: codfw: 1VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10839625 (10Dzahn) >>! In T393873#10838820, @Jelto wrote: > I don't think we can use `nftables` because we most likely use docker/containers... 
[15:00:15] (03CR) 10Slyngshede: [V:03+2 C:03+2] Styling: Fix logo alignment on mobile [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1148336 (owner: 10Slyngshede) [15:01:34] 06SRE, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10839636 (10jcrespo) > I think the reimage cookbook must be removing downtimes. Indeed. My recommendation (or at least what... [15:05:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:05:26] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:07:38] 06SRE, 10Observability-Metrics: Pyrra detail grafana dashboard contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797#10839653 (10herron) [15:08:09] (03CR) 10Clément Goubert: [C:03+1] mw::maintenance: migrate all refreshLinkRecommendations jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [15:08:29] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:51] 06SRE, 10Observability-Metrics: Pyrra detail grafana dashboard contains two panels displaying misleading data - https://phabricator.wikimedia.org/T393797#10839661 (10herron) Made a few more improvements to the Pyrra detail dashboard, more specifically to properly include cluster labels when present, and automa... 
[15:09:33] !log klausman@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Enable Java security updates - klausman@cumin1002 [15:10:43] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:11:19] (03PS4) 10Hnowlan: sre:rest-gateway: rename api gateway alert, decrease threshold [alerts] - 10https://gerrit.wikimedia.org/r/1147714 (https://phabricator.wikimedia.org/T394582) [15:11:25] (03CR) 10CI reject: [V:04-1] sre:rest-gateway: rename api gateway alert, decrease threshold [alerts] - 10https://gerrit.wikimedia.org/r/1147714 (https://phabricator.wikimedia.org/T394582) (owner: 10Hnowlan) [15:11:55] (03CR) 10Ssingh: "Should we run this on a non-affected host as well to ensure NOOP?" [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:12:47] 06SRE, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10839691 (10Gehel) [15:13:05] (03CR) 10Vgutierrez: "you got that PCC run on the previous CR in this chain" [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:13:22] (03CR) 10Ssingh: [C:03+1] hiera: Enable edge uniques on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:13:32] (03CR) 10Scott French: mediawiki: Allow varying the entrypoint through mwscript values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [15:13:56] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector [15:15:35] !log 
eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Upgrading to Java 11.0.27 - eevans@cumin1002 [15:16:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10839700 (10ssingh) We discussed this in the Traffic meeting and there are no concerns from our side in moving ahead. Let us know the date/time we want this roll out and we can coordinate! [15:18:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10839707 (10BCornwall) @VRiley-WMF! Is there something still needed on our side or are we good to go? [15:19:18] (03PS1) 10Fabfur: haproxy: normalize host header [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) [15:19:19] (03CR) 10Elukey: [C:03+1] Add a variant of the test role which is Kerberos-enabled [puppet] - 10https://gerrit.wikimedia.org/r/1148360 (owner: 10Muehlenhoff) [15:21:09] (03PS1) 10Clément Goubert: mw::maintenance: Migrate wikidata_resubmit_changes_for_dispatch [puppet] - 10https://gerrit.wikimedia.org/r/1148374 (https://phabricator.wikimedia.org/T388538) [15:21:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector [15:22:26] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10839754 (10MoritzMuehlenhoff) >>! In T394263#10839700, @ssingh wrote: > We discussed this in the Traffic meeting and there are no concerns from our side in moving ahead. Let us know th... [15:24:09] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10839756 (10ssingh) >>! In T394263#10839754, @MoritzMuehlenhoff wrote: >>>! 
In T394263#10839700, @ssingh wrote: >> We discussed this in the Traffic meeting and there are no concerns fro... [15:25:13] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148374 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [15:28:29] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:28:37] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:29:22] (03CR) 10Hnowlan: [C:03+1] mw::maintenance: Migrate wikidata_resubmit_changes_for_dispatch [puppet] - 10https://gerrit.wikimedia.org/r/1148374 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [15:31:42] 06SRE: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10839811 (10Aklapper) [15:31:43] 06SRE, 06Fundraising-Backlog, 06Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#10839809 (10Aklapper) →14Duplicate dup:03T394788 [15:32:49] RECOVERY - Restbase root url on restbase1041 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/RESTBase [15:33:45] (03CR) 10Federico Ceratto: "There's a "-1" on the CR regarding the addition of downtime and an unresolved comment. 
Please let me know if there are more changes requir" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [15:33:47] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur) [15:34:57] (03PS2) 10Clément Goubert: mw::maintenance: Migrate wikidata_resubmit_changes_for_dispatch [puppet] - 10https://gerrit.wikimedia.org/r/1148374 (https://phabricator.wikimedia.org/T388538) [15:35:05] (03CR) 10BBlack: [C:03+1] "I think we're in a good?" [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:35:26] (03PS8) 10Federico Ceratto: sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) [15:36:25] (03CR) 10Federico Ceratto: "I just rebased the CR, it should be ready for review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) (owner: 10Federico Ceratto) [15:38:29] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:40] (03PS5) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) [15:39:22] (03PS11) 10Federico Ceratto: Add switchover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) [15:39:26] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [15:42:02] (03CR) 10Federico Ceratto: "I rebased the CR, @marostegui@wikimedia.org I addressed your comments." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [15:42:15] (03CR) 10Scott French: [C:03+1] mw::maintenance: Migrate wikidata_resubmit_changes_for_dispatch [puppet] - 10https://gerrit.wikimedia.org/r/1148374 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [15:42:57] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance: Migrate wikidata_resubmit_changes_for_dispatch [puppet] - 10https://gerrit.wikimedia.org/r/1148374 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [15:44:19] (03CR) 10Vgutierrez: [C:03+2] varnish: Issue and handle WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1142551 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:44:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10839933 (10Jhancock.wm) [15:49:14] !log klausman@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Enable Java security updates - klausman@cumin1002 [15:49:51] !log vgutierrez@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4038.ulsfo.wmnet [15:51:55] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:52:19] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:53:09] FIRING: [6x] CoreBGPDown: Core BGP session down between cr3-ulsfo and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:54:18] !log vgutierrez@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4038.ulsfo.wmnet [15:54:24] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:57:50] !log depooling cp4037 before enabling edge uniques - T391411 [15:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:54] T391411: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411 [15:58:00] !log vgutierrez@puppetserver1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [15:59:16] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1148343 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [15:59:17] (03PS2) 10RLazarus: mediawiki: Allow varying the entrypoint through mwscript values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) [16:00:05] jhathaway and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1600). nyaa~ [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:18] (03PS1) 10AOkoth: wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) [16:02:16] (03CR) 10RLazarus: mediawiki: Allow varying the entrypoint through mwscript values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [16:02:31] (03PS2) 10AOkoth: wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) [16:03:09] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:49] PROBLEM - Restbase root url on restbase1041 is CRITICAL: connect to address 10.64.48.40 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [16:06:57] !log klausman@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Enable Java security updates - klausman@cumin1002 [16:09:46] (03PS2) 10Hnowlan: (api|rest)-gateway: log 5xx errors by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) [16:11:15] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [16:11:23] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [16:11:25] (03PS1) 10Vgutierrez: varnish: Fix 
non wikimedia.org Domain value for WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1148381 (https://phabricator.wikimedia.org/T391411) [16:12:11] (03CR) 10Hnowlan: (api|rest)-gateway: log 5xx errors by default (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) (owner: 10Hnowlan) [16:15:16] (03PS7) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [16:15:48] (03PS8) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [16:17:44] (03PS2) 10Vgutierrez: varnish: Fix non wikimedia.org Domain value for WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1148381 (https://phabricator.wikimedia.org/T391411) [16:18:00] (03CR) 10Ssingh: [C:03+1] varnish: Fix non wikimedia.org Domain value for WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1148381 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:19:11] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [16:19:49] RECOVERY - Restbase root url on restbase1041 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/RESTBase [16:22:00] (03CR) 10Vgutierrez: [C:03+2] varnish: Fix non wikimedia.org Domain value for WMF-Uniq cookie [puppet] - 10https://gerrit.wikimedia.org/r/1148381 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:22:17] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 10Release-Engineering-Team (Radar): eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10840115 (10Dzahn) [16:38:09] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-codfw 
(208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:40:06] (03PS1) 10Hnowlan: mw::maintenance: move uncategorizedtemplates and wantedtemplates to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148384 (https://phabricator.wikimedia.org/T388534) [16:40:16] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1148385 [16:40:16] (03PS1) 10Ahmon Dancy: Disable wmgUsePoolCounter in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1148386 [16:40:19] (03PS3) 10Volans: git::clone: fix support for different remote name [puppet] - 10https://gerrit.wikimedia.org/r/1148267 [16:40:35] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1148385 (owner: 10Ahmon Dancy) [16:40:48] (03PS9) 10Volans: homer: make private repo support multiple peers [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) [16:40:56] (03CR) 10Ahmon Dancy: [C:03+2] Disable wmgUsePoolCounter in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1148386 (owner: 10Ahmon Dancy) [16:42:16] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1148385 (owner: 10Ahmon Dancy) [16:42:17] (03Merged) 10jenkins-bot: Disable wmgUsePoolCounter in train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1148386 (owner: 10Ahmon Dancy) [16:43:50] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Upgrading to Java 11.0.27 - eevans@cumin1002 [16:44:02] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 
10Release-Engineering-Team (Radar): eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10840208 (10Dzahn) after today's meetings with SRE-collab/releng/zuul author we update this request to 3 VMs p... [16:44:43] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [16:44:52] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [16:45:38] (03CR) 10Ahmon Dancy: [C:03+2] mw-(debug|web): switch train-dev to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137500 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [16:46:59] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Upgrading to Java 11.0.27 - eevans@cumin1002 [16:50:36] (03CR) 10Clément Goubert: [C:03+1] mw::maintenance: move uncategorizedtemplates and wantedtemplates to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148384 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [16:51:17] (03CR) 10Ahmon Dancy: [C:03+1] mw-(debug|web): switch train-dev to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137500 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [16:53:31] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host zuul2001.codfw.wmnet with OS bullseye [16:53:36] (03CR) 10Scott French: [C:03+1] mw::maintenance: move uncategorizedtemplates and wantedtemplates to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148384 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [16:53:38] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 10Release-Engineering-Team (Radar): eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10840269 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
dzahn@cumin1002 for host z... [16:53:54] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: move uncategorizedtemplates and wantedtemplates to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148384 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [16:54:03] jouncebot: nowandnext [16:54:04] For the next 0 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1600) [16:54:04] In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1700) [16:54:42] (03CR) 10BCornwall: [C:03+1] wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [16:55:14] (03CR) 10Volans: "Puppet Compiler results available here:" [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [16:56:41] jouncebot: nowandnext [16:56:41] For the next 0 hour(s) and 3 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1600) [16:56:41] In 0 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1700) [16:56:49] 06SRE, 10Observability-Metrics, 05Goal, 13Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870#10840289 (10dancy) I'm seeing these errors in `/var/log/messages` on `contint1002.wikimedia.org`: ` May 20 16:52:31 contint1002 statsite@8125[504329]: Traceback (most r... 
[16:57:45] (03PS1) 10Reedy: extension-list: Add ConfirmEdit/hCaptcha/extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148389 (https://phabricator.wikimedia.org/T382148) [16:57:46] (03PS1) 10Reedy: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) [16:58:20] (03PS1) 10Hnowlan: mw::maintenance: migrate lonelypages job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148391 (https://phabricator.wikimedia.org/T388534) [16:58:28] (03CR) 10Reedy: [C:03+2] "Start including messages in localisation cache when it updates..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148389 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [16:58:53] (03CR) 10BPirkle: [C:03+1] trafficserver: route testwiki reading lists APIs without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1148285 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [16:59:21] (03Merged) 10jenkins-bot: extension-list: Add ConfirmEdit/hCaptcha/extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148389 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [17:00:04] swfrench-wmf: #bothumor I � Unicode. All rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T1700). [17:00:20] o/ [17:01:31] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:01:49] Reedy: are you in the midst of / about to deploy https://gerrit.wikimedia.org/r/1148389 ? [17:02:27] swfrench-wmf: It can wait (though I should at least git pull it onto the deployment host) as it's a noop unless a big full scap runs [17:02:50] done the git pull [17:03:16] swfrench-wmf: go ahead :) [17:03:30] Reedy: got it, thanks! 
FWIW, I will be running scap, but with `--stop-before-sync` and image builds disabled, so I think we're good :) [17:03:58] (03CR) 10Scott French: [C:03+2] hieradata: switch mw-debug pinkunicorn to PHP 8.1 (1 of 2) [puppet] - 10https://gerrit.wikimedia.org/r/1137498 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [17:04:09] (03CR) 10Scott French: [C:03+2] "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1137498 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [17:04:51] (03PS1) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 [17:08:03] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul2001.codfw.wmnet with reason: host reimage [17:08:05] (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (owner: 10FNegri) [17:08:08] (03PS1) 10Reedy: Revert "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148397 (https://phabricator.wikimedia.org/T394814) [17:08:09] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:08:18] (03CR) 10Reedy: [C:03+2] Revert "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148397 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy) [17:08:47] (03PS5) 10BCornwall: varnish: Replace date/stamp headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) [17:08:47] (03CR) 10BCornwall: [V:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:09:51] (03CR) 10FNegri: [C:04-1] "I 
want to clean up a few things before merging, pushing for early reviews." [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (owner: 10FNegri) [17:09:57] (03Merged) 10jenkins-bot: Revert "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148397 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy) [17:10:29] (03PS2) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [17:10:55] (03PS1) 10Reedy: DNM due to T394814 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148398 [17:11:02] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 11 hosts with reason: replace cr2-codfw switch control boards and install new line card [17:11:03] (03CR) 10Reedy: [C:04-2] DNM due to T394814 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148398 (owner: 10Reedy) [17:11:06] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2001.codfw.wmnet with reason: host reimage [17:11:07] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10840439 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e24daea6-0330-4b79-bf33-b9e0f9709a10) set by cmooney@cumin1003 for 2:00:00 on 11... 
[17:11:38] (03CR) 10Scott French: [C:03+2] mw-debug: switch mw-debug pinkunicorn to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137499 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [17:13:16] !log swfrench@deploy1003 Started scap sync-world: Non-deploy scap run to switch mw-debug/pinkunicorn to PHP 8.1 - T391057 [17:13:20] T391057: Turn down MediaWiki image builds for PHP 7.4 - https://phabricator.wikimedia.org/T391057 [17:13:26] (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [17:13:37] !log swfrench@deploy1003 Stopping before sync operations [17:14:16] (03PS1) 10Vgutierrez: varnish: Skip PURGE requests for WMF-Uniq handling [puppet] - 10https://gerrit.wikimedia.org/r/1148399 (https://phabricator.wikimedia.org/T391411) [17:15:54] (03Merged) 10jenkins-bot: mw-debug: switch mw-debug pinkunicorn to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137499 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [17:15:55] 10SRE-Access-Requests, 06collaboration-services, 10Continuous-Integration-Infrastructure (Zuul upgrade): create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819 (10Dzahn) 03NEW [17:16:34] 10SRE-Access-Requests, 06collaboration-services, 10Continuous-Integration-Infrastructure (Zuul upgrade): create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10840512 (10Dzahn) Who, besides James and global roots, should have access to zuul(3) VMs? 
[17:16:45] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [17:16:54] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:17:17] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:17:18] (03CR) 10BBlack: [C:03+1] varnish: Skip PURGE requests for WMF-Uniq handling [puppet] - 10https://gerrit.wikimedia.org/r/1148399 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [17:18:15] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:18:39] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:19:13] (03PS4) 10Scott French: mw-(debug|web): switch train-dev to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137500 (https://phabricator.wikimedia.org/T391057) [17:20:15] (03PS2) 10Vgutierrez: varnish: Skip PURGE requests for WMF-Uniq handling [puppet] - 10https://gerrit.wikimedia.org/r/1148399 (https://phabricator.wikimedia.org/T391411) [17:21:08] !log enable FPC 0 (10x100G) card in cr2-codfw (T393552) [17:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:12] T393552: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552 [17:23:14] (03CR) 10Vgutierrez: [C:03+2] varnish: Skip PURGE requests for WMF-Uniq handling [puppet] - 10https://gerrit.wikimedia.org/r/1148399 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [17:23:18] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10840540 (10Dzahn) [17:23:34] (03CR) 10Ssingh: [C:03+1] varnish: Skip PURGE requests for WMF-Uniq handling [puppet] - 10https://gerrit.wikimedia.org/r/1148399 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [17:24:01] (03CR) 10BCornwall: [C:03+1] varnish: 
Skip PURGE requests for WMF-Uniq handling [puppet] - 10https://gerrit.wikimedia.org/r/1148399 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [17:24:59] (03CR) 10Scott French: [C:03+2] mw-(debug|web): switch train-dev to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137500 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [17:25:06] (03CR) 10BBlack: [C:03+1] varnish: Skip PURGE requests for WMF-Uniq handling [puppet] - 10https://gerrit.wikimedia.org/r/1148399 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [17:26:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:27:36] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul2001.codfw.wmnet with OS bullseye [17:27:42] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10840552 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul2001.codfw.wmnet with OS... 
[17:27:42] (03Merged) 10jenkins-bot: mw-(debug|web): switch train-dev to PHP 8.1 (2 of 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1137500 (https://phabricator.wikimedia.org/T391057) (owner: 10Scott French) [17:29:13] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host zuul1001.eqiad.wmnet [17:29:14] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [17:30:15] (03PS1) 10Bking: elastic/cirrussearch: silence eqiad, improve alert routing [puppet] - 10https://gerrit.wikimedia.org/r/1148402 (https://phabricator.wikimedia.org/T388610) [17:30:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148402 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:31:17] (03PS1) 10Jdlrobson: Enable ReadingList beta feature on test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148403 (https://phabricator.wikimedia.org/T392008) [17:31:22] !log repool cp4037 with edge uniques enabled, stats available on https://grafana.wikimedia.org/goto/fYSIMlaHR?orgId=1 - T391411 [17:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:27] T391411: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411 [17:31:33] !log vgutierrez@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [17:32:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148403 (https://phabricator.wikimedia.org/T392008) (owner: 10Jdlrobson) [17:32:22] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul1001.eqiad.wmnet - dzahn@cumin1002" [17:32:29] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul1001.eqiad.wmnet - dzahn@cumin1002" [17:32:29] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:32:30] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache zuul1001.eqiad.wmnet on all recursors [17:32:34] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zuul1001.eqiad.wmnet on all recursors [17:33:05] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul1001.eqiad.wmnet - dzahn@cumin1002" [17:33:11] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul1001.eqiad.wmnet - dzahn@cumin1002" [17:35:13] (03PS6) 10BCornwall: varnish: Replace date/stamp headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) [17:36:12] dzahn@cumin1002 makevm (PID 3969838) is awaiting input [17:37:54] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1148402 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:39:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/LiquidThreads] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146654 (https://phabricator.wikimedia.org/T394025) (owner: 10Jforrester) [17:40:31] dzahn@cumin1002 makevm (PID 3969838) is awaiting input [17:41:10] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host zuul1001.eqiad.wmnet with OS bookworm [17:41:16] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10840589 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul1001.eqiad.wmnet with... [17:41:29] (03CR) 10Bking: [C:03+2] elastic/cirrussearch: silence eqiad, improve alert routing [puppet] - 10https://gerrit.wikimedia.org/r/1148402 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:43:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146631 (https://phabricator.wikimedia.org/T341775) (owner: 10Jforrester) [17:43:43] (03Abandoned) 10Jforrester: TransformHandler: Return 400 for invalid titles [core] (wmf/1.44.0-wmf.28) - 10https://gerrit.wikimedia.org/r/1145913 (https://phabricator.wikimedia.org/T394270) (owner: 10Jforrester) [17:43:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145867 (https://phabricator.wikimedia.org/T394270) (owner: 10Máté Szabó) [17:44:33] (03Abandoned) 10Jforrester: Update incorrect PHP namespace in BundleSizeTest [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1147060 (https://phabricator.wikimedia.org/T373017) (owner: 10Reedy) [17:45:01] !log moving links from old to new linecard cr2-codfw slot 1 to slot 0 T393552 [17:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:04] T393552: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552 [17:46:19] FIRING: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr2-codfw (10.192.254.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [17:48:09] RESOLVED: [4x] CoreBGPDown: Core 
BGP session down between ssw1-a8-codfw and cr2-codfw (10.192.254.6) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [17:49:31] (03PS1) 10Brouberol: Add a connection for kafka-jumbo-eqiad to be used by systems outside Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148408 (https://phabricator.wikimedia.org/T386862) [17:50:52] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cirrussearch1087.eqiad.wmnet with reason: eqiad is depooled, noisy alerts [17:51:03] 06SRE, 10Data-Platform-SRE (2025.05.02 - 2025.05.23), 13Patch-For-Review: Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10840637 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=55d264b1-8db7-4900-adde-2a... [17:51:16] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate lonelypages job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148391 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [17:51:19] FIRING: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr2-codfw (10.192.254.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [17:52:30] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1088 to cirrussearch1088 [17:52:32] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul1001.eqiad.wmnet with reason: host reimage [17:52:44] !log bking@cumin2002 START - Cookbook sre.dns.netbox [17:54:31] (03PS1) 10Bernard Wang: Enable empty search recommendations on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 [17:55:16] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - 
https://phabricator.wikimedia.org/T393873#10840651 (10Dzahn) [17:56:09] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: host reimage [17:56:19] FIRING: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr2-codfw (10.192.254.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [17:58:11] (03PS2) 10Bernard Wang: Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 [17:58:12] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10840660 (10Dzahn) [17:58:20] bking@cumin2002 rename (PID 2875053) is awaiting input [17:58:29] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:58:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [17:58:58] (03CR) 10CI reject: [V:04-1] Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [17:59:02] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10840672 (10Dzahn) [18:00:29] (03CR) 10Jdlrobson: [C:04-1] Enable empty search recommendations on beta cluster and testwiki (032 comments) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [18:03:17] (03CR) 10BCornwall: [V:03+1] "tests are still happy" [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [18:06:52] (03CR) 10Eevans: [C:03+2] aqs: cleanup Cassandra roles & grants [puppet] - 10https://gerrit.wikimedia.org/r/1147026 (https://phabricator.wikimedia.org/T313877) (owner: 10Eevans) [18:07:26] (03PS1) 10Dzahn: site: broaden regex for zuul hosts to [12]00[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1148411 (https://phabricator.wikimedia.org/T393873) [18:07:40] (03CR) 10CI reject: [V:04-1] site: broaden regex for zuul hosts to [12]00[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1148411 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [18:08:29] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:08:46] (03PS2) 10Dzahn: site: broaden regex for zuul hosts to [12]00[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1148411 (https://phabricator.wikimedia.org/T393873) [18:09:39] !log denisse@deploy1003 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.5.0 - T394750 [18:09:56] !log denisse@deploy1003 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to 25.5.0 - T394750 (duration: 00m 17s) [18:10:40] (03CR) 10Dzahn: [C:03+2] site: broaden regex for zuul hosts to [12]00[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1148411 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [18:11:09] (03CR) 10Volans: "With this latest version the change is a noop for everyone except homer that is the only one that uses the remote_name argument." 
[puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [18:11:19] RESOLVED: [3x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-b1-codfw and cr2-codfw (10.192.254.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [18:12:48] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Upgrading to Java 11.0.27 - eevans@cumin1002 [18:13:52] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10840704 (10cmooney) [18:14:25] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul1001.eqiad.wmnet with OS bookworm [18:14:25] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host zuul1001.eqiad.wmnet [18:14:36] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10840709 (10cmooney) [18:14:38] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 3 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10840708 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul1001.eqiad.wmnet with OS... [18:16:04] (03CR) 10Dzahn: [C:03+1] "since pywikipedia.org does not point to WMF NS and that's why other changes were reverted.. this seems in line with that" [puppet] - 10https://gerrit.wikimedia.org/r/1137463 (owner: 10Ncmonitor) [18:17:25] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1088 to cirrussearch1088 - bking@cumin2002" [18:18:25] (03CR) 10Mforns: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1148408 (https://phabricator.wikimedia.org/T386862) (owner: 10Brouberol) [18:18:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:18:52] (03CR) 10Brouberol: [C:03+2] Add a connection for kafka-jumbo-eqiad to be used by systems outside Kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148408 (https://phabricator.wikimedia.org/T386862) (owner: 10Brouberol) [18:20:31] bking@cumin2002 rename (PID 2875053) is awaiting input [18:21:14] !log cmooney@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site codfw [reason: repool codfw after core router maintenance, T393552] [18:21:18] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site codfw [reason: repool codfw after core router maintenance, T393552] [18:21:18] T393552: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552 [18:21:39] !log repool codfw in dns after core router maintenance T393552 [18:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:25:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10840770 (10cmooney) 05Open→03Resolved a:03cmooney Ok this is now complete. A few niggles along the way that were sorted out with multiple re-seat... 
[18:25:36] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1088 to cirrussearch1088 - bking@cumin2002" [18:25:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:25:37] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1088 on all recursors [18:25:40] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1088 on all recursors [18:25:41] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1088 [18:29:02] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1088 [18:29:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1088 to cirrussearch1088 [18:30:29] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1088.eqiad.wmnet with OS bullseye [18:30:33] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1088 [18:30:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1088 [18:43:29] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:45:06] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1088.eqiad.wmnet with reason: host reimage [18:46:43] !log aokoth@cumin1002 START - Cookbook sre.ganeti.makevm for new host doc1004.eqiad.wmnet [18:46:44] !log aokoth@cumin1002 START - Cookbook sre.dns.netbox [18:47:25] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2018.codfw.wmnet with OS bookworm [18:47:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10840908 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2018.codfw.wmnet with OS bookworm [18:47:40] (03CR) 10Fabfur: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1148337 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [18:49:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1088.eqiad.wmnet with reason: host reimage [18:49:59] !log aokoth@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc1004.eqiad.wmnet - aokoth@cumin1002" [18:50:08] !log aokoth@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doc1004.eqiad.wmnet - aokoth@cumin1002" [18:50:08] !log aokoth@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:50:08] !log aokoth@cumin1002 START - Cookbook sre.dns.wipe-cache doc1004.eqiad.wmnet on all recursors [18:50:12] !log aokoth@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doc1004.eqiad.wmnet on all recursors [18:50:41] !log aokoth@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doc1004.eqiad.wmnet - aokoth@cumin1002" [18:50:46] !log aokoth@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doc1004.eqiad.wmnet - aokoth@cumin1002" [18:51:41] (03PS1) 10Jforrester: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) [18:52:09] !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host doc1004.eqiad.wmnet with OS bookworm [18:55:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10840954 (10cmooney) 05Resolved→03Open Actually there are a few bits like the license and the inventory items in Netbox to be completed which I'll take o... [18:55:06] (03CR) 10Fabfur: [C:03+1] liberica: Don't deploy ipip-multiqueue-optimizer with katran [puppet] - 10https://gerrit.wikimedia.org/r/1148337 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [18:55:37] (03PS1) 10Jforrester: Wikifunctions: Enable Wikifunction client mode on the first five Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148423 (https://phabricator.wikimedia.org/T390552) [19:02:50] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on doc1004.eqiad.wmnet with reason: host reimage [19:03:11] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host zuul1002.eqiad.wmnet [19:03:12] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [19:05:27] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doc1004.eqiad.wmnet with reason: host reimage [19:06:44] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[19:06:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [19:08:55] dzahn@cumin1002 makevm (PID 3982637) is awaiting input [19:10:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1088.eqiad.wmnet with OS bullseye [19:11:08] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 6 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [19:11:44] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [19:11:50] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [19:12:31] jhancock@cumin2002 reimage (PID 2900198) is awaiting input [19:15:29] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul1002.eqiad.wmnet - dzahn@cumin1002" [19:15:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:16:50] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add records for VM zuul1002.eqiad.wmnet - dzahn@cumin1002" [19:16:50] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:16:50] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache zuul1002.eqiad.wmnet on all recursors [19:16:54] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zuul1002.eqiad.wmnet on all recursors [19:17:24] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul1002.eqiad.wmnet - dzahn@cumin1002" [19:17:30] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul1002.eqiad.wmnet - dzahn@cumin1002" [19:18:33] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host zuul1002.eqiad.wmnet with OS bookworm [19:19:40] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10841161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host zuul1002.eqiad.wmnet with... 
[19:20:23] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doc1004.eqiad.wmnet with OS bookworm [19:20:23] !log aokoth@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doc1004.eqiad.wmnet [19:23:20] (03CR) 10JHathaway: [C:03+1] sshd: Remove dead template argument [puppet] - 10https://gerrit.wikimedia.org/r/1148342 (https://phabricator.wikimedia.org/T393762) (owner: 10Muehlenhoff) [19:27:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 15.6% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:27:24] 06SRE, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640#10841211 (10bking) 05In progress→03Resolved @jcrespo Thanks for the suggestion! I merged the above CR and reimaged `c... [19:28:29] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:29:33] (03PS1) 10Dzahn: lists: include nftables throttling profile [puppet] - 10https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) [19:32:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 15.6% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:33:05] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul1002.eqiad.wmnet with reason: host reimage [19:36:49] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1002.eqiad.wmnet with reason: host reimage [19:37:03] (03PS1) 10Dzahn: lists: add parameter and code to block abusers using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1148433 
(https://phabricator.wikimedia.org/T394519) [19:38:29] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:40:31] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: T383173 [19:40:35] T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173 [19:40:43] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host zuul2002.codfw.wmnet [19:40:44] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [19:44:15] (03PS1) 10Volans: setup.py: add support up to Python 3.13 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148434 [19:45:29] (03CR) 10Volans: "Let me know if you think I should make a new release or being the change just metadata doesn't warrant one." [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148434 (owner: 10Volans) [19:45:42] (03PS2) 10Volans: setup.py: add support up to Python 3.13 [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148434 [19:46:00] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul2002.codfw.wmnet - dzahn@cumin1002" [19:46:49] (03CR) 10Volans: [C:04-1] "Pending I9f67e54f828e8804fff738ddf4a88029fac23699 and also a decision if we should make a release or not right now." 
[software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1148367 (owner: 10Volans) [19:48:44] (03PS1) 10Cwhite: ci: uninstall statsite from ci hosts [puppet] - 10https://gerrit.wikimedia.org/r/1148435 (https://phabricator.wikimedia.org/T205870) [19:48:45] (03PS1) 10Cwhite: ci: clean up statsite includes [puppet] - 10https://gerrit.wikimedia.org/r/1148436 (https://phabricator.wikimedia.org/T205870) [19:49:04] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul2002.codfw.wmnet - dzahn@cumin1002" [19:49:04] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:49:04] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache zuul2002.codfw.wmnet on all recursors [19:49:08] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zuul2002.codfw.wmnet on all recursors [19:49:39] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul2002.codfw.wmnet - dzahn@cumin1002" [19:49:44] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul2002.codfw.wmnet - dzahn@cumin1002" [19:52:05] (03PS4) 10Andrew Bogott: Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) [19:52:05] (03PS4) 10Andrew Bogott: Remove mention of cloudvirt103[1-9].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1147883 (https://phabricator.wikimedia.org/T394727) [19:52:08] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul1002.eqiad.wmnet with OS bookworm [19:52:08] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host zuul1002.eqiad.wmnet [19:52:20] 06SRE, 
06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10841286 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul1002.eqiad.wmnet with OS... [19:52:45] dzahn@cumin1002 makevm (PID 3988586) is awaiting input [19:53:13] (03CR) 10CI reject: [V:04-1] Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) (owner: 10Andrew Bogott) [19:55:11] (03PS5) 10Andrew Bogott: Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) [19:55:11] (03PS5) 10Andrew Bogott: Remove mention of cloudvirt103[1-9].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1147883 (https://phabricator.wikimedia.org/T394727) [19:56:18] (03CR) 10CI reject: [V:04-1] Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) (owner: 10Andrew Bogott) [19:58:50] (03PS6) 10Andrew Bogott: Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) [19:58:50] (03PS6) 10Andrew Bogott: Remove mention of cloudvirt103[1-9].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1147883 (https://phabricator.wikimedia.org/T394727) [19:59:43] I'm happy to deploy, unless someone really wants to (I have three backports for logspam, one of which has i18n). [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T2000). [20:00:05] ZhaoFJx, Jdlrobson, and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] Here I am [20:00:23] (03CR) 10Andrew Bogott: [C:03+2] Prepare cloudvirt103[1-9] for decom [puppet] - 10https://gerrit.wikimedia.org/r/1147880 (https://phabricator.wikimedia.org/T394727) (owner: 10Andrew Bogott) [20:00:37] ZhaoFJx: I'll do your config patch alongside Jdlrobson's two. [20:00:52] Sure, thanks [20:01:07] Oh, wait, Jon's C-1'ed 1148409; I'll skip that for now. [20:01:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146985 (https://phabricator.wikimedia.org/T394505) (owner: 10ZhaoFJx) [20:01:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148403 (https://phabricator.wikimedia.org/T392008) (owner: 10Jdlrobson) [20:01:21] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host zuul1003.eqiad.wmnet [20:01:23] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [20:01:26] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host zuul2002.codfw.wmnet with OS bookworm [20:01:37] (03CR) 10Jforrester: "Skipping from backport window for now due to outstanding C-1." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:02:14] (03Merged) 10jenkins-bot: Add zh, en, and meta to zh_arbcom import sources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146985 (https://phabricator.wikimedia.org/T394505) (owner: 10ZhaoFJx) [20:02:20] (03Merged) 10jenkins-bot: Enable ReadingList beta feature on test.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148403 (https://phabricator.wikimedia.org/T392008) (owner: 10Jdlrobson) [20:02:40] (03PS3) 10Bernard Wang: Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 [20:02:43] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1146985|Add zh, en, and meta to zh_arbcom import sources (T394505)]], [[gerrit:1148403|Enable ReadingList beta feature on test.wikipedia.org (T392008)]] [20:02:47] (03PS1) 10Dzahn: wikimedia.org: add gerrit-ssh, gerrit-replica-ssh records [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) [20:02:48] T394505: Change import sources on wikipedia-zh-arbcom - https://phabricator.wikimedia.org/T394505 [20:02:48] T392008: Implement Empty search recommendations for Vector - https://phabricator.wikimedia.org/T392008 [20:03:22] (03CR) 10CI reject: [V:04-1] wikimedia.org: add gerrit-ssh, gerrit-replica-ssh records [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [20:03:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:32] (03CR) 10CI reject: [V:04-1] Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:03:36] (03PS2) 10Dzahn: wikimedia.org: add gerrit-ssh, gerrit-replica-ssh records [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) [20:04:08] (03CR) 10CI reject: [V:04-1] wikimedia.org: add gerrit-ssh, gerrit-replica-ssh records [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [20:04:36] here [20:04:54] !log jforrester@deploy1003 zhaofjx, jdlrobson, jforrester: Backport for [[gerrit:1146985|Add zh, en, and meta to zh_arbcom import sources (T394505)]], [[gerrit:1148403|Enable ReadingList beta feature on test.wikipedia.org (T392008)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:04:55] Hey Jdlrobson, I see that 1148409 is still being worked on? [20:04:59] I was hoping to do my backport myself but I don't seem to have access. @thcipriani I pinged you in Slack. [20:05:16] James_F: I just need to push a new change for 1148409 [20:05:17] ZhaoFJx: If you can verify the patch on debug, that'd be great. [20:05:26] I will check now [20:05:38] Jdlrobson: Ack. Can you first check that 1148403 is OK on debug? 
[20:05:47] (working on that with Bernard right now) [20:05:48] James_F: sure [20:05:51] <3 [20:06:03] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1057 to cirrussearch1057 [20:06:20] (03PS3) 10Dzahn: wikimedia.org: add gerrit-ssh, gerrit-replica-ssh records [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) [20:06:28] James_F Checked, all set and perfect [20:06:31] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:06:36] ZhaoFJx: Excellent, thank you! [20:06:48] James_F Thank you too! [20:06:51] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul1003.eqiad.wmnet - dzahn@cumin1002" [20:07:45] James_F: 1148403 is on debug? I'm not seeing it on beta features page at https://test.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures but perhaps that doesn't work with debug? [20:07:59] Jdlrobson: It is. Did you add the new beta feature to the allow list in config? [20:08:06] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1031.eqiad.wmnet [20:08:09] James_F: argg i always forget that! [20:08:20] Jdlrobson: Aha. OK, I'll continue with deploy and we can do that in a sec. [20:08:22] !log jforrester@deploy1003 zhaofjx, jdlrobson, jforrester: Continuing with sync [20:08:28] James_F: so yeh i think this is safe to sync and I can follow up with that once I remember how :) [20:08:39] Thanks! Glad you are here that would have cost me 20 mins or so [20:09:13] Jdlrobson: :-) [20:09:18] readinglistsbeta? [20:09:19] (03CR) 10Jdlrobson: Enable empty search recommendations on beta cluster and testwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:09:23] I'll write a quick patch for you. 
[20:09:48] (03PS4) 10Jdlrobson: Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:09:55] (03CR) 10Jdlrobson: [C:03+1] Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:09:58] dzahn@cumin1002 makevm (PID 3993399) is awaiting input [20:10:19] James_F: <3 [20:10:51] (03CR) 10CI reject: [V:04-1] Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:10:52] Is T392008 the right task? [20:10:52] T392008: Implement Empty search recommendations for Vector - https://phabricator.wikimedia.org/T392008 [20:11:19] (It's not tagged against ReadingLists.) [20:12:09] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1057 to cirrussearch1057 - bking@cumin2002" [20:12:32] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1057 to cirrussearch1057 - bking@cumin2002" [20:12:32] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:12:33] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1057 on all recursors [20:12:36] (03PS1) 10Jforrester: Add the ReadingLists beta feature to the allow list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148441 (https://phabricator.wikimedia.org/T392008) [20:12:36] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1057 on all recursors [20:12:37] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1057 [20:13:16] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
pc2018.codfw.wmnet with OS bookworm [20:13:28] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10841371 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2018.codfw.wmnet with OS bookworm executed with errors: - pc2018 (**FAIL**... [20:13:57] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul1003.eqiad.wmnet - dzahn@cumin1002" [20:13:57] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:14:24] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1057 [20:14:31] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [20:15:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1057 to cirrussearch1057 [20:15:26] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1146985|Add zh, en, and meta to zh_arbcom import sources (T394505)]], [[gerrit:1148403|Enable ReadingList beta feature on test.wikipedia.org (T392008)]] (duration: 12m 43s) [20:15:31] T394505: Change import sources on wikipedia-zh-arbcom - https://phabricator.wikimedia.org/T394505 [20:16:07] (03CR) 10Jdlrobson: [C:03+1] Add the ReadingLists beta feature to the allow list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148441 (https://phabricator.wikimedia.org/T392008) (owner: 10Jforrester) [20:16:25] (03CR) 10Jforrester: Enable empty search recommendations on beta cluster and testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:16:32] @James_F i'm wondering now if there's a test I could add to catch this since it's happened again [20:16:48] (03PS5) 10Jdlrobson: Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:16:52] (03CR) 10Jdlrobson: Enable empty search recommendations on beta cluster and testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:17:15] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1057.eqiad.wmnet on all recursors [20:17:18] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1057.eqiad.wmnet on all recursors [20:17:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1057.eqiad.wmnet with OS bullseye [20:17:43] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1057 [20:17:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1057 [20:17:46] (03CR) 10CI reject: [V:04-1] Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:18:07] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul2002.codfw.wmnet with reason: host reimage [20:18:13] (03CR) 10Jforrester: Enable empty search recommendations on beta cluster and testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:18:20] (03PS3) 10Scott French: P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) [20:18:20] (03CR) 10Scott French: "Thanks in advance for the review! 
I'll merge this once I pilot the s6 shard via a manual run and confirm it works, likely first thing on Wedn" [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [20:19:40] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [20:20:12] dzahn@cumin1002 makevm (PID 3993399) is awaiting input [20:20:14] (03PS6) 10Jforrester: Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:21:17] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:21:49] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2002.codfw.wmnet with reason: host reimage [20:22:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148441 (https://phabricator.wikimedia.org/T392008) (owner: 10Jforrester) [20:22:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:22:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, May 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148441 (https://phabricator.wikimedia.org/T392008) (owner: 10Jforrester) [20:23:00] (03Merged) 10jenkins-bot: Add the ReadingLists beta feature to the allow list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148441 (https://phabricator.wikimedia.org/T392008) (owner: 10Jforrester) [20:23:35] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1031.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [20:23:41] !log 
andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1031.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [20:23:41] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:23:41] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1031.eqiad.wmnet [20:23:46] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [20:23:51] (03Merged) 10jenkins-bot: Enable empty search recommendations on beta cluster and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148409 (owner: 10Bernard Wang) [20:24:14] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1148441|Add the ReadingLists beta feature to the allow list (T392008)]], [[gerrit:1148409|Enable empty search recommendations on beta cluster and testwiki]] [20:24:18] T392008: Implement Empty search recommendations for Vector - https://phabricator.wikimedia.org/T392008 [20:25:06] bking@cumin2002 reimage (PID 2942769) is awaiting input [20:25:36] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1032.eqiad.wmnet [20:26:25] !log jforrester@deploy1003 jforrester, bwang: Backport for [[gerrit:1148441|Add the ReadingLists beta feature to the allow list (T392008)]], [[gerrit:1148409|Enable empty search recommendations on beta cluster and testwiki]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:26:29] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1057.eqiad.wmnet with OS bullseye [20:26:30] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:26:39] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [20:27:41] Jdlrobson: https://test.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures now has the pref. Though the image is not following the standard, the description is very short, and the info/discussion links go nowhere. :-( This is why it's meant to go through product review before deployment. [20:28:04] Jdlrobson: Other than that, can you check the other patch? [20:28:10] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1057.eqiad.wmnet with OS bullseye [20:28:15] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1057 [20:28:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1057 [20:29:09] James_F: we are only deploying on test.wikipedia.org for now [20:29:18] James_F: there is an OKR for this next quarter [20:29:39] Jdlrobson: Still production. The allow list is explicitly there to stop teams from deploying new Beta Features that disrupt the system by not following the rules. [20:29:48] But yeah, I won't revert. [20:30:27] Yep looks great! Please sync! [20:30:31] !log jforrester@deploy1003 jforrester, bwang: Continuing with sync [20:31:00] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:31:05] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [20:32:16] James_F: I'll pass on the feedback to Olga about the beta feature. [20:32:20] Thanks! 
[20:36:20] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch1057.eqiad.wmnet with OS bullseye [20:36:29] (03PS3) 10RLazarus: mediawiki: Allow varying the entrypoint through mwscript values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) [20:36:39] dzahn@cumin1002 makevm (PID 3993399) is awaiting input [20:36:43] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [20:37:27] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148441|Add the ReadingLists beta feature to the allow list (T392008)]], [[gerrit:1148409|Enable empty search recommendations on beta cluster and testwiki]] (duration: 13m 12s) [20:37:27] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul2002.codfw.wmnet with OS bookworm [20:37:27] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host zuul2002.codfw.wmnet [20:37:30] T392008: Implement Empty search recommendations for Vector - https://phabricator.wikimedia.org/T392008 [20:37:32] Finally! Now onto the backports. 
[20:37:33] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:37:35] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host zuul1003.eqiad.wmnet [20:37:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145867 (https://phabricator.wikimedia.org/T394270) (owner: 10Máté Szabó) [20:37:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146631 (https://phabricator.wikimedia.org/T341775) (owner: 10Jforrester) [20:37:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/LiquidThreads] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146654 (https://phabricator.wikimedia.org/T394025) (owner: 10Jforrester) [20:39:25] (03CR) 10RLazarus: mediawiki: Allow varying the entrypoint through mwscript values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [20:40:07] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1032.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [20:40:32] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1032.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [20:40:32] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:40:33] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1032.eqiad.wmnet [20:40:57] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for 
hosts cloudvirt1031.eqiad.wmnet [20:45:21] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [20:45:37] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1057.eqiad.wmnet with OS bullseye [20:45:42] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1057 [20:45:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1057 [20:45:45] (03PS1) 10RLazarus: deployment_server: Add --dblist to mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479) [20:46:04] Thanks James_F for your help today. [20:46:11] Jdlrobson: Of course! [20:51:02] andrew@cumin1002 decommission (PID 3999960) is awaiting input [20:54:30] (03Merged) 10jenkins-bot: TransformHandler: Return 400 for invalid titles [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1145867 (https://phabricator.wikimedia.org/T394270) (owner: 10Máté Szabó) [20:54:32] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into wmf_deploy [extensions/CentralNotice] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146631 (https://phabricator.wikimedia.org/T341775) (owner: 10Jforrester) [20:54:35] (03Merged) 10jenkins-bot: Xml::input, label: Replace usage with Html::input, label [extensions/LiquidThreads] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1146654 (https://phabricator.wikimedia.org/T394025) (owner: 10Jforrester) [20:55:03] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1145867|TransformHandler: Return 400 for invalid titles (T394270)]], [[gerrit:1146631|Merge remote-tracking branch 'origin/master' into wmf_deploy (T341775 T373017 T393122 T394404)]], [[gerrit:1146654|Xml::input, label: Replace usage with Html::input, label (T394025)]] [20:55:13] T394270: LogicException: Title not found! 
- https://phabricator.wikimedia.org/T394270 [20:55:13] T341775: Discourage, deprecate and stop using Xml methods for building HTML markup - https://phabricator.wikimedia.org/T341775 [20:55:14] T373017: CI PerformanceBudgetTest fails on GrowthExperiments master branch with 0.5kB difference - https://phabricator.wikimedia.org/T373017 [20:55:14] T393122: Make PHPUnit dataProvider on BundleSizeTestBase static - https://phabricator.wikimedia.org/T393122 [20:55:14] T394404: PHP Deprecated: Use of MediaWiki\Xml\Xml::radio was deprecated in MediaWiki 1.42. [Called from SpecialCentralNoticeLogs::getLogSwitcher] - https://phabricator.wikimedia.org/T394404 [20:55:15] T394025: PHP Deprecated: Use of MediaWiki\Xml\Xml::input was deprecated in MediaWiki 1.42. [Called from MediaWiki\Xml\Xml::inputLabelSep] - https://phabricator.wikimedia.org/T394025 [20:57:41] !log jforrester@deploy1003 mszabo, jforrester: Backport for [[gerrit:1145867|TransformHandler: Return 400 for invalid titles (T394270)]], [[gerrit:1146631|Merge remote-tracking branch 'origin/master' into wmf_deploy (T341775 T373017 T393122 T394404)]], [[gerrit:1146654|Xml::input, label: Replace usage with Html::input, label (T394025)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:58:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:22] !log jforrester@deploy1003 mszabo, jforrester: Continuing with sync [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250520T2100) [21:00:09] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cirrussearch1057.eqiad.wmnet with OS bullseye [21:00:52] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1057.eqiad.wmnet with OS bullseye [21:00:56] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1057 [21:00:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1057 [21:05:08] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1031.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [21:05:13] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1031.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [21:05:13] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:05:14] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt1031.eqiad.wmnet [21:05:24] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:31] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1145867|TransformHandler: Return 400 for invalid titles (T394270)]], [[gerrit:1146631|Merge remote-tracking branch 'origin/master' into wmf_deploy (T341775 T373017 T393122 T394404)]], [[gerrit:1146654|Xml::input, label: Replace usage with Html::input, label (T394025)]] (duration: 11m 28s) [21:06:40] T394270: LogicException: Title not found! - https://phabricator.wikimedia.org/T394270 [21:06:40] T341775: Discourage, deprecate and stop using Xml methods for building HTML markup - https://phabricator.wikimedia.org/T341775 [21:06:41] T373017: CI PerformanceBudgetTest fails on GrowthExperiments master branch with 0.5kB difference - https://phabricator.wikimedia.org/T373017 [21:06:42] T393122: Make PHPUnit dataProvider on BundleSizeTestBase static - https://phabricator.wikimedia.org/T393122 [21:06:42] T394404: PHP Deprecated: Use of MediaWiki\Xml\Xml::radio was deprecated in MediaWiki 1.42. [Called from SpecialCentralNoticeLogs::getLogSwitcher] - https://phabricator.wikimedia.org/T394404 [21:06:42] T394025: PHP Deprecated: Use of MediaWiki\Xml\Xml::input was deprecated in MediaWiki 1.42. 
[Called from MediaWiki\Xml\Xml::inputLabelSep] - https://phabricator.wikimedia.org/T394025 [21:07:28] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:10:47] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10841661 (10Dzahn) [21:12:00] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10841664 (10Dzahn) [21:14:30] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10841676 (10RobH) @Jhancock.wm, Can we swap out the 10G DAC on this host over to a CAT6 with an SFP-T into the same port, and then redo the net... 
[21:16:04] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host zuul2003.codfw.wmnet [21:16:06] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [21:18:52] (03PS4) 10RLazarus: mediawiki: Allow varying the entrypoint through mwscript values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) [21:19:20] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul2003.codfw.wmnet - dzahn@cumin1002" [21:19:26] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul2003.codfw.wmnet - dzahn@cumin1002" [21:19:26] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:19:26] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache zuul2003.codfw.wmnet on all recursors [21:19:29] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zuul2003.codfw.wmnet on all recursors [21:20:02] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul2003.codfw.wmnet - dzahn@cumin1002" [21:20:08] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul2003.codfw.wmnet - dzahn@cumin1002" [21:21:01] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host zuul2003.codfw.wmnet with OS bookworm [21:21:13] (03CR) 10BryanDavis: [C:03+1] "If I understand this correctly it would allow me to setup a hiera key for deployment-prep like `profile::cache::varnish::frontend::block_h" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [21:26:41] FIRING: [4x] ConfdResourceFailed: confd 
resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:27:08] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:28:51] (03CR) 10Cwhite: [C:03+2] ci: uninstall statsite from ci hosts [puppet] - 10https://gerrit.wikimedia.org/r/1148435 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [21:29:52] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10841827 (10cmooney) >>! In T394847#10841676, @RobH wrote: > @Jhancock.wm, > > Can we swap out the 10G DAC on this host over to a CAT6... [21:30:08] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:33:08] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:35:08] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:38:07] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul2003.codfw.wmnet with reason: host reimage [21:41:26] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul2003.codfw.wmnet with reason: host reimage [21:41:30] (03CR) 10Scott French: [C:03+1] "Looks good! This is a noop as-is and should be forward compatible with the latest version of your dependent change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1147917 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [21:43:23] 06SRE, 10DNS, 06serviceops, 06Traffic: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10841877 (10Dzahn) [21:43:31] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [21:48:34] 06SRE, 10DNS, 06serviceops, 06Traffic: Create redirect from tj.*.org to tg.*.org - https://phabricator.wikimedia.org/T393803#10841893 (10Dzahn) Added serviceops because this would have to be added in Apache/appserver redirects (as opposed to redirecting entire domains in ncredir service, which would be tra... [21:58:40] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul2003.codfw.wmnet with OS bookworm [21:58:40] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host zuul2003.codfw.wmnet [22:00:39] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host zuul1003.eqiad.wmnet [22:00:40] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [22:05:24] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:05:43] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul1003.eqiad.wmnet - dzahn@cumin1002" [22:05:58] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10841941 (10Dzahn) [22:06:16] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
cirrussearch1057.eqiad.wmnet with OS bullseye [22:07:28] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:07:41] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM zuul1003.eqiad.wmnet - dzahn@cumin1002" [22:07:41] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:07:41] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache zuul1003.eqiad.wmnet on all recursors [22:07:44] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zuul1003.eqiad.wmnet on all recursors [22:08:14] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul1003.eqiad.wmnet - dzahn@cumin1002" [22:08:20] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM zuul1003.eqiad.wmnet - dzahn@cumin1002" [22:08:29] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:26] dzahn@cumin1002 makevm (PID 4011770) is awaiting input [22:15:23] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host zuul1003.eqiad.wmnet with OS bookworm [22:15:35] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10841950 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
dzahn@cumin1002 for host zuul1003.eqiad.wmnet with... [22:18:18] (03PS2) 10Reedy: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) [22:19:26] (03PS3) 10Reedy: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) [22:27:44] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul1003.eqiad.wmnet with reason: host reimage [22:32:10] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1003.eqiad.wmnet with reason: host reimage [22:32:54] (03PS2) 10Andrea Denisse: grafana: Remove stunnel from quickdatacopy sync [puppet] - 10https://gerrit.wikimedia.org/r/1148468 (https://phabricator.wikimedia.org/T393738) [22:32:55] (03CR) 10Andrea Denisse: [V:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1148468/5629/" [puppet] - 10https://gerrit.wikimedia.org/r/1148468 (https://phabricator.wikimedia.org/T393738) (owner: 10Andrea Denisse) [22:35:54] (03CR) 10Dzahn: [C:03+1] "yea, if the data is not private.. then this seems by far the easiest fix for the linked ticket.. 
just use plain rsync without the encrypti" [puppet] - 10https://gerrit.wikimedia.org/r/1148468 (https://phabricator.wikimedia.org/T393738) (owner: 10Andrea Denisse) [22:43:29] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [22:46:14] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 84009MiB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops [22:47:29] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul1003.eqiad.wmnet with OS bookworm [22:47:29] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host zuul1003.eqiad.wmnet [22:47:40] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for zuul3+ - https://phabricator.wikimedia.org/T393873#10842034 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host zuul1003.eqiad.wmnet with OS... 
[23:07:04] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:07:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:09:04] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53942 bytes in 9.260 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:09:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.188 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:15:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:16:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:17:04] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:17:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:17:54] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:27:13] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868 (10Papaul) 03NEW [23:28:29] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - 
https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:36:21] (CR) Scott French: "Thanks, @brouberol@wikimedia.org!" [puppet] - https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: Brouberol) [23:38:29] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1148483 [23:38:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1148483 (owner: TrainBranchBot) [23:39:48] (PS4) RLazarus: deployment_server: Pass mwscript.command in mwscript-k8s values [puppet] - https://gerrit.wikimedia.org/r/1147917 (https://phabricator.wikimedia.org/T378479) [23:39:48] (PS2) RLazarus: deployment_server: Add --dblist to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479) [23:43:44] (CR) RLazarus: [C:+2] deployment_server: Pass mwscript.command in mwscript-k8s values (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1147917 (https://phabricator.wikimedia.org/T378479) (owner: RLazarus) [23:51:09] (PS2) Scott French: alertmanager: add receiver and routing for MediaWiki-File-management tasks [puppet] - https://gerrit.wikimedia.org/r/1148485 (https://phabricator.wikimedia.org/T385868) [23:51:30] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1148483 (owner: TrainBranchBot) [23:52:26] (CR) Scott French: [C:+1] "Aside from resolving the TODO, LGTM! 
Just sent you [0] for that if you feel like merging that at the same time as this :)" [puppet] - https://gerrit.wikimedia.org/r/1144507 (https://phabricator.wikimedia.org/T385868) (owner: Hnowlan) [23:53:42] (CR) RLazarus: [C:+2] mediawiki: Allow varying the entrypoint through mwscript values [deployment-charts] - https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) (owner: RLazarus) [23:54:04] (PS1) Scott French: Profile::Mediawiki_deployment: add 'clusters' field [puppet] - https://gerrit.wikimedia.org/r/1148480 (https://phabricator.wikimedia.org/T388761) [23:56:53] (Merged) jenkins-bot: mediawiki: Allow varying the entrypoint through mwscript values [deployment-charts] - https://gerrit.wikimedia.org/r/1147918 (https://phabricator.wikimedia.org/T378479) (owner: RLazarus)