[00:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:37] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1148488
[00:08:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1148488 (owner: TrainBranchBot)
[00:10:18] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:10:58] SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10842269 (thcipriani) Stalled→Open >>! In T393723#10805887, @Eevans wrote: > @Jdlrobson-WMF this seems like an odd question after all this time, but have you signed {L3}? > > And,...
[00:11:54] SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10842274 (thcipriani) a:Jdlrobson-WMF→None
[00:12:15] SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10842275 (thcipriani)
[00:12:19] jouncebot: nowandnext
[00:12:19] No deployments scheduled for the next 5 hour(s) and 47 minute(s)
[00:12:19] In 5 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0600)
[00:12:53] scapping out a no-op chart version bump to clean up the diff, only meaningful to mw-script
[00:14:40] !log rzl@deploy1003 Started scap sync-world: 1147918
[00:16:54] !log rzl@deploy1003 Finished scap sync-world: 1147918 (duration: 03m 27s)
[00:19:30] PROBLEM - Hadoop NodeManager on an-worker1206 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:20:26] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1033.eqiad.wmnet
[00:20:50] PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:21:22] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1034.eqiad.wmnet
[00:21:50] RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:22:12] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1035.eqiad.wmnet
[00:25:15] !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:25:30] RECOVERY - Hadoop NodeManager on an-worker1206 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:25:59] SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10842303 (Jdlrobson-WMF) For clarity, I signed with this account on phab ( @Jdlrobson-WMF ) {F60326325}
[00:28:27] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1148488 (owner: TrainBranchBot)
[00:30:17] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1034.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:30:26] !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:30:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1034.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:30:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:30:40] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt1034.eqiad.wmnet
[00:33:07] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:33:07] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1035.eqiad.wmnet
[00:33:20] PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:33:49] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1034.eqiad.wmnet
[00:33:52] !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:36:29] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:36:30] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1033.eqiad.wmnet
[00:38:47] !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:40:20] RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:41:22] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1036.eqiad.wmnet
[00:41:35] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:41:36] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt1034.eqiad.wmnet
[00:46:14] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:46:20] !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:46:38] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1037.eqiad.wmnet
[00:49:56] (PS3) RLazarus: deployment_server: Add --dblist to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479)
[00:50:07] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1036.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:51:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1036.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:51:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:51:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1036.eqiad.wmnet
[00:52:13] !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:52:42] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1038.eqiad.wmnet
[00:56:28] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1037.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:57:10] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1037.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:57:10] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:57:11] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1037.eqiad.wmnet
[00:58:02] !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:58:29] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:20] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1039.eqiad.wmnet
[00:59:48] (PS4) RLazarus: deployment_server: Add --dblist to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479)
[01:01:45] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1038.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[01:02:21] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1038.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[01:02:22] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:02:22] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1038.eqiad.wmnet
[01:05:25] !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[01:08:05] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:08:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1039.eqiad.wmnet
[01:09:00] (CR) Andrew Bogott: [C:+2] Remove mention of cloudvirt103[1-9].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1147883 (https://phabricator.wikimedia.org/T394727) (owner: Andrew Bogott)
[01:12:55] ops-eqiad, cloud-services-team, DC-Ops, decommission-hardware, Patch-For-Review: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727#10842371 (Andrew)
[01:26:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:42:24] (CR) Scott French: [C:+1] "Thanks, Reuven! Two optional wording suggestions, but otherwise LGTM." [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479) (owner: RLazarus)
[01:51:20] (PS5) RLazarus: deployment_server: Add --dblist to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479)
[01:57:29] (CR) RLazarus: [C:+2] "Thanks for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479) (owner: RLazarus)
[02:08:29] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:20:34] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: OpenSent - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:43:29] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[02:56:49] (PS2) DLynch: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: Jforrester)
[03:04:24] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 3.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:15:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:28:29] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[03:38:29] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:55:40] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:55:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 139, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:56:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:01:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:40] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:10:46] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 140, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:11:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:53:50] PROBLEM - MariaDB Replica IO: s7 on clouddb1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3317 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:10] PROBLEM - MariaDB Replica Lag: s6 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86588.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:20] PROBLEM - MariaDB Replica Lag: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86591.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:22] PROBLEM - MariaDB Replica Lag: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86594.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:22] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86612.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:30] PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86620.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:30] PROBLEM - MariaDB Replica Lag: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86620.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:48] PROBLEM - MariaDB Replica Lag: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86626.76 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:55:22] PROBLEM - MariaDB Replica Lag: s7 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86654.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:55:22] PROBLEM - MariaDB Replica Lag: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86661.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:55:34] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1014.eqiad.wmnet with reason: Maintenance
[04:55:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1015.eqiad.wmnet with reason: Maintenance
[04:56:14] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1018.eqiad.wmnet with reason: Maintenance
[04:56:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1019.eqiad.wmnet with reason: Maintenance
[04:56:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Maintenance
[04:56:49] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[04:57:21] (CR) Arnaudb: [C:+1] wikimedia.org: add gerrit-ssh, gerrit-replica-ssh records [dns] - https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: Dzahn)
[04:58:29] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:07:27] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[05:07:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P76342 and previous config saved to /var/cache/conftool/dbconfig/20250521-050730-marostegui.json
[05:08:29] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:19:32] (CR) Marostegui: "Can you test this on db2186 and/or db2187?" [cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: Federico Ceratto)
[05:22:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db1169 with 10%', diff saved to https://phabricator.wikimedia.org/P76343 and previous config saved to /var/cache/conftool/dbconfig/20250521-052258-marostegui.json
[05:26:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:31:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase weight for db1169', diff saved to https://phabricator.wikimedia.org/P76344 and previous config saved to /var/cache/conftool/dbconfig/20250521-053116-marostegui.json
[05:58:57] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 22616
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0600)
[06:03:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 22616
[06:03:29] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:08:29] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:13:40] (PS1) Ayounsi: Revert "BFDdown: don't deploy in codfw" [alerts] - https://gerrit.wikimedia.org/r/1148501
[06:16:18] SRE, LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10842591 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[06:16:29] (CR) Ayounsi: [C:+2] Revert "BFDdown: don't deploy in codfw" [alerts] - https://gerrit.wikimedia.org/r/1148501 (owner: Ayounsi)
[06:17:55] (Merged) jenkins-bot: Revert "BFDdown: don't deploy in codfw" [alerts] - https://gerrit.wikimedia.org/r/1148501 (owner: Ayounsi)
[06:18:35] SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10842593 (ayounsi)
[06:18:45] (CR) Slyngshede: [C:+2] P:ldap::client::ldaptui Add missing aux schemas [puppet] - https://gerrit.wikimedia.org/r/1148279 (https://phabricator.wikimedia.org/T394341) (owner: Slyngshede)
[06:26:39] (Abandoned) Stang: Add main page on non-English privatewiki to wgWhitelistRead [mediawiki-config] - https://gerrit.wikimedia.org/r/850266 (https://phabricator.wikimedia.org/T321796) (owner: Stang)
[06:34:58] (PS1) Muehlenhoff: snapshot: Remove now unused Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1148751 (https://phabricator.wikimedia.org/T394647)
[06:43:29] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:44:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase weight for db1169', diff saved to https://phabricator.wikimedia.org/P76345 and previous config saved to /var/cache/conftool/dbconfig/20250521-064444-marostegui.json
[06:48:40] (CR) Ayounsi: "I'm not that familiar with this piece of code, but lgtm overall." [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[06:52:28] SRE, collaboration-services, Infrastructure-Foundations, vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10842631 (hashar)
[06:52:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[06:55:13] !log push pfw policies - T394728
[06:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase weight for db1169', diff saved to https://phabricator.wikimedia.org/P76346 and previous config saved to /var/cache/conftool/dbconfig/20250521-065618-marostegui.json
[06:58:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:29] (CR) Muehlenhoff: [C:+2] Enable the remaining two maps nodes as replicas [puppet] - https://gerrit.wikimedia.org/r/1148351 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:01:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase weight for db1169', diff saved to https://phabricator.wikimedia.org/P76347 and previous config saved to /var/cache/conftool/dbconfig/20250521-070156-marostegui.json
[07:05:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[07:06:57] (CR) Marostegui: [C:+1] sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) (owner: Federico Ceratto)
[07:08:20] (CR) Slyngshede: [C:+1] "Looks good to me. Bumping to "stable" seems reasonable :-)" [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1148434 (owner: Volans)
[07:09:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[07:11:20] (CR) Muehlenhoff: [C:+1] "Looks good, one nit inline" [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1148434 (owner: Volans)
[07:15:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:18:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[07:24:28] I am restarting Gerrit
[07:24:30] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1148267 (owner: Volans)
[07:24:53] 503s in gerrit.. oh I see...
[07:27:22] (PS1) Elukey: conftool-data: remove ml-serve1001 from lvs/pybal [puppet] - https://gerrit.wikimedia.org/r/1148787 (https://phabricator.wikimedia.org/T387854)
[07:27:24] (PS1) Elukey: role::ml_k8s::worker: set ml-serve1001 for Bookworm/containerd [puppet] - https://gerrit.wikimedia.org/r/1148788 (https://phabricator.wikimedia.org/T387854)
[07:27:28] (PS1) Elukey: conftool-data: add ml-serve1001 back to its pool after reimage [puppet] - https://gerrit.wikimedia.org/r/1148789 (https://phabricator.wikimedia.org/T387854)
[07:28:29] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:29:14] (CR) Elukey: role::ml_k8s::worker: set ml-serve1001 for Bookworm/containerd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1148788 (https://phabricator.wikimedia.org/T387854) (owner: Elukey)
[07:29:22] (CR) Marostegui: [C:+2] installserver: Do not reimage pc1018 [puppet] - https://gerrit.wikimedia.org/r/1148786 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[07:31:58] (CR) Elukey: [V:+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5630/" [puppet] - https://gerrit.wikimedia.org/r/1148788 (https://phabricator.wikimedia.org/T387854) (owner: Elukey)
[07:32:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1187 T394623', diff saved to https://phabricator.wikimedia.org/P76348 and previous config saved to /var/cache/conftool/dbconfig/20250521-073207-marostegui.json
[07:32:11] T394623: MariaDB 10.6.22 released - https://phabricator.wikimedia.org/T394623
[07:32:29] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[07:32:45] !log Install 10.6.22 on db1187 T394623
[07:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76349 and previous config saved to /var/cache/conftool/dbconfig/20250521-073336-root.json
[07:34:49] !log Move s5 codfw to SBR T383795
[07:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:52] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795
[07:36:42] (CR) Vgutierrez: haproxy: normalize host header (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) (owner: Fabfur)
[07:40:27] SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10842696 (Jelto) `s3://gitlab-packages` is empty after several hours (`s3cmd del --force --recursive s3://gitlab-packages/`). Usi...
[07:43:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[07:44:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[07:48:02] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:48:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76350 and previous config saved to /var/cache/conftool/dbconfig/20250521-074841-root.json
[07:48:57] (CR) Jelto: "I think we also need `PTR` records?" [dns] - https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: Dzahn)
[07:50:02] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:50:12] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.codfw.wmnet
[07:50:54] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS bullseye
[07:50:59] ops-eqiad, SRE, Data-Platform, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10842722 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo...
[07:51:56] (PS1) Giuseppe Lavagetto: robots.txt: add crawl-delay directive for semrushbot [mediawiki-config] - https://gerrit.wikimedia.org/r/1148791
[07:52:54] (PS1) Ayounsi: Add alerting for important switch interfaces [alerts] - https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641)
[07:53:02] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:53:29] RESOLVED: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:54:41] (CR) CI reject: [V:-1] Add alerting for important switch interfaces [alerts] - https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[07:56:02] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:56:24] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.codfw.wmnet
[07:56:38] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:56:56] (PS2) Ayounsi: Add alerting for important switch interfaces [alerts] - https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641)
[07:58:18] (PS1) Alexandros Kosiaris: staging-eqiad: Specify MTU of 1460 [puppet] - https://gerrit.wikimedia.org/r/1148795 (https://phabricator.wikimedia.org/T352956)
[07:59:34] (CR) Elukey: homer: make private repo support multiple peers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[07:59:54] (PS1) Jelto: gitlab: enable object storage for gitlab-artifacts in production [puppet] - https://gerrit.wikimedia.org/r/1148796 (https://phabricator.wikimedia.org/T378922)
[08:00:05] andre and jnuche: MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0800). Please do the needful.
[08:00:13] (03CR) 10Alexandros Kosiaris: [C:03+2] staging-eqiad: Specify MTU of 1460 [puppet] - 10https://gerrit.wikimedia.org/r/1148795 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris) [08:01:16] (03CR) 10Hashar: [C:03+1] "Looking at our manifests, the **sole usage** is `modules/homer/manifests/init.pp`:" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans) [08:01:54] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 145 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:02:59] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1148796 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:03:29] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76351 and previous config saved to /var/cache/conftool/dbconfig/20250521-080346-root.json [08:05:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [08:06:34] !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1016.eqiad.wmnet with reason: host reimage [08:10:17] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1016.eqiad.wmnet with reason: host reimage [08:13:08] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148800 (https://phabricator.wikimedia.org/T392172) [08:13:09] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.2 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148800 (https://phabricator.wikimedia.org/T392172) (owner: 10TrainBranchBot) [08:13:51] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10842850 (10jcrespo) gitlab-artifacts is failing quite a lot to backup- so many entries on the log with missing file. Unsure if due... [08:13:57] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148800 (https://phabricator.wikimedia.org/T392172) (owner: 10TrainBranchBot) [08:18:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76352 and previous config saved to /var/cache/conftool/dbconfig/20250521-081851-root.json [08:19:36] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [08:19:42] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10842866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo... [08:19:55] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [08:20:07] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10842870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo... [08:21:41] (03CR) 10Klausman: "Thanks for making this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148787 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:23:31] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.2 refs T392172 [08:23:35] T392172: 1.45.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T392172 [08:26:29] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1016.eqiad.wmnet with OS bullseye [08:26:35] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10842914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host kafka-jumbo1016... [08:27:42] (03PS2) 10Hnowlan: trafficserver: route testwiki reading lists APIs without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1148285 (https://phabricator.wikimedia.org/T384891) [08:29:03] (03CR) 10Vgutierrez: "no, it needs to be a new key called `block_help` under `profile::cache::varnish::frontend::fe_vcl_config`" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [08:29:30] !log disable puppet on thanos-fe1001 and thanos-fe1004 T391352 [08:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:34] T391352: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352 [08:29:44] (03CR) 10Klausman: [C:03+2] conftool-data: remove ml-serve1001 from lvs/pybal [puppet] - 10https://gerrit.wikimedia.org/r/1148787 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:29:46] (03CR) 10Klausman: [C:03+2] conftool-data: add ml-serve1001 back to its pool after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1148789 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:29:50] (03CR) 10Klausman: [C:03+2] role::ml_k8s::worker: set ml-serve1001 for 
Bookworm/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1148788 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [08:32:05] (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-05-21-082129-production [puppet] - 10https://gerrit.wikimedia.org/r/1148801 [08:32:28] (03CR) 10MVernon: [C:03+2] thanos: remove old frontends thanos-fe100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1148330 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon) [08:32:54] (03CR) 10Vgutierrez: [C:03+2] liberica: Don't deploy ipip-multiqueue-optimizer with katran [puppet] - 10https://gerrit.wikimedia.org/r/1148337 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez) [08:33:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76353 and previous config saved to /var/cache/conftool/dbconfig/20250521-083358-root.json [08:34:21] (03PS1) 10Gkyziridis: admin_ng/LiftWing: add edit-check namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148803 (https://phabricator.wikimedia.org/T394779) [08:34:27] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10842963 (10Jelto) >>! In T378922#10842850, @jcrespo wrote: > gitlab-artifacts is failing quite a lot to backup- so many entries on... 
[08:34:35] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1001.eqiad.wmnet [08:34:35] !log elukey@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host ml-serve1001.eqiad.wmnet [08:35:00] !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1017.eqiad.wmnet with reason: host reimage [08:35:46] (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-05-21-082129-production [puppet] - 10https://gerrit.wikimedia.org/r/1148801 (owner: 10Majavah) [08:38:25] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1017.eqiad.wmnet with reason: host reimage [08:38:30] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10842971 (10jcrespo) ` 21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067: Could not stat "/srv/gitlab-backup/artifacts/02/... 
[08:42:55] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm [08:43:33] !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on P{thanos-fe100[4-7]*} or P{thanos-fe2*} and (A:thanos-fe or A:thanos-fe-codfw or A:thanos-fe-eqiad) [08:44:45] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:44:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:47:28] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on P{thanos-fe100[4-7]*} or P{thanos-fe2*} and (A:thanos-fe or A:thanos-fe-codfw or A:thanos-fe-eqiad) [08:48:58] !log mvernon@cumin1002 START - Cookbook sre.hosts.decommission for hosts thanos-fe[1001-1003].eqiad.wmnet [08:49:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76354 and previous config saved to /var/cache/conftool/dbconfig/20250521-084904-root.json [08:50:42] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10843015 (10Jelto) Interesting, thank you. I think this has nothing to do with the ongoing work here. Bacula is trying to back up t... 
[08:53:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:53:48] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10843044 (10jcrespo) Thank you. I will try to separate those jobs to the dedicated storage hosts asap. > @jcrespo is there a fixed... [08:53:52] 06SRE, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: improve docker registry architecture - https://phabricator.wikimedia.org/T209271#10843046 (10elukey) 05Open→03Resolved a:03elukey [08:54:09] (03PS1) 10Jelto: gitlab: also exclude artifacts from partial backups [puppet] - 10https://gerrit.wikimedia.org/r/1148804 (https://phabricator.wikimedia.org/T378922) [08:54:12] 06SRE, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809#10843055 (10elukey) 05Open→03Resolved a:03elukey Already implemented. [08:54:52] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1017.eqiad.wmnet with OS bullseye [08:55:01] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843062 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host kafka-jumbo1017... 
[08:56:19] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T394894 (10MatthewVernon) 03NEW [08:57:05] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10843082 (10MatthewVernon) [08:57:18] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10843086 (10MatthewVernon) [08:59:07] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [09:00:09] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) (owner: 10Federico Ceratto) [09:02:26] !log mvernon@cumin1002 START - Cookbook sre.dns.netbox [09:02:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [09:02:59] !log cr2-eqdfw# set protocols bgp graceful-shutdown sender - T364092 [09:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:04] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [09:04:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76355 and previous config saved to /var/cache/conftool/dbconfig/20250521-090409-root.json [09:06:16] elukey@cumin1002 reimage (PID 4089853) is awaiting input [09:08:16] !log ayounsi@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr2-eqdfw,cr2-eqdfw IPv6,cr2-eqdfw.mgmt with reason: router upgrade [09:08:28] !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: 
thanos-fe[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1002" [09:08:29] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10843106 (10FCeratto-WMF) a:05FCeratto-WMF→03VRiley-WMF [09:09:44] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [09:09:52] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843111 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host kafka-jumbo1018... [09:09:59] elukey@cumin1002 reimage (PID 4089853) is awaiting input [09:10:12] (03PS2) 10Elukey: conftool-data: add ml-serve1001 back to its pool after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1148789 (https://phabricator.wikimedia.org/T387854) [09:10:12] (03PS1) 10Elukey: Remove ROCM version for ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/1148806 [09:10:49] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [09:10:59] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843113 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo... 
[09:11:04] (03PS1) 10David Caro: cloud: move images to use docker-registry.svc.t.o [puppet] - 10https://gerrit.wikimedia.org/r/1148808 [09:11:08] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10843114 (10MatthewVernon) >>! In T378922#10842696, @Jelto wrote: > `s3://gitlab-packages` is empty after several hours (`s3cmd del... [09:11:33] mvernon@cumin1002 decommission (PID 4090552) is awaiting input [09:11:34] !log ayounsi@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: router upgrade [09:11:37] (03CR) 10Elukey: [C:03+2] Remove ROCM version for ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/1148806 (owner: 10Elukey) [09:11:47] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10843115 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=048b70e3-25f1-4871-b6c8-5ea7b074de1e) set by ayounsi@cumin1002 for 2:00:00 on 2 host(s) and their servic... 
[09:12:04] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1001.eqiad.wmnet with OS bookworm [09:12:20] !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thanos-fe[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1002" [09:12:21] !log mvernon@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:12:21] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thanos-fe[1001-1003].eqiad.wmnet [09:12:29] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10843117 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1002 for hosts: `thanos-fe[1001-1003].eqiad.wmnet` - thanos-fe1001.eqiad.wmnet (**PASS**) - Downti... [09:12:38] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10843121 (10FCeratto-WMF) @VRiley-WMF I've been suggested to assign this task to you while we wait for the RMA, I hope you don't mind :) [09:13:07] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm [09:13:17] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve1001 [09:13:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve1001 [09:13:26] !log cr2-eqdfw - shutdown transit/ix BGP sessions - T364092 [09:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:29] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [09:15:56] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922#10843130 (10Jelto) >>! In T378922#10843114, @MatthewVernon wrote: >>>! In T378922#10842696, @Jelto wrote: >> `s3://gitlab-packages`... [09:16:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [09:17:57] (03CR) 10JMeybohm: "As with the service-catalog change I do not understand why this should be an active/passive (e.g. -ro) service. In my understanding this i" [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [09:19:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76356 and previous config saved to /var/cache/conftool/dbconfig/20250521-091914-root.json [09:21:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [09:22:24] !log cr2-eqdfw> request vmhost reboot - T364092 [09:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:28] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [09:22:59] now we wait [09:23:25] 😯 [09:24:04] (03CR) 10Hnowlan: [C:03+2] trafficserver: route testwiki reading lists APIs without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1148285 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [09:25:20] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1001.eqiad.wmnet with OS bookworm [09:25:41] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:41] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:41] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:49] PROBLEM - BFD status on cr2-esams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:25:49] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:49] PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:25:49] PROBLEM - BFD status on cr2-magru is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:25:49] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:50] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:50] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:26:04] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm [09:26:10] FIRING: [4x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:26:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:26:41] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [09:27:22] 
(03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1148808 (owner: 10David Caro) [09:30:05] (03CR) 10Arnaudb: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1148796 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [09:30:12] (03CR) 10Arnaudb: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1148804 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [09:30:55] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:31:02] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:31:10] FIRING: [5x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:31:18] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:31:31] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:31:39] FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqdfw (208.80.153.198) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:32:24] !log radosgw-admin bucket rm --bucket=gitlab-packages --bypass-gc --purge-objects T378922 [09:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:27] T378922: Migrate gitlab storage to apus (also: backups from S3?) 
- https://phabricator.wikimedia.org/T378922 [09:32:44] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:44] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:48] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:48] RECOVERY - BFD status on cr2-esams is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:32:48] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:48] RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:32:50] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:32:50] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:33:40] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:34:06] !log radosgw-admin bucket rm --bucket=gitlab-artifacts --bypass-gc --purge-objects T378922 [09:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:48] RECOVERY - BFD status on cr2-magru is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:36:10] RESOLVED: [5x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:36:39] RESOLVED: [7x] 
CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqdfw (208.80.153.198) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:36:41] (03PS1) 10Hnowlan: rest-gateway: fix typo in incoming URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148813 (https://phabricator.wikimedia.org/T384891) [09:38:06] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10843229 (10cmooney) @papaul one thing I noticed looking at the cables in Netbox from the new spine switches to the ones in row A-D is that they look like a straight patch? But I believe the... [09:38:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [09:38:51] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10843244 (10cmooney) 05Open→03Resolved License is now applied and inventory items updated for cr1-codfw and cr2-codfw. [09:38:54] (03PS1) 10Vgutierrez: hiera: Enable edge uniques on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1148814 (https://phabricator.wikimedia.org/T391411) [09:39:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148814 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [09:39:14] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 194, down: 12, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:40:59] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10843254 (10ayounsi) [09:41:16] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10843256 (10ayounsi) 05Open→03Resolved All done! Thank you all. 
[09:43:18] (03PS1) 10Majavah: openstack: cinder: Ensure cinder-volume service is running [puppet] - 10https://gerrit.wikimedia.org/r/1148815 [09:44:14] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate lonelypages job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148391 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [09:44:25] (03CR) 10CI reject: [V:04-1] openstack: cinder: Ensure cinder-volume service is running [puppet] - 10https://gerrit.wikimedia.org/r/1148815 (owner: 10Majavah) [09:44:51] (03PS2) 10Majavah: openstack: cinder: Ensure cinder-volume service is running [puppet] - 10https://gerrit.wikimedia.org/r/1148815 [09:46:30] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1148815 (owner: 10Majavah) [09:46:42] (03CR) 10Majavah: [C:03+2] openstack: cinder: Ensure cinder-volume service is running [puppet] - 10https://gerrit.wikimedia.org/r/1148815 (owner: 10Majavah) [09:46:51] (03PS3) 10Ayounsi: Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641) [09:48:51] (03CR) 10Hnowlan: [C:03+2] rest-gateway: fix typo in incoming URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148813 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [09:50:22] (03Merged) 10jenkins-bot: rest-gateway: fix typo in incoming URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148813 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan) [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:54:10] (03CR) 10Cathal Mooney: [C:03+1] Add alerting for important switch 
interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:56:26] (03PS2) 10Cathal Mooney: New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) [09:56:57] (03PS1) 10Gmodena: EventStreamConfig: add staging page_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) [09:57:12] (03PS2) 10Hnowlan: mw::maintenance: migrate cleanupUploadStash job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144507 (https://phabricator.wikimedia.org/T385868) [09:57:15] jouncebot: nowandnext [09:57:15] For the next 0 hour(s) and 2 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0800) [09:57:15] In 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1000) [09:57:23] (03CR) 10Ayounsi: [C:03+2] Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:52] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [09:58:16] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [09:58:48] (03Merged) 10jenkins-bot: Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641) 
(owner: 10Ayounsi) [09:58:56] (03PS1) 10Majavah: P:toolforge: legacy_redirector: Enable IPv6 monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1148819 (https://phabricator.wikimedia.org/T392506) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1000) [10:00:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [10:00:31] (03CR) 10Herron: [C:03+1] grafana: Remove stunnel from quickdatacopy sync [puppet] - 10https://gerrit.wikimedia.org/r/1148468 (https://phabricator.wikimedia.org/T393738) (owner: 10Andrea Denisse) [10:00:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2036', diff saved to https://phabricator.wikimedia.org/P76357 and previous config saved to /var/cache/conftool/dbconfig/20250521-100055-marostegui.json [10:01:24] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5633/" [puppet] - 10https://gerrit.wikimedia.org/r/1148819 (https://phabricator.wikimedia.org/T392506) (owner: 10Majavah) [10:01:45] (03CR) 10Hnowlan: [C:03+2] alertmanager: add receiver and routing for MediaWiki-File-management tasks [puppet] - 10https://gerrit.wikimedia.org/r/1148485 (https://phabricator.wikimedia.org/T385868) (owner: 10Scott French) [10:01:54] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate cleanupUploadStash job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144507 (https://phabricator.wikimedia.org/T385868) (owner: 10Hnowlan) [10:02:08] (03PS1) 10Marostegui: es2036: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148820 (https://phabricator.wikimedia.org/T394469) [10:02:15] (03CR) 10Hnowlan: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1144507 (https://phabricator.wikimedia.org/T385868) (owner: 10Hnowlan) [10:02:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2036.codfw.wmnet with reason: Maintenance [10:03:14] (03CR) 10Cathal Mooney: New device additions for codfw expansion plus policy changes (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [10:03:22] (03CR) 10Marostegui: [C:03+2] es2036: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148820 (https://phabricator.wikimedia.org/T394469) (owner: 10Marostegui) [10:03:23] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148819 (https://phabricator.wikimedia.org/T392506) (owner: 10Majavah) [10:03:24] FIRING: SystemdUnitFailed: networking.service on mc-misc2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:03:34] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: legacy_redirector: Enable IPv6 monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1148819 (https://phabricator.wikimedia.org/T392506) (owner: 10Majavah) [10:04:06] taavi: can I merge your change? [10:04:17] marostegui: yes please [10:04:21] doing! [10:04:27] thanks! 
[10:07:08] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [10:07:23] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [10:08:29] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:00] (03CR) 10David Caro: [C:03+2] cloud: move images to use docker-registry.svc.t.o [puppet] - 10https://gerrit.wikimedia.org/r/1148808 (owner: 10David Caro) [10:11:10] (03CR) 10David Caro: [C:03+2] "deployed in toolsbeta and tools" [puppet] - 10https://gerrit.wikimedia.org/r/1148808 (owner: 10David Caro) [10:12:47] (03CR) 10Hnowlan: [C:03+1] P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [10:12:51] (03PS1) 10Jcrespo: mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) [10:13:43] (03Abandoned) 10Hnowlan: mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133079 (https://phabricator.wikimedia.org/T388140) (owner: 10Hnowlan) [10:14:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76358 and previous config saved to /var/cache/conftool/dbconfig/20250521-101412-root.json [10:15:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [10:15:47] (03CR) 10CI reject: [V:04-1] mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) 
[10:17:00] (03CR) 10Ilias Sarantopoulos: [C:03+1] admin_ng/LiftWing: add edit-check namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148803 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis) [10:17:59] (03CR) 10Fabfur: [C:03+1] "whenever you want..." [puppet] - 10https://gerrit.wikimedia.org/r/1148814 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [10:19:27] 10ops-codfw, 06DC-Ops: Power Supply - PS Redundancy - issue on ms-fe2012:9290 - https://phabricator.wikimedia.org/T394901 (10phaultfinder) 03NEW [10:20:03] (03CR) 10Ayounsi: [C:03+1] New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [10:21:30] (03PS1) 10Majavah: puppet_statsd: Uninstall now that statsd is read-only [puppet] - 10https://gerrit.wikimedia.org/r/1148825 [10:22:40] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10843403 (10MatthewVernon) @Jelto both buckets deleted. 
[10:23:18] (03PS2) 10Jcrespo: mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) [10:23:41] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1001.eqiad.wmnet with OS bookworm [10:24:02] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm [10:24:12] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1148814 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [10:26:10] !log enabling edge uniques on cp3066 - T391411 [10:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:13] T391411: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411 [10:27:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [10:27:45] brouberol@cumin2002 reimage (PID 3290949) is awaiting input [10:27:54] RESOLVED: SystemdUnitFailed: networking.service on mc-misc2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:28:44] (03CR) 10Jcrespo: "Amir: let me know what you think. Once deployed done, I will run puppet and restart the x3 instances and move them to x3 upstream." 
[puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [10:29:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76360 and previous config saved to /var/cache/conftool/dbconfig/20250521-102917-root.json [10:30:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [10:37:11] !log installing expat security updates [10:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:36] (03PS1) 10Jelto: gerrit/nftables_throttling: make abusers more generic [puppet] - 10https://gerrit.wikimedia.org/r/1148826 (https://phabricator.wikimedia.org/T394519) [10:38:50] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [10:39:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [10:39:17] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [10:40:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [10:40:49] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5634/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148826 (https://phabricator.wikimedia.org/T394519) (owner: 10Jelto) [10:40:50] (03PS1) 10Tchanders: Temp accounts: Set group requirements for IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148827 (https://phabricator.wikimedia.org/T393615) [10:41:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:41:55] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [10:42:53] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:43:24] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1001.eqiad.wmnet with OS bookworm [10:43:32] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [10:44:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76362 and previous config saved to /var/cache/conftool/dbconfig/20250521-104422-root.json [10:44:26] (03CR) 10Kosta Harlan: [C:03+1] Temp accounts: Set group requirements for IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148827 (https://phabricator.wikimedia.org/T393615) (owner: 10Tchanders) [10:46:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:46:34] (03CR) 10Dreamy Jazz: [C:03+1] Temp accounts: Set group requirements for IP reveal group [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1148827 (https://phabricator.wikimedia.org/T393615) (owner: 10Tchanders) [10:51:21] !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [10:51:32] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843485 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo... [10:53:30] (03CR) 10Arnaudb: [C:03+1] "lgtm! thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1148826 (https://phabricator.wikimedia.org/T394519) (owner: 10Jelto) [10:53:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [10:53:58] (03PS3) 10Máté Szabó: Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) [10:54:14] (03PS4) 10Máté Szabó: Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086) [10:55:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [10:56:37] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [10:58:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [10:59:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76363 and previous config saved to /var/cache/conftool/dbconfig/20250521-105928-root.json [11:00:05] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1100). [11:01:07] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [11:01:10] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [11:02:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [11:08:25] (03PS1) 10STran: Add mediawiki.ForeignApi.core as a dependency [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) [11:10:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran) [11:14:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76364 and previous config saved to /var/cache/conftool/dbconfig/20250521-111433-root.json [11:15:31] (03PS1) 10Arthur taylor: Enabled ScopedTypeaheadSearch for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669) [11:17:16] (03PS6) 10Slyngshede: VueJS Permissions App [software/bitu] - 10https://gerrit.wikimedia.org/r/1140498 [11:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:21:17] (03PS2) 10Arthur taylor: Enabled ScopedTypeaheadSearch for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 
(https://phabricator.wikimedia.org/T394669) [11:22:28] (03PS1) 10Majavah: openstack: wmcs-enc-cli: Pass project to keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/1148835 (https://phabricator.wikimedia.org/T394775) [11:22:31] (03PS1) 10Majavah: openstack: wmcs-webproxy: Pass project to keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/1148836 [11:22:45] (03PS2) 10Gmodena: EventStreamConfig: add staging page_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) [11:24:01] (03PS1) 10Slyngshede: data.yaml: Tracking entry for guilherme [puppet] - 10https://gerrit.wikimedia.org/r/1148838 [11:24:52] (03PS1) 10Brouberol: Duplicate partman/custom/kafka-jumbo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1148839 (https://phabricator.wikimedia.org/T377874) [11:24:54] (03PS1) 10Brouberol: partman: define a kafka-jumbo-ba recipe [puppet] - 10https://gerrit.wikimedia.org/r/1148840 (https://phabricator.wikimedia.org/T377874) [11:25:18] (03CR) 10Marostegui: [C:03+1] mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [11:26:06] (03CR) 10CI reject: [V:04-1] Duplicate partman/custom/kafka-jumbo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1148839 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [11:26:48] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10843676 (10cmooney) Also we can add the links to the CRs now: |Switch|Port|CR|Port| |--------|-----|---|-----| |ssw1-e1-codfw|et-0/0/31|cr1-codfw|et-3/0/2| |ssw1-f1-codfw|et-0/0/31|cr2-codf... [11:26:52] (03CR) 10Arthur taylor: "ready for review. 
Should not be deployed until we have a confirmation about a go-live date for the change to test.wikidata.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669) (owner: 10Arthur taylor) [11:27:07] (03CR) 10Ladsgroup: [C:03+1] mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [11:27:18] (03PS2) 10Brouberol: Duplicate partman/custom/kafka-jumbo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1148839 (https://phabricator.wikimedia.org/T377874) [11:27:18] (03PS2) 10Brouberol: partman: define a kafka-jumbo-ba recipe [puppet] - 10https://gerrit.wikimedia.org/r/1148840 (https://phabricator.wikimedia.org/T377874) [11:28:32] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:29:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76365 and previous config saved to /var/cache/conftool/dbconfig/20250521-112939-root.json [11:33:03] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [11:33:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [11:33:59] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad [11:35:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad [11:37:20] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: push IPv6 address changes for codfw expansion link networks - cmooney@cumin1002" [11:37:40] !log cmooney@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: push IPv6 address changes for codfw expansion link networks - cmooney@cumin1002" [11:37:41] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:37:51] (03PS1) 10Cathal Mooney: Add new INCLUDE statement for 2620:0:860:139::/64 reverse [dns] - 10https://gerrit.wikimedia.org/r/1148842 (https://phabricator.wikimedia.org/T394021) [11:42:09] (03PS6) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) [11:44:25] SSW's in codfw BGP alerts will probably land, nothing to worry about I'm tidying it up now [11:44:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76366 and previous config saved to /var/cache/conftool/dbconfig/20250521-114444-root.json [11:46:39] FIRING: [6x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:18b::2) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:50:08] (03CR) 10Máté Szabó: [C:03+1] Add mediawiki.ForeignApi.core as a dependency [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran) [11:51:39] FIRING: [8x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:18b::2) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:53:40] (03PS2) 10Clément Goubert: mw::maintenance: migrate continuousScan-commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan) [11:54:04] (03CR) 10Ayounsi: [C:03+1] "nice!" 
[dns] - 10https://gerrit.wikimedia.org/r/1148842 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [11:55:01] (03CR) 10Clément Goubert: [C:03+1] "Other scripts are going well, I think this can be migrated." [puppet] - 10https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan) [11:55:22] (03CR) 10Cathal Mooney: [C:03+2] Add new INCLUDE statement for 2620:0:860:139::/64 reverse [dns] - 10https://gerrit.wikimedia.org/r/1148842 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [11:55:51] !log cmooney@dns2005 START - running authdns-update [11:56:30] !log cmooney@dns2005 END - running authdns-update [11:59:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76367 and previous config saved to /var/cache/conftool/dbconfig/20250521-115950-root.json [12:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:29] brouberol@cumin2002 reimage (PID 3336134) is awaiting input [12:15:55] (03CR) 10Jforrester: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester) [12:19:45] (03CR) 10FNegri: [C:03+1] "LGTM. Do we know when/why this started to be required?" [puppet] - 10https://gerrit.wikimedia.org/r/1148835 (https://phabricator.wikimedia.org/T394775) (owner: 10Majavah) [12:20:17] (03CR) 10FNegri: "Is this still related to T394775 or is it solving a different issue?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148836 (owner: 10Majavah) [12:22:28] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur) [12:24:36] (03CR) 10Majavah: [C:03+2] "Likely related to the new openstack authentication scope enforcement thing." [puppet] - 10https://gerrit.wikimedia.org/r/1148835 (https://phabricator.wikimedia.org/T394775) (owner: 10Majavah) [12:25:08] (03CR) 10Majavah: "Similar issue but in a different script." [puppet] - 10https://gerrit.wikimedia.org/r/1148836 (owner: 10Majavah) [12:26:23] (03PS1) 10Ayounsi: Disable pint alerting for SwitchCoreInterfaceDown [alerts] - 10https://gerrit.wikimedia.org/r/1148849 (https://phabricator.wikimedia.org/T388641) [12:26:27] (03CR) 10FNegri: [C:03+1] openstack: wmcs-webproxy: Pass project to keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/1148836 (owner: 10Majavah) [12:26:37] (03CR) 10Majavah: [C:03+2] openstack: wmcs-webproxy: Pass project to keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/1148836 (owner: 10Majavah) [12:28:27] (03CR) 10Vgutierrez: [C:03+1] hiera: Add zarcillo k8s service on traffic server [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [12:30:21] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [12:30:24] (03PS1) 10Hashar: Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) [12:30:37] (03CR) 10Tiziano Fogli: [C:04-1] "I think a better approach would be to edit modules/prometheus/templates/prometheus-apache-vhost.erb and add the `RewriteEngine on` directi" [puppet] - 10https://gerrit.wikimedia.org/r/1146973 (owner: 10Majavah) [12:31:39] FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and 
ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:32:41] (03PS2) 10Hashar: Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) [12:34:11] !log brouberol@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [12:34:17] (03CR) 10Tiziano Fogli: [C:03+1] grafana: Remove stunnel from quickdatacopy sync [puppet] - 10https://gerrit.wikimedia.org/r/1148468 (https://phabricator.wikimedia.org/T393738) (owner: 10Andrea Denisse) [12:34:24] 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, and 2 others: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host kafka-jumbo1018.eqiad.wmnet with OS bullseye execu... 
[12:36:09] (03CR) 10Majavah: "> I think a better approach would be to edit modules/prometheus/templates/prometheus-apache-vhost.erb and add the RewriteEngine on directi" [puppet] - 10https://gerrit.wikimedia.org/r/1146973 (owner: 10Majavah) [12:46:36] (03PS3) 10Sbisson: Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) [12:46:41] (03CR) 10Sbisson: Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [12:47:35] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-05-14-112404 to 2025-05-20-173017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148857 (https://phabricator.wikimedia.org/T393631) [12:48:25] !log Ran fixStuckGlobalRename.php for T394905 [12:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:29] T394905: Unblock stuck global rename of 大筒木博人 - https://phabricator.wikimedia.org/T394905 [12:48:47] btullis@cumin1002 reimage (PID 19998) is awaiting input [12:49:06] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148838 (owner: 10Slyngshede) [12:49:36] !log test new core_out bgp policy on asw1-bw27-esams (T394530) [12:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:41] T394530: Homer: redefine IBGP definitions to support both Unicast & EVPN clusters - https://phabricator.wikimedia.org/T394530 [12:49:59] (03PS1) 10Reedy: Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148859 (https://phabricator.wikimedia.org/T394814) [12:50:07] (03PS1) 10Reedy: Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148860 
(https://phabricator.wikimedia.org/T394814) [12:50:55] (03CR) 10Tiziano Fogli: [C:03+1] "Since it's not a paging alert, LGTM." [alerts] - 10https://gerrit.wikimedia.org/r/1148849 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:50:55] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc-misc2002.codfw.wmnet [12:51:39] FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:54:52] (03CR) 10Slyngshede: [C:03+2] data.yaml: Tracking entry for guilherme [puppet] - 10https://gerrit.wikimedia.org/r/1148838 (owner: 10Slyngshede) [12:56:36] (03PS3) 10Cathal Mooney: New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) [12:56:39] FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [12:57:17] (03CR) 10Cathal Mooney: New device additions for codfw expansion plus policy changes (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [12:57:33] (03CR) 10Ssingh: "Yes, you will." 
[dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [12:58:55] jouncebot: nowandnext [12:58:56] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [12:58:56] In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1300) [12:59:24] (03CR) 10Reedy: [C:03+2] Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148859 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy) [12:59:29] (03CR) 10Reedy: [C:03+2] Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148860 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1300). [13:00:05] Tran: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:01:01] 👋 [13:01:30] I misread that [13:01:38] I guess I should do the window then :D [13:02:03] Tran: Just to double check, it doesn't need to go into .1 too? [13:02:12] (03CR) 10Reedy: [C:03+2] Add mediawiki.ForeignApi.core as a dependency [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran) [13:02:43] Just .2 as the patch it fixes was deployed as part of .2 [13:02:45] Thank you! 
[13:02:54] Ah, yeah https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/1138127 is only in .2 [13:02:56] Nice and easy then [13:03:14] (03CR) 10Reedy: [C:03+2] "For reference, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/1138127 only landed in .2" [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran) [13:04:27] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm [13:07:16] (03CR) 10Ayounsi: [C:03+1] "lgtm! the less prefix-list the better" [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [13:08:07] (03PS1) 10Clément Goubert: P:kafka::broker: Set cpu governor to performance [puppet] - 10https://gerrit.wikimedia.org/r/1148862 (https://phabricator.wikimedia.org/T393513) [13:08:22] (03CR) 10Ayounsi: [C:03+2] Disable pint alerting for SwitchCoreInterfaceDown [alerts] - 10https://gerrit.wikimedia.org/r/1148849 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:08:32] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:13] (03CR) 10Cathal Mooney: [C:03+2] New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [13:09:46] (03Merged) 10jenkins-bot: Disable pint alerting for SwitchCoreInterfaceDown [alerts] - 10https://gerrit.wikimedia.org/r/1148849 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:09:47] (03Merged) 10jenkins-bot: Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.1) 
- 10https://gerrit.wikimedia.org/r/1148859 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy) [13:10:30] (03Merged) 10jenkins-bot: New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [13:11:31] (03Merged) 10jenkins-bot: Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148860 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy) [13:11:32] (03Merged) 10jenkins-bot: Add mediawiki.ForeignApi.core as a dependency [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran) [13:11:36] There we go [13:11:42] (03CR) 10Hnowlan: [C:03+1] P:kafka::broker: Set cpu governor to performance [puppet] - 10https://gerrit.wikimedia.org/r/1148862 (https://phabricator.wikimedia.org/T393513) (owner: 10Clément Goubert) [13:11:52] Tran: Do you need/want to test yours? Or just happy to let it go through? 
[13:11:56] (03CR) 10Clément Goubert: [C:03+2] P:kafka::broker: Set cpu governor to performance [puppet] - 10https://gerrit.wikimedia.org/r/1148862 (https://phabricator.wikimedia.org/T393513) (owner: 10Clément Goubert) [13:12:02] I can test, please hold [13:12:38] It's not ready to go yet ;) [13:13:32] FIRING: KubernetesCalicoDown: ml-serve1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:13:35] whoops I just realized that 😅 but yes I'd like to test when it's up [13:13:44] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1148859|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148860|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148830|Add mediawiki.ForeignApi.core as a dependency (T387720)]] [13:13:49] T394814: Unable to have multiple captcha implementations in extension-list - https://phabricator.wikimedia.org/T394814 [13:13:49] T387720: Prefer a parameter over a configuration for importing translations in SecurePoll - https://phabricator.wikimedia.org/T387720 [13:14:16] (03CR) 10Hashar: [C:03+2] Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar) [13:15:59] !log reedy@deploy1003 reedy, stran: Backport for [[gerrit:1148859|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148860|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148830|Add mediawiki.ForeignApi.core as a dependency (T387720)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:16:39] FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:16:43] Tran: should be good to test now [13:16:49] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "import new switches from netbox to hiera now they are status active - cmooney@cumin1003 - T394021" [13:16:52] T394021: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021 [13:16:55] (03PS1) 10Btullis: Adapt dump scripts for running in containers [dumps] - 10https://gerrit.wikimedia.org/r/1148863 (https://phabricator.wikimedia.org/T394389) [13:17:26] (03PS1) 10Slyngshede: Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865 [13:17:37] Confirmed working as expected 🎉 [13:17:43] sweet, mine looks good too [13:17:44] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "import new switches from netbox to hiera now they are status active - cmooney@cumin1003 - T394021" [13:17:46] !log reedy@deploy1003 reedy, stran: Continuing with sync [13:18:56] (03PS2) 10Reedy: Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148398 (https://phabricator.wikimedia.org/T382148) [13:19:29] (03PS2) 10Slyngshede: Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865 [13:19:38] (03PS3) 10Reedy: Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148398 (https://phabricator.wikimedia.org/T382148) [13:19:49] (03CR) 10Reedy: [C:03+2] Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1148398 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [13:20:27] (03PS3) 10Slyngshede: Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865 [13:20:37] (03Merged) 10jenkins-bot: Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148398 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [13:20:56] (03PS1) 10Alexandros Kosiaris: eventgate-main: Increase CPU limit to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148866 (https://phabricator.wikimedia.org/T393513) [13:21:02] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [13:21:39] FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:21:42] (03PS4) 10Slyngshede: Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865 [13:22:29] (03PS1) 10Cathal Mooney: Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1148867 (https://phabricator.wikimedia.org/T394021) [13:23:30] (03CR) 10CI reject: [V:04-1] Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar) [13:24:02] (03CR) 10Ayounsi: [C:03+1] Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1148867 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [13:24:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [13:24:28] (03CR) 
10Cathal Mooney: [C:03+2] Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1148867 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [13:24:36] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148859|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148860|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148830|Add mediawiki.ForeignApi.core as a dependency (T387720)]] (duration: 10m 52s) [13:24:41] T394814: Unable to have multiple captcha implementations in extension-list - https://phabricator.wikimedia.org/T394814 [13:24:41] T387720: Prefer a parameter over a configuration for importing translations in SecurePoll - https://phabricator.wikimedia.org/T387720 [13:25:30] (03PS1) 10Hashar: Gerrit 3.10.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148868 (https://phabricator.wikimedia.org/T390666) [13:25:36] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1001.eqiad.wmnet [13:25:40] (03Merged) 10jenkins-bot: Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1148867 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [13:25:57] (03PS1) 10Vgutierrez: hiera: Enable edge uniques on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/1148869 (https://phabricator.wikimedia.org/T391411) [13:26:09] (03CR) 10Hashar: [C:03+2] Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar) [13:26:40] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1148398|Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" (T382148 T394814)]] [13:26:44] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148 
[13:26:51] (03PS1) 10Majavah: P:toolforge::proxy: Remove unnecessary Prometheus term [puppet] - 10https://gerrit.wikimedia.org/r/1148870 [13:26:51] (03PS1) 10Majavah: P:toolforge::proxy: Listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) [13:27:10] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148869 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:28:32] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:29:46] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5637/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah) [13:31:12] (03CR) 10Ssingh: [C:03+1] hiera: Enable edge uniques on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/1148869 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:31:21] (03CR) 10Alexandros Kosiaris: [C:03+2] eventgate-main: Increase CPU limit to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148866 (https://phabricator.wikimedia.org/T393513) (owner: 10Alexandros Kosiaris) [13:31:39] RESOLVED: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [13:33:02] (03Merged) 10jenkins-bot: eventgate-main: Increase CPU limit to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148866 (https://phabricator.wikimedia.org/T393513) (owner: 10Alexandros Kosiaris) [13:34:11] (03CR) 10Vgutierrez: "same as 
with `is_alt_domain`, after a var.set() we should log its value with std.log()" [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [13:34:22] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/1148869 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:36:26] !log enabling edge uniques on cp4045 - T391411 [13:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:30] T391411: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411 [13:37:09] (03Merged) 10jenkins-bot: Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar) [13:37:10] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet [13:37:33] !log deploy eventgate-main to pickup the CPU change as well as the change in envoy histogram buckets [13:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:32] RESOLVED: KubernetesCalicoDown: dse-k8s-worker1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:39:57] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [13:40:33] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [13:41:10] !log updating dns-root-data on A:dnsbox [13:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:13] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ml-serve1001.eqiad.wmnet with OS bookworm [13:41:25] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply [13:41:33] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [13:42:23] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [13:42:56] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [13:43:13] !log updating dns-root-data on A:wikidough [13:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:39] (03PS2) 10Majavah: P:toolforge::proxy: Listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) [13:44:37] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Remove stunnel from quickdatacopy sync [puppet] - 10https://gerrit.wikimedia.org/r/1148468 (https://phabricator.wikimedia.org/T393738) (owner: 10Andrea Denisse) [13:44:44] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5638/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah) [13:45:34] (03CR) 10Elukey: [C:03+2] conftool-data: add ml-serve1001 back to its pool after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1148789 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey) [13:45:58] (03CR) 10Majavah: [V:03+1] "Also checked that this works on existing IPv4-only hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah) [13:46:30] denisse: o/ ok to merge? [13:46:49] elukey: Yes, please. I can't ssh to puppetserver for some reason. :( [13:47:05] {{done}} [13:47:10] Thanks!! 
[13:47:37] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [13:48:12] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and not P{dns7001*} and A:dnsbox [13:48:27] !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1001.eqiad.wmnet [13:48:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1001.eqiad.wmnet [13:49:12] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [13:49:36] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [13:50:12] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: name=ml-serve1001.eqiad.wmnet,dc=eqiad,cluster=maps,service=inference [13:51:00] (03PS1) 10Vgutierrez: hiera: Enable edge uniques in one server per DC and cluster [puppet] - 10https://gerrit.wikimedia.org/r/1148877 (https://phabricator.wikimedia.org/T391411) [13:51:30] (03CR) 10FNegri: [C:03+1] P:toolforge::proxy: Remove unnecessary Prometheus term [puppet] - 10https://gerrit.wikimedia.org/r/1148870 (owner: 10Majavah) [13:51:34] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148877 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:51:50] (03CR) 10FNegri: [C:03+1] P:exim::smarthost: Convert unsupported domain warn to reject [puppet] - 10https://gerrit.wikimedia.org/r/1137758 (https://phabricator.wikimedia.org/T366935) (owner: 10Majavah) [13:52:35] (03CR) 10FNegri: [C:03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah) [13:52:42] (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: Remove unnecessary Prometheus term [puppet] - 10https://gerrit.wikimedia.org/r/1148870 (owner: 10Majavah) [13:52:48] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::proxy: Listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah) [13:52:59] (03CR) 10Majavah: [C:03+2] P:exim::smarthost: Convert unsupported domain warn to reject [puppet] - 10https://gerrit.wikimedia.org/r/1137758 (https://phabricator.wikimedia.org/T366935) (owner: 10Majavah) [13:53:53] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:54:16] !log elukey@puppetserver1001 conftool action : set/weight=1; selector: name=ml-serve1001.eqiad.wmnet [13:56:50] (03PS1) 10Muehlenhoff: Move Kartotherian/staging to the new Bookworm nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) [13:56:52] !log reedy@deploy1003 reedy: Backport for [[gerrit:1148398|Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" (T382148 T394814)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:56:56] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148 [13:56:57] T394814: Unable to have multiple captcha implementations in extension-list - https://phabricator.wikimedia.org/T394814 [13:57:04] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=ml-serve10.*.eqiad.wmnet [13:57:12] this is taking a while [13:57:21] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:59] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=ml-serve20.*.codfw.wmnet [13:58:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1140498 (owner: 10Slyngshede) [13:58:29] !log reedy@deploy1003 reedy: Continuing with sync [13:59:27] (03PS2) 10Muehlenhoff: Move Kartotherian/staging to the new Bookworm nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) [13:59:44] (03PS1) 10Ssingh: dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default [puppet] - 10https://gerrit.wikimedia.org/r/1148883 [14:00:07] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1400) [14:00:23] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5639/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh) [14:00:45] (03CR) 10Ssingh: dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh) [14:01:00] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [14:02:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one typo inline" [puppet] - 
10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh) [14:02:35] reedy: are you still doing backport stuff? [14:03:17] we need to get an updated version of the 04 securepoll patch out to wmf.2 as it’s currently causing https://phabricator.wikimedia.org/T394900 [14:03:55] (03CR) 10Ssingh: [C:03+1] "Checked host names to ensure one per cluster and DC." [puppet] - 10https://gerrit.wikimedia.org/r/1148877 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:04:16] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1018.eqiad.wmnet with reason: host reimage [14:05:01] (03PS2) 10Ssingh: dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default [puppet] - 10https://gerrit.wikimedia.org/r/1148883 [14:06:18] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5640/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh) [14:06:21] (03CR) 10Ssingh: dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh) [14:07:23] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh) [14:07:30] (03CR) 10Ssingh: [V:03+1 C:03+2] dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh) [14:07:40] sbassett: it's doing a localisation rebuild, so taking a while [14:07:44] 14:07:42 K8s deployment progress: 57% (ok: 1420; fail: 0; left: 1058) \ [14:08:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1018.eqiad.wmnet with reason: host reimage [14:08:03] shouldn't be too much longer... 
[14:08:21] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques in one server per DC and cluster [puppet] - 10https://gerrit.wikimedia.org/r/1148877 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:57] !log running agent on A:wikidough [14:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:12] !log enabling edge uniques in one server per DC and cluster (cp[1100-1101],cp[2027-2028],cp3074,cp[5017,5025],cp[6001,6009],cp[7001,7009])- T391411 [14:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:16] T391411: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411 [14:10:07] Reedy: ok - looks like the wikifunctions folks aren’t using their window rn, so we should be good to sec-deploy once you’re done. Assuming that’s it? 
[14:10:16] Yeah, I'm done [14:10:21] Ok, thanks [14:10:21] (when this is done) [14:11:27] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [14:11:50] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:11:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:12:33] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148398|Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" (T382148 T394814)]] (duration: 45m 53s) [14:12:38] T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148 [14:12:38] T394814: Unable to have multiple captcha implementations in extension-list - https://phabricator.wikimedia.org/T394814 [14:12:43] sbassett: that's me clear now [14:13:21] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10844388 (10Papaul) @cmooney thank you. I do not have any preference on how this is done. What works best for all is good with me. [14:16:07] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10844404 (10GPSLeo) When is this expected to be solved? Because of this problem many important maintenance and monitoring tools are broken. This should have UBN priority. 
[14:20:11] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [14:22:47] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10844425 (10bking) Not to get too far off topic, but let me contextualize this. > > This is the first time I heard about that repository. Was the already exi... [14:23:53] (03PS3) 10Ayounsi: Icinga: remove some network devices checks [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) [14:24:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [14:24:42] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye [14:24:53] (03PS1) 10ZhaoFJx: arbcom_zhwiki: Change wgWhitelistRead Setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) [14:25:15] (03PS3) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [14:25:20] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [14:25:40] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Upgrading to Java 11.0.27 - eevans@cumin1002 [14:26:49] (03CR) 10Hoo man: [C:03+1] "Fine to deploy whenever we want." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669) (owner: 10Arthur taylor) [14:26:49] (03CR) 10Nikerabbit: [C:03+1] Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [14:27:49] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [14:27:54] (03PS1) 10CDobbins: replace X-WMF-UUID with vmod_var variable [puppet] - 10https://gerrit.wikimedia.org/r/1148889 [14:28:13] (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [14:29:07] (03PS2) 10ZhaoFJx: arbcom_zhwiki: Change wgWhitelistRead Setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) [14:29:56] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on ms-fe2012:9290 - https://phabricator.wikimedia.org/T394901#10844461 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cord. alert cleared. 
[14:30:34] !log Deployed updated security fix for T392341 (04) to 1.45-wmf.2 [14:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:51] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate continuousScan-commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan) [14:31:54] !incidents [14:31:54] 6181 (ACKED) kafka-jumbo1016/Kafka Broker Server (paged) [14:31:54] 6182 (ACKED) kafka-jumbo1017/Kafka Broker Server (paged) [14:32:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [14:32:19] I asked brouberol twice if to resolve them [14:32:41] ok, so re-page from yesterday? [14:32:46] yeah [14:33:04] wanted the greenlight first [14:33:16] but it is what it is [14:33:41] :) [14:36:19] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10844521 (10Jhancock.wm) I've connected the intel NIC via a 1000BASE-TX SFP to port 44 on the switch. Let me know when you need it rem... [14:36:31] (03PS1) 10Ayounsi: Netops: remove check_bgp [puppet] - 10https://gerrit.wikimedia.org/r/1148891 (https://phabricator.wikimedia.org/T388641) [14:37:55] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:37:56] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:38:14] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148891 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [14:38:17] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:38:32] `1 [14:39:17] Fat fingers that I have.... 
[14:40:23] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet,db1216.eqiad.wmnet with reason: Restart x3 [14:42:33] (03CR) 10DLynch: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester) [14:42:55] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and not P{dns7001*} and A:dnsbox [14:43:06] !log installing postgresql-15 security updates [14:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:32] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:43:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2018.codfw.wmnet with OS bookworm [14:43:53] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2018.codfw.wmnet with OS bookworm [14:47:34] (03PS1) 10Brouberol: airflow-dev: pull from the remote branch every 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148893 (https://phabricator.wikimedia.org/T393998) [14:47:51] (03PS1) 10ZhaoFJx: zh.arbcom: Enable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920) [14:48:23] (03CR) 10Jforrester: VisualEditor: Deploy '+' mobile menu (and new tools) to 
Phase 1 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester) [14:49:09] (03PS1) 10Elukey: kubernetes: add maps-test codfw as external service [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) [14:49:35] (03PS2) 10ZhaoFJx: zh.arbcom: Enable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920) [14:50:52] (03CR) 10Elukey: Move Kartotherian/staging to the new Bookworm nodes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:51:22] (03PS3) 10ZhaoFJx: arbcom_zhwiki: Enable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920) [14:52:05] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5642/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [14:52:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx) [14:52:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920) (owner: 10ZhaoFJx) [14:53:22] (03CR) 10CDobbins: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148889 (owner: 10CDobbins) [14:55:26] (03CR) 10Muehlenhoff: kubernetes: add maps-test codfw as 
external service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [14:55:54] RECOVERY - Host an-worker1068 is UP: PING WARNING - Packet loss = 33%, RTA = 930.22 ms [14:58:06] (03CR) 10Btullis: [C:03+1] airflow-dev: pull from the remote branch every 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148893 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [14:58:06] (03CR) 10Jcrespo: [C:03+2] mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo) [14:59:18] (03Abandoned) 10Brouberol: Duplicate partman/custom/kafka-jumbo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1148839 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [14:59:24] (03Abandoned) 10Brouberol: partman: define a kafka-jumbo-ba recipe [puppet] - 10https://gerrit.wikimedia.org/r/1148840 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol) [14:59:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2018.codfw.wmnet with reason: host reimage [15:00:47] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10844697 (10Marostegui) >>! In T394624#10844404, @GPSLeo wrote: > When is this expected to be solved? Because of this problem many important maintenance and monitoring tools are broken. This should have... 
[15:02:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2018.codfw.wmnet with reason: host reimage [15:03:03] RECOVERY - MegaRAID on an-worker1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:05:36] (03CR) 10Brouberol: [C:03+1] Adapt dump scripts for running in containers [dumps] - 10https://gerrit.wikimedia.org/r/1148863 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:27] (03CR) 10Brouberol: [C:03+1] airflow-dev: pull from the remote branch every 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148893 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [15:08:30] (03CR) 10Brouberol: [C:03+2] airflow-dev: pull from the remote branch every 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148893 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [15:09:39] (03CR) 10Federico Ceratto: [C:03+2] hiera: Add zarcillo k8s service on traffic server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:12:56] (03CR) 10Btullis: [C:03+2] Adapt dump scripts for running in containers [dumps] - 10https://gerrit.wikimedia.org/r/1148863 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [15:15:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:16:30] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:16:47] (03CR) 10BryanDavis: [C:03+1] "Should I cherry-pick this in Beta to prove that it 
works there?" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [15:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:18:57] (03CR) 10Stang: arbcom_zhwiki: Change wgWhitelistRead Setting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx) [15:19:11] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:20:45] (03CR) 10Eamedina: [C:03+1] Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [15:21:20] (03CR) 10Ladsgroup: openstack: wikireplica_dns: Point x3 records to new VIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148313 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [15:21:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:21:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2018.codfw.wmnet with OS bookworm [15:22:04] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2018.codfw.wmnet with OS bookworm completed: - pc2018 (**PASS**) - Remov... 
[15:22:34] (03PS1) 10Cathal Mooney: Switch IBGP Policy: set next-hop self on routes being originated [homer/public] - 10https://gerrit.wikimedia.org/r/1148898 (https://phabricator.wikimedia.org/T394530) [15:22:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844773 (10Jhancock.wm) 05Open→03Resolved [15:23:01] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844777 (10Jhancock.wm) @Marostegui this is completed. [15:23:58] (03CR) 10FNegri: [C:03+1] openstack: wikireplica_dns: Point x3 records to new VIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148313 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [15:24:30] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844781 (10Ladsgroup) Thank you! [15:25:07] !log forgetting 4 old instances @ orchestrator-web T384274 [15:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:11] T384274: Backups for x3 - https://phabricator.wikimedia.org/T384274 [15:25:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:26:47] (03PS1) 10Majavah: toolforge: toolviews: Do not log secret changes [puppet] - 10https://gerrit.wikimedia.org/r/1148899 [15:27:13] (03PS4) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [15:28:32] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:30:16] (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 
10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [15:31:57] (03PS3) 10ZhaoFJx: arbcom_zhwiki: Change wgWhitelistRead Setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) [15:32:01] (03CR) 10ZhaoFJx: arbcom_zhwiki: Change wgWhitelistRead Setting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx) [15:33:01] (03CR) 10Ayounsi: [C:03+1] Switch IBGP Policy: set next-hop self on routes being originated [homer/public] - 10https://gerrit.wikimedia.org/r/1148898 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [15:34:16] (03CR) 10FNegri: [C:03+1] toolforge: toolviews: Do not log secret changes [puppet] - 10https://gerrit.wikimedia.org/r/1148899 (owner: 10Majavah) [15:37:20] (03PS5) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [15:40:25] (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [15:40:26] (03PS1) 10Effie Mouzeli: memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) [15:41:34] (03CR) 10CI reject: [V:04-1] memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [15:41:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:43:35] (03CR) 
10Stang: [C:03+1] arbcom_zhwiki: Change wgWhitelistRead Setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx) [15:48:26] (03CR) 10Xcollazo: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena) [15:48:26] (03PS2) 10Effie Mouzeli: memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) [15:49:14] (03PS1) 10Effie Mouzeli: mediawiki::memcached enable performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148901 (https://phabricator.wikimedia.org/T371881) [15:49:31] (03CR) 10CI reject: [V:04-1] memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [15:49:36] (03PS2) 10Effie Mouzeli: mediawiki::memcached enable performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148901 (https://phabricator.wikimedia.org/T371881) [15:50:25] (03CR) 10Federico Ceratto: "Ok, tracking progress in https://phabricator.wikimedia.org/T394884" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [15:51:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena) [15:52:20] (03PS3) 10Effie Mouzeli: memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) [15:52:58] jhancock@cumin2002 provision (PID 3468025) is awaiting input [15:53:09] (03PS4) 10Effie Mouzeli: 
memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) [15:55:24] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10844927 (10Dzahn) [15:55:53] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10844928 (10Dzahn) 05Open→03Resolved 6 VMs have been created: 2 VMs - main zuul (8GB) zuul1001 zuul200... [15:56:01] (03CR) 10CI reject: [V:04-1] memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [15:56:56] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [15:56:59] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [15:57:37] (03CR) 10Majavah: [C:03+2] toolforge: toolviews: Do not log secret changes [puppet] - 10https://gerrit.wikimedia.org/r/1148899 (owner: 10Majavah) [15:58:42] (03CR) 10Joal: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena) [16:01:29] (03PS1) 10Dzahn: site: separate zuul regex, make it clear what is doing what [puppet] - 10https://gerrit.wikimedia.org/r/1148902 (https://phabricator.wikimedia.org/T393873) [16:02:06] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10844977 (10RobH) > I kind of like this idea, but it might complicate the reimage process. So probably the easiest thing is: > > * C... 
[16:02:09] (03CR) 10Dzahn: "also avoids the string "zuul3"" [puppet] - 10https://gerrit.wikimedia.org/r/1148902 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [16:02:16] (03PS5) 10Effie Mouzeli: memcached: add option to switch to the performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) [16:02:37] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [16:03:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:03:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [16:03:47] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10844995 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm [16:04:19] PROBLEM - Host cirrussearch2079 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:43] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate all refreshLinkRecommendations jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [16:05:55] RECOVERY - Host cirrussearch2079 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [16:07:27] PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:08:09] (03CR) 10Dzahn: "yes, PTR records needed for sure.. I basically just forgot to amend one more time.. 
doing" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [16:08:12] (03PS4) 10Scott French: P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) [16:08:45] RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.70 ms [16:09:27] (03PS6) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [16:11:30] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [16:11:48] (03PS1) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. [dns] - 10https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312) [16:11:52] (03CR) 10Dzahn: "well.. ACTUALLY.. multiple A records for one IP is considered standard but multiple PTR records for one IP is considered "not recommended"" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [16:12:35] (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri) [16:12:40] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:12:46] jouncebot: nowandnext [16:12:46] No deployments scheduled for the next 0 hour(s) and 47 minute(s) [16:12:46] In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1700) [16:13:06] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:16:30] (03CR) 10Clément Goubert: [C:03+1] "Completely optional nit, up to you, otherwise LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [16:16:41] (03CR) 10Clément Goubert: [C:03+1] mediawiki::memcached enable performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148901 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli) [16:17:06] (03CR) 10Dzahn: "https://serverfault.com/questions/618700/why-multiple-ptr-records-in-dns-is-not-recommended" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [16:17:50] (03PS1) 10Effie Mouzeli: WIP: profile::kubernetes::node: Add script to pull and mount latest mw [puppet] - 10https://gerrit.wikimedia.org/r/1148905 (https://phabricator.wikimedia.org/T276994) [16:18:11] (03CR) 10Clément Goubert: [C:03+1] P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [16:18:18] (03CR) 10Hnowlan: [C:03+1] P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [16:19:00] (03CR) 10CI reject: [V:04-1] WIP: profile::kubernetes::node: Add script to pull and mount latest mw [puppet] - 10https://gerrit.wikimedia.org/r/1148905 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [16:25:28] (03CR) 10Cathal Mooney: [C:03+2] Switch IBGP Policy: set next-hop self on routes being originated [homer/public] - 10https://gerrit.wikimedia.org/r/1148898 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [16:25:43] (03PS3) 10Cathal Mooney: Network: add puppet data for new devices and networks codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1145194 (https://phabricator.wikimedia.org/T394021) [16:26:40] (03Merged) 10jenkins-bot: Switch IBGP Policy: set next-hop self on routes being 
originated [homer/public] - 10https://gerrit.wikimedia.org/r/1148898 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney) [16:33:17] (03CR) 10Cathal Mooney: [C:03+2] Network: add puppet data for new devices and networks codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1145194 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney) [16:35:02] (03PS1) 10DLynch: Extend the mobile insert menu config so that tools can be specified [extensions/VisualEditor] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148908 (https://phabricator.wikimedia.org/T388604) [16:35:18] (03PS1) 10DLynch: Extend the mobile insert menu config so that tools can be specified [extensions/VisualEditor] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148909 (https://phabricator.wikimedia.org/T388604) [16:37:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/VisualEditor] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148909 (https://phabricator.wikimedia.org/T388604) (owner: 10DLynch) [16:37:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/VisualEditor] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148908 (https://phabricator.wikimedia.org/T388604) (owner: 10DLynch) [16:37:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester) [16:37:34] (03CR) 10Dzahn: "recently on #dns" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [16:38:02] 06SRE, 
06Fundraising-Backlog, 10fundraising-tech-ops: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10845153 (10AKanji-WMF) [16:40:58] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Upgrading to Java 11.0.27 - eevans@cumin1002 [16:41:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:41:31] (03CR) 10Btullis: [C:03+1] snapshot: Remove now unused Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1148751 (https://phabricator.wikimedia.org/T394647) (owner: 10Muehlenhoff) [16:42:56] (03CR) 10Dzahn: "yea.. so since/if I am supposed to pick a single PTR record.. that would mean I just use the existing one and this change is ok as is." 
[dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [16:45:48] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium-restart [16:45:59] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitarium-restart (exit_code=97) [16:46:14] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium-restart [16:46:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:46:27] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium-restart (exit_code=99) [16:46:37] !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium-restart [16:46:50] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium-restart (exit_code=99) [16:47:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:52:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [16:55:00] (03CR) 10Ssingh: [C:03+1] "yes, you observations are correct and at least I certainly missed the fact that you already have a PTR for the "primary" record and that's" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [16:56:25] (03CR) 10Ssingh: [C:03+1] "$ dig -x 208.80.154.151 +short" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn) [16:57:08] (03PS1) 10Hnowlan: mw::maintenance: migrate all remaining growthexperiments jobs [puppet] - 
10https://gerrit.wikimedia.org/r/1148914 (https://phabricator.wikimedia.org/T385782) [16:57:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:00:05] swfrench-wmf: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1700). [17:00:29] o/ I'll get started in a couple of minutes [17:01:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:05:02] (03PS1) 10Andrea Denisse: grafana: Disable dashboard sync to ugprade Grafana version [puppet] - 10https://gerrit.wikimedia.org/r/1148915 (https://phabricator.wikimedia.org/T394470) [17:06:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [17:08:32] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:38] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5643/co" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148915 (https://phabricator.wikimedia.org/T394470) (owner: 10Andrea Denisse) [17:09:56] (03PS2) 10Dzahn: lists: include nftables throttling profile [puppet] - 10https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) [17:10:12] (03CR) 10Dzahn: [C:03+1] "start simple.. then enable it" [puppet] - 10https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) (owner: 10Dzahn) [17:13:02] jhancock@cumin2002 reimage (PID 3486418) is awaiting input [17:20:54] (03CR) 10Scott French: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [17:20:56] (03CR) 10Scott French: [C:03+2] P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French) [17:27:29] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10845246 (10VRiley-WMF) 05Open→03In progress Taking this unit down for the memory swap. 
[17:28:32] FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:29:26] (03PS1) 10Bking: cirrussearch: Add cirrussearch row C/remove elastic row D [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610) [17:29:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:33:17] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [17:33:29] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [17:37:14] I'm done with the infra window [17:41:13] (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate all remaining growthexperiments jobs [puppet] - 10https://gerrit.wikimedia.org/r/1148914 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan) [17:41:22] (03PS1) 10Dzahn: role: delete requesttracker role [puppet] - 10https://gerrit.wikimedia.org/r/1148923 [17:41:33] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10845303 (10VRiley-WMF) 05In progress→03Resolved This is completed [17:42:15] (03PS2) 10Dzahn: role: delete requesttracker role [puppet] - 10https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777) [17:45:53] (03CR) 10Btullis: [C:03+1] cirrussearch: Add cirrussearch row C/remove elastic row D [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:48:26] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T394894#10845334 (10VRiley-WMF) [17:51:28] jouncebot: nowandnext [17:51:28] For the next 
0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1700) [17:51:29] In 2 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T2000) [17:51:44] (03PS1) 10Dreamy Jazz: Support creating logs in emptyUserGroup.php [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148924 (https://phabricator.wikimedia.org/T394914) [17:52:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [17:54:57] Anyone mind if I deploy? [17:55:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148924 (https://phabricator.wikimedia.org/T394914) (owner: 10Dreamy Jazz) [17:57:03] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:57:08] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:57:19] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:57:30] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:57:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:57:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:58:03] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add cirrussearch row C/remove elastic row D [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:58:04] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T394894#10845357 (10VRiley-WMF) [17:58:08] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:58:37] (03CR) 10Muehlenhoff: "Also needs to be dropped from profile::prometheus::ops" [puppet] - 10https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [18:01:21] (03PS7) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) [18:01:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [18:02:01] Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [18:02:07] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:02:23] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:02:34] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:02:44] RESOLVED: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:02:56] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [18:03:08] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:03:14] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:03:30] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:06:56] RESOLVED: 
RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [18:07:01] Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [18:08:06] (03Merged) 10jenkins-bot: Support creating logs in emptyUserGroup.php [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148924 (https://phabricator.wikimedia.org/T394914) (owner: 10Dreamy Jazz) [18:08:30] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1148924|Support creating logs in emptyUserGroup.php (T394914)]] [18:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:08:34] T394914: Update emptyUserGroup.php to optionally support creating log entries for removal - https://phabricator.wikimedia.org/T394914 [18:09:01] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T394894#10845376 (10VRiley-WMF) 05Open→03Resolved [18:10:50] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1148924|Support creating logs in emptyUserGroup.php (T394914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[18:12:00] (03CR) 10Dzahn: [C:03+2] site: separate zuul regex, make it clear what is doing what [puppet] - 10https://gerrit.wikimedia.org/r/1148902 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn) [18:13:55] (03CR) 10Bking: [C:03+2] cirrussearch: Add cirrussearch row C/remove elastic row D [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:14:35] (03PS1) 10Dzahn: zuul: create basic role/profile for zuul::man and install docker.io [puppet] - 10https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873) [18:14:49] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [18:15:21] yes, it's ok to merge multiple ;) [18:17:54] (03PS7) 10BCornwall: varnish: Replace date/stamp headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) [18:17:59] (03CR) 10BCornwall: "I had done that originally but removed it in PS6 because the values are output in the `RespHeader`. For example:" [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [18:21:48] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148924|Support creating logs in emptyUserGroup.php (T394914)]] (duration: 13m 18s) [18:21:52] T394914: Update emptyUserGroup.php to optionally support creating log entries for removal - https://phabricator.wikimedia.org/T394914 [18:22:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bookworm [18:23:00] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10845407 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm executed with errors: - sretest2003 (... 
[18:23:21] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10845409 (10Ladsgroup) >>! In T394624#10845303, @VRiley-WMF wrote: > This is completed Thanks! I started the mariadb daemons. [18:23:22] !log bking@cumin2002 conftool action : set/pooled=no:weight=10; selector: name=elastic1060.eqiad.wmnet|name=elastic1061.eqiad.wmnet|name=elastic1062.eqiad.wmnet|name=elastic1063.eqiad.wmnet|name=elastic1064.eqiad.wmnet|name=elastic1065.eqiad.wmnet|name=elastic1066.eqiad.wmnet|name=elastic1067.eqiad.wmnet|name=elastic1103.eqiad.wmnet [18:23:25] RECOVERY - MariaDB Replica IO: s7 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:31] RECOVERY - MariaDB Replica IO: s4 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:32] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:23:41] RECOVERY - MariaDB Replica IO: s6 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:41] RECOVERY - MariaDB Replica IO: s2 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:53] RECOVERY - MariaDB Replica IO: s6 on clouddb1019 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:53] RECOVERY - MariaDB Replica IO: s7 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:53] RECOVERY - MariaDB Replica IO: s4 on clouddb1019 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:53] RECOVERY - MariaDB Replica IO: s2 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:55] RECOVERY - MariaDB Replica IO: s6 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:55] RECOVERY - MariaDB Replica IO: s7 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:55] RECOVERY - MariaDB Replica IO: s4 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:23:55] RECOVERY - MariaDB Replica IO: s2 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:26:15] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10845425 (10Ladsgroup) Start replication. [18:30:59] (03PS2) 10Dzahn: zuul: create basic role/profile for zuul::man and install docker.io [puppet] - 10https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873) [18:32:33] (03PS3) 10Dzahn: role: delete requesttracker role [puppet] - 10https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777) [18:32:47] (03CR) 10Dzahn: "oh! thanks. done!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [18:35:37] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1103 to cirrussearch1103 [18:35:40] (03PS3) 10Dzahn: zuul: create basic role/profile for zuul::man and install docker.io [puppet] - 10https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873) [18:35:49] !log bking@cumin2002 START - Cookbook sre.dns.netbox [18:36:12] (03PS4) 10Dzahn: zuul: create role/profile for new zuul main servers, install docker.io [puppet] - 10https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873) [18:38:59] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1103 to cirrussearch1103 - bking@cumin2002" [18:39:41] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1103 to cirrussearch1103 - bking@cumin2002" [18:39:41] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:39:42] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1103 on all recursors [18:39:45] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1103 on all recursors [18:39:46] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1103 [18:40:56] 06SRE, 10SRE-Access-Requests, 06collaboration-services, 10Continuous-Integration-Infrastructure (Zuul upgrade): create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10845468 (10Dzahn) Fair enough. I'd be fine just using contint-roots. Though decom'ing groups is also not a big... 
[18:42:12] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1103 [18:42:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1103 to cirrussearch1103 [18:43:32] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:45:00] 06SRE, 10SRE-Access-Requests, 06collaboration-services, 10Continuous-Integration-Infrastructure (Zuul upgrade): create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10845486 (10Dzahn) ftr, the existing contint servers have all of these: ` profile::admin::groups: - contint-... 
[18:45:16] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye [18:45:20] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103 [18:45:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103 [18:52:08] (03PS1) 10Dzahn: zuul: add contint-roots admin group to new zuul::main role [puppet] - 10https://gerrit.wikimedia.org/r/1148937 (https://phabricator.wikimedia.org/T394819) [18:53:45] (03PS1) 10Jdlrobson: Fixes: TypeError: Cannot read properties of undefined (reading 'contains') [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148938 [18:53:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148938 (owner: 10Jdlrobson) [18:54:42] (03PS1) 10Jdlrobson: bookmark: Fix click event not working [extensions/ReadingLists] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148939 (https://phabricator.wikimedia.org/T394736) [18:54:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ReadingLists] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148939 (https://phabricator.wikimedia.org/T394736) (owner: 10Jdlrobson) [18:56:52] 06SRE, 10SRE-Access-Requests, 06collaboration-services, 10Continuous-Integration-Infrastructure (Zuul upgrade), 13Patch-For-Review: create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10845524 (10Dzahn) @thcipriani As the existing "approval"-person for the contint-roots rol... 
[19:01:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson) [19:10:22] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1066 to cirrussearch1066 [19:10:35] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:13:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:15:59] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1066 to cirrussearch1066 - bking@cumin2002" [19:16:22] jhancock@cumin2002 provision (PID 3570058) is awaiting input [19:17:25] RECOVERY - MariaDB Replica Lag: s7 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:17:25] RECOVERY - MariaDB Replica Lag: s7 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:17:25] RECOVERY - MariaDB Replica Lag: s7 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:18:13] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1066 to cirrussearch1066 - bking@cumin2002" [19:18:14] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:18:14] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1066 on all recursors [19:18:15] 10ops-codfw, 06SRE, 06DC-Ops, 
06Infrastructure-Foundations, 10netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10845555 (10Papaul) ` Case 2025-0520-703157 has been updated by Mathias Zuniga UPDATE HAS BEEN ADDED: Hello Team, Please could you bring me the following com... [19:18:17] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1066 on all recursors [19:18:18] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1066 [19:18:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:19:25] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1066 [19:19:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:20:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1066 to cirrussearch1066 [19:21:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:21:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [19:21:46] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10845559 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm [19:24:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1066.eqiad.wmnet with OS bullseye [19:24:43] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1066 [19:24:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1066 [19:24:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [19:24:56] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845567 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2004.codfw.wmnet with OS bookworm [19:28:32] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:35:09] RECOVERY - MariaDB Replica Lag: s6 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:35:25] RECOVERY - MariaDB Replica Lag: s6 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:35:53] RECOVERY - MariaDB Replica Lag: s6 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:41:28] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1066.eqiad.wmnet with reason: host reimage [19:43:35] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1103.eqiad.wmnet with OS bullseye [19:45:15] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on cirrussearch1066.eqiad.wmnet with reason: host reimage [19:50:04] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Upgrading to Java 11.0.27 - eevans@cumin1002 [19:51:26] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:51:32] RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:51:32] RECOVERY - MariaDB Replica Lag: s2 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:52:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [19:52:42] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye [19:52:46] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103 [19:52:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103 [19:56:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [19:56:45] (03PS1) 10Arlolra: Remove $wgParserEnableLegacyMediaDOM option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054) [19:58:40] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch1080.eqiad.wmnet|name=cirrussearch1081.eqiad.wmnet|name=cirrussearch1082.eqiad.wmnet|name=cirrussearch1083.eqiad.wmnet|name=cirrussearch1087.eqiad.wmnet|name=cirrussearch1088.eqiad.wmnet|name=cirrussearch1118.eqiad.wmnet|name=cirrussearch1119.eqiad.wmnet [19:59:21] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T2000). [20:00:05] ZhaoFJx, Tchanders, Kemayo, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] o/ [20:00:10] Just on time [20:00:46] o/ [20:02:24] My three need to all be deployed together (two backports and a config-change that makes them have an effect). I don't mind spiderpigging them myself. [20:02:27] (03PS1) 10Jforrester: [wikifunctions] Don't grant new generic-enum rights to Functioneers for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148951 (https://phabricator.wikimedia.org/T391913) [20:02:35] o/ [20:03:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [20:03:32] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:04:23] Kemayo: Are you going first? [20:04:52] Tchanders: Sure, I can. I was going to wait and see whether a deployer was going to show up first, but I don't mind just doing it. [20:05:42] Ah. I haven't seen a deployer at one of these for a while, but then I haven't done this time slot for a while... 
[20:07:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148909 (https://phabricator.wikimedia.org/T388604) (owner: 10DLynch) [20:07:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148908 (https://phabricator.wikimedia.org/T388604) (owner: 10DLynch) [20:07:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester) [20:07:55] Tchanders: I'll admit, I haven't done a backport myself since before spiderpig came about, so I'm not 100% sure what the etiquette is these days. [20:08:25] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10845666 (10Jclark-ctr) [20:08:32] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [20:09:13] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1065 to cirrussearch1065 [20:10:08] Kemayo: "Do as many patches as reasonable in one deploy, because they each take ~15 minutes at best". [20:11:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1066.eqiad.wmnet with OS bullseye [20:12:47] I can do the deploy, I suppose. [20:13:15] Oh, Kemayo is already on it, never mind. [20:13:59] It's even the deploy to unblock *you*. :D [20:14:04] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:14:05] jclark@cumin1002 netbox (PID 138500) is awaiting input [20:14:14] Yes yes, hence why I was going to do the deploy for you rather than have a very late lunch. [20:14:19] But given that, bye. 
:-) [20:15:00] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1006,1007 - jclark@cumin1002" [20:15:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1006,1007 - jclark@cumin1002" [20:15:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:18:17] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1065 to cirrussearch1065 - bking@cumin2002" [20:18:23] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1065 to cirrussearch1065 - bking@cumin2002" [20:18:23] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:18:24] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1065 on all recursors [20:18:27] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1065 on all recursors [20:18:28] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1065 [20:18:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:19:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:19:40] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1065 [20:19:55] (03Merged) 10jenkins-bot: Extend the mobile insert menu config so that tools can be specified [extensions/VisualEditor] 
(wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148909 (https://phabricator.wikimedia.org/T388604) (owner: 10DLynch) [20:19:56] (03Merged) 10jenkins-bot: Extend the mobile insert menu config so that tools can be specified [extensions/VisualEditor] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148908 (https://phabricator.wikimedia.org/T388604) (owner: 10DLynch) [20:20:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1065 to cirrussearch1065 [20:20:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:20:46] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1065.eqiad.wmnet with OS bullseye [20:20:51] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1065 [20:20:51] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1065 [20:22:39] jclark@cumin1002 provision (PID 139211) is awaiting input [20:23:24] jclark@cumin1002 provision (PID 139216) is awaiting input [20:23:41] jhancock@cumin2002 reimage (PID 3574856) is awaiting input [20:25:42] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:28:47] jhancock@cumin2002 reimage (PID 3573569) is awaiting input [20:30:51] !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:31:26] !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:34:24] RESOLVED: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:34:43] (03PS2) 10Jforrester: wikifunctions: Update orchestrator from 2025-05-14-112404 to 2025-05-21-192453 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148857 (https://phabricator.wikimedia.org/T385899) [20:34:46] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-05-12-235119 to 2025-05-21-192515 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148954 (https://phabricator.wikimedia.org/T385899) [20:36:32] Quick update: two of my patches merged, but I'm still waiting for the third one to finish. [20:37:59] Actually... I wonder if something is wedged out of position. The +2 is there on the patch, but no gate-and-submit. [20:38:02] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1065.eqiad.wmnet with reason: host reimage [20:39:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:39:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2004.codfw.wmnet with OS bookworm [20:39:50] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2004.codfw.wmnet with OS bookworm completed: - sretest2004 (**PASS**)... [20:40:22] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845770 (10Jhancock.wm) [20:41:13] Kemayo: Thanks for the update. 
(Also it's been easy to watch along via Spiderpig so thanks RelEng) [20:41:37] I'll bow out of this deployment window, since it's getting late here and not looking likely we'll get round to my patches [20:41:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1065.eqiad.wmnet with reason: host reimage [20:41:53] Sorry about it taking so long [20:42:07] jclark@cumin1002 provision (PID 139211) is awaiting input [20:43:01] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845780 (10Jhancock.wm) @RobH It passed reimaging with UEFI. You do have to turn some things off that might be a security issue. We discussed it in the last dcops meeting. The issue Joh... [20:43:05] (03CR) 10Jforrester: [C:03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester) [20:43:14] Kemayo: I've re-triggered it [20:43:52] (03Merged) 10jenkins-bot: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester) [20:44:09] James_F: thanks! Was just commenting on the patch enough, or did it need to be by someone who had +2 on the repo? [20:44:23] !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1148909|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148908|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148422|VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (T388604)]] [20:44:27] T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604 [20:44:39] Kemayo: Needed to be a C+2 comment. [20:45:00] Kemayo: But anyone with spiderpig deploy access will have C+2 access. 
[20:45:45] James_F: Not sure that's true -- I certainly don't on mediawiki-config. [20:46:37] !log kemayo@deploy1003 jforrester, kemayo: Backport for [[gerrit:1148909|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148908|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148422|VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (T388604)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:46:47] Kemayo: Oh dear, you should file a task about that. C+2 is available for wmf-deployment https://gerrit.wikimedia.org/r/admin/repos/operations/mediawiki-config,access [20:46:55] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845786 (10RobH) So there isn't a specific team in mind for this host, it really depends on what we have for Config D use in next fiscal. This fiscal, we purchased the following hostnam... [20:48:20] !log kemayo@deploy1003 jforrester, kemayo: Continuing with sync [20:48:52] Kemayo: No need to apologise! [20:50:27] jclark@cumin1002 provision (PID 139211) is awaiting input [20:53:03] (03PS1) 10SBassett: Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148956 [20:53:32] James_F: I guess that I skirted through the need to be in wmf-deployment, presumably because scap removed the actual requirement of being able to directly merge things. [20:53:53] Hmm, I thought spiderpig was just meant to be a nicer way of having the same access.
[20:55:19] !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148909|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148908|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148422|VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (T388604)]] (duration: 10m 56s) [20:55:21] I assume it's because it offloads the +2 to TrainBranchBot, so the actual user running scap or spiderpig doesn't need the access. [20:55:23] T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604 [20:55:38] Aye. [20:55:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:55:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2003.codfw.wmnet with OS bookworm [20:55:49] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10845818 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm completed: - sretest2003 (**WARN**)... [20:55:56] Anyway, you're done but the next window is in five minutes' time (and it's mine). [20:55:56] jclark@cumin1002 provision (PID 139211) is awaiting input [20:56:32] o7 [20:56:34] sbassett: Did you need to emergency-deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1148956 ? 
[20:57:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:57:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:00:04] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@2bce0c7]: Deploy Airflow artifact for T392494 and T394310. [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T2100) [21:00:08] T392494: Add data quality metrics to mediawiki_content_current_v1 - https://phabricator.wikimedia.org/T392494 [21:00:14] jclark@cumin1002 provision (PID 139216) is awaiting input [21:00:46] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [21:00:56] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100% [21:01:00] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@2bce0c7]: Deploy Airflow artifact for T392494 and T394310. 
(duration: 00m 55s) [21:02:14] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [21:02:16] (03CR) 10David Martin: [C:03+2] wikifunctions: Update evaluators from 2025-05-12-235119 to 2025-05-21-192515 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148954 (https://phabricator.wikimedia.org/T385899) (owner: 10Jforrester) [21:03:59] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-05-12-235119 to 2025-05-21-192515 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148954 (https://phabricator.wikimedia.org/T385899) (owner: 10Jforrester) [21:04:46] jhancock@cumin2002 provision (PID 3624840) is awaiting input [21:06:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:07:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1065.eqiad.wmnet with OS bullseye [21:08:20] !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:08:32] FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:09:02] !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:10:24] !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [21:10:29] 06SRE: when servers are about to run out of disk monitoring should notify the owners - https://phabricator.wikimedia.org/T394955 (10Dzahn) 03NEW [21:10:46] 06SRE: when servers are about to run out of disk monitoring should notify the owners - https://phabricator.wikimedia.org/T394955#10845869 (10Dzahn) [21:11:11] !log dmartin@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/wikifunctions: apply [21:11:31] 06SRE, 06serviceops-radar: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10845871 (10Dzahn) >>! In T392834#10782948, @Ladsgroup wrote: > Yeah. Can you file a ticket for better monitoring? done. T394955 [21:12:02] !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:12:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1103.eqiad.wmnet with OS bullseye [21:13:09] !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:15:01] (03CR) 10David Martin: [C:03+2] wikifunctions: Update orchestrator from 2025-05-14-112404 to 2025-05-21-192453 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148857 (https://phabricator.wikimedia.org/T385899) (owner: 10Jforrester) [21:15:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10845884 (10Jclark-ctr) [21:15:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:16:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:16:56] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-05-14-112404 to 2025-05-21-192453 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148857 (https://phabricator.wikimedia.org/T385899) (owner: 10Jforrester) [21:18:06] !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [21:18:30] !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [21:19:05] !log dmartin@deploy1003 helmfile [codfw] START 
helmfile.d/services/wikifunctions: apply [21:19:36] !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [21:19:51] !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [21:20:25] !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [21:25:12] (03PS1) 10Dzahn: aprepo: allow gitlab-ce and gitlab-runner versions > 17.10 < 17.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148966 (https://phabricator.wikimedia.org/T394953) [21:25:34] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [21:25:44] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [21:26:05] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845911 (10Jhancock.wm) okay. i'll do this bios test and wrap up the task then. I'll keep some notes for the next dcops meeting [21:26:12] (03CR) 10Dzahn: [C:03+2] aprepo: allow gitlab-ce and gitlab-runner versions > 17.10 < 17.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148966 (https://phabricator.wikimedia.org/T394953) (owner: 10Dzahn) [21:29:24] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1008,1009 - jclark@cumin1002" [21:29:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1008,1009 - jclark@cumin1002" [21:29:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:33:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:36:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [21:36:21] 
10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10845938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm [21:36:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm [21:36:50] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2004.codfw.wmnet with OS bookworm [21:37:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:37:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:43:14] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [21:43:38] RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [21:49:50] (03PS1) 10Ryan Kemper: wdqs: remove SLI/SLO for public wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) [21:50:10] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [21:51:01] (03CR) 10CI reject: [V:04-1] wdqs: remove SLI/SLO for public wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [21:51:16] (03PS2) 10Ryan Kemper: wdqs: remove SLI/SLO for public wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) [21:51:32] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.hosts.provision (exit_code=0) for host thanos-be1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:51:56] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:52:25] (03PS3) 10Ryan Kemper: wdqs: remove SLI/SLO for public wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) [21:52:25] (03CR) 10CI reject: [V:04-1] wdqs: remove SLI/SLO for public wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [21:53:34] (03CR) 10Ryan Kemper: [C:03+2] "@volans great catch! will fix these and/or delete these cookbooks" [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [21:56:32] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [21:58:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [21:59:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1006.eqiad.wmnet with OS bullseye [21:59:47] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10846017 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1006.eqiad.wmnet with OS bull... 
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T2200) [22:00:36] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10846019 (10Jclark-ctr) [22:02:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage [22:02:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye [22:02:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1007.eqiad.wmnet with OS bullseye [22:02:39] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103 [22:02:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103 [22:02:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10846026 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1... [22:02:59] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Upgrading to Java 11.0.27 - eevans@cumin1002 [22:06:49] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10846031 (10Jclark-ctr) @MatthewVernon these have been provisioned but look like i need to disable TLS. 
Will try again tomorrow [22:06:50] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10846032 (10Papaul) ` Case 2025-0520-703157 has been updated by Mathias Zuniga UPDATE HAS BEEN ADDED: Hi Papaul, Thank you for your update, I have opened a t... [22:08:32] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:08:43] (03PS1) 10Ryan Kemper: wdqs: add SLIs for main & scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1148976 [22:09:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10846033 (10Papaul) p:05Triage→03High a:03Jhancock.wm [22:09:31] (03CR) 10Bking: [C:03+1] wdqs: remove SLI/SLO for public wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [22:09:49] (03CR) 10Ryan Kemper: [C:03+2] wdqs: remove SLI/SLO for public wdqs [puppet] - 10https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [22:13:20] (03PS1) 10Ryan Kemper: wdqs: nuke previously absented pyrra update lag [puppet] - 10https://gerrit.wikimedia.org/r/1148979 (https://phabricator.wikimedia.org/T393966) [22:16:10] RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 54.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:16:26] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:19:58] RECOVERY - MariaDB Replica Lag: s4 on 
an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:20:21] jhancock@cumin2002 reimage (PID 3641329) is awaiting input [22:20:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2004.codfw.wmnet with OS bookworm [22:20:46] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10846049 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2004.codfw.wmnet with OS bookworm completed: - sretest2004 (**WARN**)... [22:23:32] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:25:56] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Fri 06 Jun 2025 10:25:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [22:41:01] Anybody care if I do a quick config deploy?  
It’s basically reverting the recent os/cu 2fa enforcement to allow for more comms: https://gerrit.wikimedia.org/r/1148956 [22:43:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbassett@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148956 (owner: 10SBassett) [22:44:11] (03Merged) 10jenkins-bot: Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148956 (owner: 10SBassett) [22:44:37] !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1148956|Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA"]] [22:46:57] !log sbassett@deploy1003 sbassett: Backport for [[gerrit:1148956|Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:47:54] !log sbassett@deploy1003 sbassett: Continuing with sync [22:52:28] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release [22:54:43] !log sbassett@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148956|Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA"]] (duration: 10m 05s) [22:56:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bookworm [22:56:13] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10846097 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm executed with errors: - sretest2003 (... 
[23:00:14] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release [23:01:15] 06SRE: when servers are about to run out of disk, monitoring should notify the owners - https://phabricator.wikimedia.org/T394955#10846113 (10Reedy) [23:02:48] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10846117 (10Jhancock.wm) this server did take to uefi but does not want to reimage to bios for some reason. bios image had an issue with the drive/raid config but uefi did not. will reim... [23:13:11] (03PS1) 10Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) [23:14:40] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 1.005e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [23:16:03] (03PS2) 10Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) [23:18:32] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:21:45] (03CR) 10Dzahn: [C:03+2] lists: include nftables throttling profile [puppet] - 10https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) (owner: 10Dzahn) [23:23:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host cirrussearch1103.eqiad.wmnet with OS bullseye [23:30:41] (03CR) 10Dzahn: [C:03+2] "originally I just wanted to include this but not enable it yet.. but then I did." [puppet] - 10https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) (owner: 10Dzahn) [23:31:54] (03CR) 10Dzahn: [C:03+2] "active on lists2002 - but not active yet on lists1004 because puppet is still disabled for now" [puppet] - 10https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) (owner: 10Dzahn) [23:39:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1148984 [23:39:24] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1148984 (owner: 10TrainBranchBot) [23:52:19] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1148984 (owner: 10TrainBranchBot)