[00:03:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:37] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1148488
[00:08:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1148488 (owner: TrainBranchBot)
[00:10:18] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 612.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:10:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10842269 (thcipriani) Stalled→Open >>! In T393723#10805887, @Eevans wrote: > @Jdlrobson-WMF this seems like an odd question after all this time, but have you signed {L3}? >  > And,...
[00:11:54] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10842274 (thcipriani) a:Jdlrobson-WMF→None
[00:12:15] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10842275 (thcipriani)
[00:12:19] <rzl>	 jouncebot: nowandnext
[00:12:19] <jouncebot>	 No deployments scheduled for the next 5 hour(s) and 47 minute(s)
[00:12:19] <jouncebot>	 In 5 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0600)
[00:12:53] <rzl>	 scapping out a no-op chart version bump to clean up the diff, only meaningful to mw-script
[00:14:40] <logmsgbot>	 !log rzl@deploy1003 Started scap sync-world: 1147918
[00:16:54] <logmsgbot>	 !log rzl@deploy1003 Finished scap sync-world: 1147918 (duration: 03m 27s)
[00:19:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1206 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:20:26] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1033.eqiad.wmnet
[00:20:50] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1073 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:21:22] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1034.eqiad.wmnet
[00:21:50] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1073 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:22:12] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1035.eqiad.wmnet
[00:25:15] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:25:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1206 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:25:59] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for jdlrobson - https://phabricator.wikimedia.org/T393723#10842303 (Jdlrobson-WMF) For clarity, I signed with this account on phab ( @Jdlrobson-WMF  ) {F60326325}
[00:28:27] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1148488 (owner: TrainBranchBot)
[00:30:17] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1034.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:30:26] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:30:39] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1034.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:30:39] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:30:40] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt1034.eqiad.wmnet
[00:33:07] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:33:07] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1035.eqiad.wmnet
[00:33:20] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:33:49] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1034.eqiad.wmnet
[00:33:52] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:36:29] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:36:30] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1033.eqiad.wmnet
[00:38:47] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:40:20] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:41:22] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1036.eqiad.wmnet
[00:41:35] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:41:36] <logmsgbot>	 !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt1034.eqiad.wmnet
[00:46:14] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:46:20] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:46:38] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1037.eqiad.wmnet
[00:49:56] <wikibugs>	 (PS3) RLazarus: deployment_server: Add --dblist to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479)
[00:50:07] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1036.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:51:06] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1036.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:51:06] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:51:06] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1036.eqiad.wmnet
[00:52:13] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:52:42] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1038.eqiad.wmnet
[00:56:28] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1037.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:57:10] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1037.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[00:57:10] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:57:11] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1037.eqiad.wmnet
[00:58:02] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[00:58:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:20] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudvirt1039.eqiad.wmnet
[00:59:48] <wikibugs>	 (PS4) RLazarus: deployment_server: Add --dblist to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479)
[01:01:45] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1038.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[01:02:21] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1038.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002"
[01:02:22] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:02:22] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1038.eqiad.wmnet
[01:05:25] <logmsgbot>	 !log andrew@cumin1002 START - Cookbook sre.dns.netbox
[01:08:05] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:08:06] <logmsgbot>	 !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1039.eqiad.wmnet
[01:09:00] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Remove mention of cloudvirt103[1-9].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1147883 (https://phabricator.wikimedia.org/T394727) (owner: Andrew Bogott)
[01:12:55] <wikibugs>	 ops-eqiad, cloud-services-team, DC-Ops, decommission-hardware, Patch-For-Review: decommission cloudvirt103[1-9].eqiad.wmnet - https://phabricator.wikimedia.org/T394727#10842371 (Andrew)
[01:26:41] <jinxer-wm>	 FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[01:42:24] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Reuven! Two optional wording suggestions, but otherwise LGTM." [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479) (owner: RLazarus)
[01:51:20] <wikibugs>	 (PS5) RLazarus: deployment_server: Add --dblist to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479)
[01:57:29] <wikibugs>	 (CR) RLazarus: [C:+2] "Thanks for the reviews!" [puppet] - https://gerrit.wikimedia.org/r/1148450 (https://phabricator.wikimedia.org/T378479) (owner: RLazarus)
[02:08:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:20:34] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: OpenSent - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:43:29] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[02:56:49] <wikibugs>	 (PS2) DLynch: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: Jforrester)
[03:04:24] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 3.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:15:35] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[03:28:29] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[03:38:29] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:55:40] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:55:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 139, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:56:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:01:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:03:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:10:40] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:10:46] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 140, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:11:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:53:50] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s7 on clouddb1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@db1155.eqiad.wmnet:3317 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:10] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86588.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:20] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86591.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:22] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86594.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:22] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86612.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:30] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86620.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:30] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86620.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:54:48] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86626.76 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:55:22] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86654.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:55:22] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86661.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:55:34] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1014.eqiad.wmnet with reason: Maintenance
[04:55:53] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1015.eqiad.wmnet with reason: Maintenance
[04:56:14] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1018.eqiad.wmnet with reason: Maintenance
[04:56:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1019.eqiad.wmnet with reason: Maintenance
[04:56:35] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Maintenance
[04:56:49] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1155.eqiad.wmnet with reason: Maintenance
[04:57:21] <wikibugs>	 (CR) Arnaudb: [C:+1] wikimedia.org: add gerrit-ssh, gerrit-replica-ssh records [dns] - https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: Dzahn)
[04:58:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:07:27] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[05:07:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P76342 and previous config saved to /var/cache/conftool/dbconfig/20250521-050730-marostegui.json
[05:08:29] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:19:32] <wikibugs>	 (CR) Marostegui: "Can you test this on db2186 and/or db2187?" [cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: Federico Ceratto)
[05:22:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db1169 with 10%', diff saved to https://phabricator.wikimedia.org/P76343 and previous config saved to /var/cache/conftool/dbconfig/20250521-052258-marostegui.json
[05:26:41] <jinxer-wm>	 FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[05:31:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase weight for db1169', diff saved to https://phabricator.wikimedia.org/P76344 and previous config saved to /var/cache/conftool/dbconfig/20250521-053116-marostegui.json
[05:58:57] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 22616
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0600)
[06:03:02] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 22616
[06:03:29] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:08:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:13:40] <wikibugs>	 (PS1) Ayounsi: Revert "BFDdown: don't deploy in codfw" [alerts] - https://gerrit.wikimedia.org/r/1148501
[06:16:18] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to https://idm.wikimedia.org/ for maxbinderWMF - https://phabricator.wikimedia.org/T394523#10842591 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[06:16:29] <wikibugs>	 (CR) Ayounsi: [C:+2] Revert "BFDdown: don't deploy in codfw" [alerts] - https://gerrit.wikimedia.org/r/1148501 (owner: Ayounsi)
[06:17:55] <wikibugs>	 (Merged) jenkins-bot: Revert "BFDdown: don't deploy in codfw" [alerts] - https://gerrit.wikimedia.org/r/1148501 (owner: Ayounsi)
[06:18:35] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10842593 (ayounsi)
[06:18:45] <wikibugs>	 (CR) Slyngshede: [C:+2] P:ldap::client::ldaptui Add missing aux schemas [puppet] - https://gerrit.wikimedia.org/r/1148279 (https://phabricator.wikimedia.org/T394341) (owner: Slyngshede)
[06:26:39] <wikibugs>	 (Abandoned) Stang: Add main page on non-English privatewiki to wgWhitelistRead [mediawiki-config] - https://gerrit.wikimedia.org/r/850266 (https://phabricator.wikimedia.org/T321796) (owner: Stang)
[06:34:58] <wikibugs>	 (PS1) Muehlenhoff: snapshot: Remove now unused Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1148751 (https://phabricator.wikimedia.org/T394647)
[06:43:29] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:44:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase weight for db1169', diff saved to https://phabricator.wikimedia.org/P76345 and previous config saved to /var/cache/conftool/dbconfig/20250521-064444-marostegui.json
[06:48:40] <wikibugs>	 (CR) Ayounsi: "I'm not that familiar with this piece of code, but lgtm overall." [puppet] - https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: Volans)
[06:52:28] <wikibugs>	 SRE, collaboration-services, Infrastructure-Foundations, vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10842631 (hashar)
[06:52:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
[06:55:13] <XioNoX>	 !log push pfw policies - T394728
[06:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase weight for db1169', diff saved to https://phabricator.wikimedia.org/P76346 and previous config saved to /var/cache/conftool/dbconfig/20250521-065618-marostegui.json
[06:58:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0700)
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:29] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Enable the remaining two maps nodes as replicas [puppet] - https://gerrit.wikimedia.org/r/1148351 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[07:01:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase weight for db1169', diff saved to https://phabricator.wikimedia.org/P76347 and previous config saved to /var/cache/conftool/dbconfig/20250521-070156-marostegui.json
[07:05:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet
[07:06:57] <wikibugs>	 (CR) Marostegui: [C:+1] sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) (owner: Federico Ceratto)
[07:08:20] <wikibugs>	 (CR) Slyngshede: [C:+1] "Looks good to me. Bumping to "stable" seems reasonable :-)" [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1148434 (owner: Volans)
[07:09:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet
[07:11:20] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good, one nit inline" [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1148434 (owner: Volans)
[07:15:35] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:18:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[07:24:28] <hashar>	 I am restarting Gerrit
[07:24:30] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1148267 (owner: Volans)
[07:24:53] <vgutierrez>	 503s in gerrit.. oh I see...
[07:27:22] <wikibugs>	 (PS1) Elukey: conftool-data: remove ml-serve1001 from lvs/pybal [puppet] - https://gerrit.wikimedia.org/r/1148787 (https://phabricator.wikimedia.org/T387854)
[07:27:24] <wikibugs>	 (PS1) Elukey: role::ml_k8s::worker: set ml-serve1001 for Bookworm/containerd [puppet] - https://gerrit.wikimedia.org/r/1148788 (https://phabricator.wikimedia.org/T387854)
[07:27:28] <wikibugs>	 (PS1) Elukey: conftool-data: add ml-serve1001 back to its pool after reimage [puppet] - https://gerrit.wikimedia.org/r/1148789 (https://phabricator.wikimedia.org/T387854)
[07:28:29] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[07:29:14] <wikibugs>	 (CR) Elukey: role::ml_k8s::worker: set ml-serve1001 for Bookworm/containerd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1148788 (https://phabricator.wikimedia.org/T387854) (owner: Elukey)
[07:29:22] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage pc1018 [puppet] - https://gerrit.wikimedia.org/r/1148786 (https://phabricator.wikimedia.org/T394260) (owner: Marostegui)
[07:31:58] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5630/" [puppet] - https://gerrit.wikimedia.org/r/1148788 (https://phabricator.wikimedia.org/T387854) (owner: Elukey)
[07:32:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1187 T394623', diff saved to https://phabricator.wikimedia.org/P76348 and previous config saved to /var/cache/conftool/dbconfig/20250521-073207-marostegui.json
[07:32:11] <stashbot>	 T394623: MariaDB 10.6.22 released - https://phabricator.wikimedia.org/T394623
[07:32:29] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[07:32:45] <marostegui>	 !log Install 10.6.22 on db1187 T394623
[07:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76349 and previous config saved to /var/cache/conftool/dbconfig/20250521-073336-root.json
[07:34:49] <marostegui>	 !log Move s5 codfw to SBR T383795
[07:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:52] <stashbot>	 T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795
[07:36:42] <wikibugs>	 (03CR) 10Vgutierrez: haproxy: normalize host header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148373 (https://phabricator.wikimedia.org/T392880) (owner: 10Fabfur)
[07:40:27] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10842696 (10Jelto) `s3://gitlab-packages` is empty after several hours (`s3cmd del --force --recursive s3://gitlab-packages/`). Usi...
[07:43:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[07:44:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[07:48:02] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:48:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76350 and previous config saved to /var/cache/conftool/dbconfig/20250521-074841-root.json
[07:48:57] <wikibugs>	 (03CR) 10Jelto: "I think we also need `PTR` records?" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn)
[07:50:02] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:50:12] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.codfw.wmnet
[07:50:54] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1016.eqiad.wmnet with OS bullseye
[07:50:59] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10842722 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo...
[07:51:56] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: robots.txt: add crawl-delay directive for semrushbot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148791
[07:52:54] <wikibugs>	 (03PS1) 10Ayounsi: Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641)
[07:53:02] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:53:29] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:54:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[07:56:02] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:56:24] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.codfw.wmnet
[07:56:38] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:56:56] <wikibugs>	 (03PS2) 10Ayounsi: Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641)
[07:58:18] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: staging-eqiad: Specify MTU of 1460 [puppet] - 10https://gerrit.wikimedia.org/r/1148795 (https://phabricator.wikimedia.org/T352956)
[07:59:34] <wikibugs>	 (03CR) 10Elukey: homer: make private repo support multiple peers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148268 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans)
[07:59:54] <wikibugs>	 (03PS1) 10Jelto: gitlab: enable object storage for gitlab-artifacts in production [puppet] - 10https://gerrit.wikimedia.org/r/1148796 (https://phabricator.wikimedia.org/T378922)
[08:00:05] <jouncebot>	 andre and jnuche: MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0800). Please do the needful.
[08:00:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] staging-eqiad: Specify MTU of 1460 [puppet] - 10https://gerrit.wikimedia.org/r/1148795 (https://phabricator.wikimedia.org/T352956) (owner: 10Alexandros Kosiaris)
[08:01:16] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "Looking at our manifests, the **sole usage** is `modules/homer/manifests/init.pp`:" [puppet] - 10https://gerrit.wikimedia.org/r/1148267 (owner: 10Volans)
[08:01:54] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 145 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:02:59] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1148796 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[08:03:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:03:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76351 and previous config saved to /var/cache/conftool/dbconfig/20250521-080346-root.json
[08:05:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[08:06:34] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1016.eqiad.wmnet with reason: host reimage
[08:10:17] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1016.eqiad.wmnet with reason: host reimage
[08:13:08] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148800 (https://phabricator.wikimedia.org/T392172)
[08:13:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148800 (https://phabricator.wikimedia.org/T392172) (owner: 10TrainBranchBot)
[08:13:51] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10842850 (10jcrespo) gitlab-artifacts is failing quite a lot to backup- so many entries on the log with missing file. Unsure if due...
[08:13:57] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148800 (https://phabricator.wikimedia.org/T392172) (owner: 10TrainBranchBot)
[08:18:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76352 and previous config saved to /var/cache/conftool/dbconfig/20250521-081851-root.json
[08:19:36] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1017.eqiad.wmnet with OS bullseye
[08:19:42] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10842866 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo...
[08:19:55] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[08:20:07] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10842870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo...
[08:21:41] <wikibugs>	 (03CR) 10Klausman: "Thanks for making this!" [puppet] - 10https://gerrit.wikimedia.org/r/1148787 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[08:23:31] <logmsgbot>	 !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.2  refs T392172
[08:23:35] <stashbot>	 T392172: 1.45.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T392172
[08:26:29] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1016.eqiad.wmnet with OS bullseye
[08:26:35] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10842914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host kafka-jumbo1016...
[08:27:42] <wikibugs>	 (03PS2) 10Hnowlan: trafficserver: route testwiki reading lists APIs without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1148285 (https://phabricator.wikimedia.org/T384891)
[08:29:03] <wikibugs>	 (03CR) 10Vgutierrez: "no, it needs to be a new key called `block_help` under `profile::cache::varnish::frontend::fe_vcl_config`" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah)
[08:29:30] <Emperor>	 !log disable puppet on thanos-fe1001 and thanos-fe1004 T391352
[08:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:34] <stashbot>	 T391352: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352
[08:29:44] <wikibugs>	 (03CR) 10Klausman: [C:03+2] conftool-data: remove ml-serve1001 from lvs/pybal [puppet] - 10https://gerrit.wikimedia.org/r/1148787 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[08:29:46] <wikibugs>	 (03CR) 10Klausman: [C:03+2] conftool-data: add ml-serve1001 back to its pool after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1148789 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[08:29:50] <wikibugs>	 (03CR) 10Klausman: [C:03+2] role::ml_k8s::worker: set ml-serve1001 for Bookworm/containerd [puppet] - 10https://gerrit.wikimedia.org/r/1148788 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[08:32:05] <wikibugs>	 (03PS1) 10Majavah: hieradata: Update striker-toolsbeta to 2025-05-21-082129-production [puppet] - 10https://gerrit.wikimedia.org/r/1148801
[08:32:28] <wikibugs>	 (03CR) 10MVernon: [C:03+2] thanos: remove old frontends thanos-fe100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1148330 (https://phabricator.wikimedia.org/T391352) (owner: 10MVernon)
[08:32:54] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] liberica: Don't deploy ipip-multiqueue-optimizer with katran [puppet] - 10https://gerrit.wikimedia.org/r/1148337 (https://phabricator.wikimedia.org/T380450) (owner: 10Vgutierrez)
[08:33:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76353 and previous config saved to /var/cache/conftool/dbconfig/20250521-083358-root.json
[08:34:21] <wikibugs>	 (03PS1) 10Gkyziridis: admin_ng/LiftWing: add edit-check namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148803 (https://phabricator.wikimedia.org/T394779)
[08:34:27] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10842963 (10Jelto) >>! In T378922#10842850, @jcrespo wrote: > gitlab-artifacts is failing quite a lot to backup- so many entries on...
[08:34:35] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve1001.eqiad.wmnet
[08:34:35] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host ml-serve1001.eqiad.wmnet
[08:35:00] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1017.eqiad.wmnet with reason: host reimage
[08:35:46] <wikibugs>	 (03CR) 10Majavah: [C:03+2] hieradata: Update striker-toolsbeta to 2025-05-21-082129-production [puppet] - 10https://gerrit.wikimedia.org/r/1148801 (owner: 10Majavah)
[08:38:25] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1017.eqiad.wmnet with reason: host reimage
[08:38:30] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10842971 (10jcrespo) ` 21-May 07:02 gitlab2002.wikimedia.org-fd JobId 627067:      Could not stat "/srv/gitlab-backup/artifacts/02/...
[08:42:55] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm
[08:43:33] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on P{thanos-fe100[4-7]*} or P{thanos-fe2*} and (A:thanos-fe or A:thanos-fe-codfw or A:thanos-fe-eqiad)
[08:44:45] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:44:45] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:47:28] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on P{thanos-fe100[4-7]*} or P{thanos-fe2*} and (A:thanos-fe or A:thanos-fe-codfw or A:thanos-fe-eqiad)
[08:48:58] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.decommission for hosts thanos-fe[1001-1003].eqiad.wmnet
[08:49:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76354 and previous config saved to /var/cache/conftool/dbconfig/20250521-084904-root.json
[08:50:42] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10843015 (10Jelto) Interesting, thank you. I think this has nothing to do with the ongoing work here. Bacula is trying to back up t...
[08:53:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:48] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10843044 (10jcrespo) Thank you. I will try to separate those jobs to the dedicated storage hosts asap.  > @jcrespo is there a fixed...
[08:53:52] <wikibugs>	 06SRE, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: improve docker registry architecture - https://phabricator.wikimedia.org/T209271#10843046 (10elukey) 05Open→03Resolved a:03elukey
[08:54:09] <wikibugs>	 (03PS1) 10Jelto: gitlab: also exclude artifacts from partial backups [puppet] - 10https://gerrit.wikimedia.org/r/1148804 (https://phabricator.wikimedia.org/T378922)
[08:54:12] <wikibugs>	 06SRE, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Set up a local redis proxy since docker-registry can only connect to one redis instance for caching - https://phabricator.wikimedia.org/T215809#10843055 (10elukey) 05Open→03Resolved a:03elukey Already implemented.
[08:54:52] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1017.eqiad.wmnet with OS bullseye
[08:55:01] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843062 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host kafka-jumbo1017...
[08:56:19] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T394894 (10MatthewVernon) 03NEW
[08:57:05] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10843082 (10MatthewVernon)
[08:57:18] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10843086 (10MatthewVernon)
[08:59:07] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage
[09:00:09] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.sanitize-wiki - handle multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1139035 (https://phabricator.wikimedia.org/T366146) (owner: 10Federico Ceratto)
[09:02:26] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.dns.netbox
[09:02:59] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage
[09:02:59] <XioNoX>	 !log cr2-eqdfw# set protocols bgp graceful-shutdown sender - T364092
[09:03:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:04] <stashbot>	 T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092
[09:04:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76355 and previous config saved to /var/cache/conftool/dbconfig/20250521-090409-root.json
[09:06:16] <logmsgbot>	 elukey@cumin1002 reimage (PID 4089853) is awaiting input
[09:08:16] <logmsgbot>	 !log ayounsi@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr2-eqdfw,cr2-eqdfw IPv6,cr2-eqdfw.mgmt with reason: router upgrade
[09:08:28] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thanos-fe[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1002"
[09:08:29] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:08:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10843106 (10FCeratto-WMF) a:05FCeratto-WMF→03VRiley-WMF
[09:09:44] <logmsgbot>	 !log brouberol@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[09:09:52] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843111 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host kafka-jumbo1018...
[09:09:59] <logmsgbot>	 elukey@cumin1002 reimage (PID 4089853) is awaiting input
[09:10:12] <wikibugs>	 (03PS2) 10Elukey: conftool-data: add ml-serve1001 back to its pool after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1148789 (https://phabricator.wikimedia.org/T387854)
[09:10:12] <wikibugs>	 (03PS1) 10Elukey: Remove ROCM version for ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/1148806
[09:10:49] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[09:10:59] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843113 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo...
[09:11:04] <wikibugs>	 (03PS1) 10David Caro: cloud: move images to use docker-registry.svc.t.o [puppet] - 10https://gerrit.wikimedia.org/r/1148808
[09:11:08] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10843114 (10MatthewVernon) >>! In T378922#10842696, @Jelto wrote: > `s3://gitlab-packages` is empty after several hours (`s3cmd del...
[09:11:33] <logmsgbot>	 mvernon@cumin1002 decommission (PID 4090552) is awaiting input
[09:11:34] <logmsgbot>	 !log ayounsi@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: router upgrade
[09:11:37] <wikibugs>	 (03CR) 10Elukey: [C:03+2] Remove ROCM version for ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/1148806 (owner: 10Elukey)
[09:11:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10843115 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=048b70e3-25f1-4871-b6c8-5ea7b074de1e) set by ayounsi@cumin1002 for 2:00:00 on 2 host(s) and their servic...
[09:12:04] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-serve1001.eqiad.wmnet with OS bookworm
[09:12:20] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thanos-fe[1001-1003].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1002"
[09:12:21] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:12:21] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thanos-fe[1001-1003].eqiad.wmnet
[09:12:29] <wikibugs>	 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352#10843117 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1002 for hosts: `thanos-fe[1001-1003].eqiad.wmnet` - thanos-fe1001.eqiad.wmnet (**PASS**)   - Downti...
[09:12:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10843121 (10FCeratto-WMF) @VRiley-WMF I've been suggested to assign this task to you while we wait for the RMA, I hope you don't mind :)
[09:13:07] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm
[09:13:17] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host ml-serve1001
[09:13:17] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host ml-serve1001
[09:13:26] <XioNoX>	 !log cr2-eqdfw - shutdown transit/ix BGP sessions - T364092
[09:13:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:29] <stashbot>	 T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092
[09:15:56] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10843130 (10Jelto) >>! In T378922#10843114, @MatthewVernon wrote: >>>! In T378922#10842696, @Jelto wrote: >> `s3://gitlab-packages`...
[09:16:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[09:17:57] <wikibugs>	 (03CR) 10JMeybohm: "As with the service-catalog change I do not understand why this should be an active/passive (e.g. -ro) service. In my understanding this i" [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[09:19:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1187 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76356 and previous config saved to /var/cache/conftool/dbconfig/20250521-091914-root.json
[09:21:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[09:22:24] <XioNoX>	 !log cr2-eqdfw> request vmhost reboot - T364092
[09:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:28] <stashbot>	 T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092
[09:22:59] <XioNoX>	 now we wait
[09:23:25] <jynus>	 😯
[09:24:04] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: route testwiki reading lists APIs without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1148285 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan)
[09:25:20] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1001.eqiad.wmnet with OS bookworm
[09:25:41] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:25:41] <icinga-wm>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:25:41] <icinga-wm>	 PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:25:49] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:25:49] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:25:49] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:25:49] <icinga-wm>	 PROBLEM - BFD status on cr2-magru is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:25:49] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:25:50] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:25:50] <icinga-wm>	 PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:26:04] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm
[09:26:10] <jinxer-wm>	 FIRING: [4x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:26:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-drmrs and cr2-eqdfw (208.80.153.204) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:26:41] <jinxer-wm>	 FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[09:27:22] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1148808 (owner: 10David Caro)
[09:30:05] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1148796 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[09:30:12] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1148804 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[09:30:55] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[09:31:02] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[09:31:10] <jinxer-wm>	 FIRING: [5x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:31:18] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[09:31:31] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[09:31:39] <jinxer-wm>	 FIRING: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqdfw (208.80.153.198) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:32:24] <Emperor>	 !log radosgw-admin bucket rm --bucket=gitlab-packages --bypass-gc --purge-objects T378922
[09:32:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:27] <stashbot>	 T378922: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922
[09:32:44] <icinga-wm>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:32:44] <icinga-wm>	 RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:32:48] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:32:48] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:32:48] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:32:48] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:32:50] <icinga-wm>	 RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:32:50] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:33:40] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:34:06] <Emperor>	 !log radosgw-admin bucket rm --bucket=gitlab-artifacts --bypass-gc --purge-objects T378922
[09:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:48] <icinga-wm>	 RECOVERY - BFD status on cr2-magru is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:36:10] <jinxer-wm>	 RESOLVED: [5x] BFDdown: BFD session down between cr2-drmrs and 208.80.153.204 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status  - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[09:36:39] <jinxer-wm>	 RESOLVED: [7x] CoreBGPDown: Core BGP session down between cr1-codfw and cr2-eqdfw (208.80.153.198) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[09:36:41] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: fix typo in incoming URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148813 (https://phabricator.wikimedia.org/T384891)
[09:38:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10843229 (10cmooney) @papaul one thing I noticed looking at the cables in Netbox from the new spine switches to the ones in row A-D is that they look like a straight patch?  But I believe the...
[09:38:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[09:38:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10843244 (10cmooney) 05Open→03Resolved License is now applied and inventory items updated for cr1-codfw and cr2-codfw.
[09:38:54] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable edge uniques on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1148814 (https://phabricator.wikimedia.org/T391411)
[09:39:14] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148814 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[09:39:14] <icinga-wm>	 RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 194, down: 12, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:40:59] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10843254 (10ayounsi)
[09:41:16] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10843256 (10ayounsi) 05Open→03Resolved All done! Thank you all.
[09:43:18] <wikibugs>	 (03PS1) 10Majavah: openstack: cinder: Ensure cinder-volume service is running [puppet] - 10https://gerrit.wikimedia.org/r/1148815
[09:44:14] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate lonelypages job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1148391 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan)
[09:44:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack: cinder: Ensure cinder-volume service is running [puppet] - 10https://gerrit.wikimedia.org/r/1148815 (owner: 10Majavah)
[09:44:51] <wikibugs>	 (03PS2) 10Majavah: openstack: cinder: Ensure cinder-volume service is running [puppet] - 10https://gerrit.wikimedia.org/r/1148815
[09:46:30] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1148815 (owner: 10Majavah)
[09:46:42] <wikibugs>	 (03CR) 10Majavah: [C:03+2] openstack: cinder: Ensure cinder-volume service is running [puppet] - 10https://gerrit.wikimedia.org/r/1148815 (owner: 10Majavah)
[09:46:51] <wikibugs>	 (03PS3) 10Ayounsi: Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641)
[09:48:51] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] rest-gateway: fix typo in incoming URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148813 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan)
[09:50:22] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: fix typo in incoming URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148813 (https://phabricator.wikimedia.org/T384891) (owner: 10Hnowlan)
[09:52:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:54:10] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[09:56:26] <wikibugs>	 (03PS2) 10Cathal Mooney: New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021)
[09:56:57] <wikibugs>	 (03PS1) 10Gmodena: EventStreamConfig: add staging page_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899)
[09:57:12] <wikibugs>	 (03PS2) 10Hnowlan: mw::maintenance: migrate cleanupUploadStash job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144507 (https://phabricator.wikimedia.org/T385868)
[09:57:15] <hnowlan>	 jouncebot: nowandnext
[09:57:15] <jouncebot>	 For the next 0 hour(s) and 2 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T0800)
[09:57:15] <jouncebot>	 In 0 hour(s) and 2 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1000)
[09:57:23] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[09:57:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:57:52] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[09:58:16] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[09:58:48] <wikibugs>	 (03Merged) 10jenkins-bot: Add alerting for important switch interfaces [alerts] - 10https://gerrit.wikimedia.org/r/1148792 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[09:58:56] <wikibugs>	 (03PS1) 10Majavah: P:toolforge: legacy_redirector: Enable IPv6 monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1148819 (https://phabricator.wikimedia.org/T392506)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1000)
[10:00:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[10:00:31] <wikibugs>	 (03CR) 10Herron: [C:03+1] grafana: Remove stunnel from quickdatacopy sync [puppet] - 10https://gerrit.wikimedia.org/r/1148468 (https://phabricator.wikimedia.org/T393738) (owner: 10Andrea Denisse)
[10:00:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2036', diff saved to https://phabricator.wikimedia.org/P76357 and previous config saved to /var/cache/conftool/dbconfig/20250521-100055-marostegui.json
[10:01:24] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5633/" [puppet] - 10https://gerrit.wikimedia.org/r/1148819 (https://phabricator.wikimedia.org/T392506) (owner: 10Majavah)
[10:01:45] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] alertmanager: add receiver and routing for MediaWiki-File-management tasks [puppet] - 10https://gerrit.wikimedia.org/r/1148485 (https://phabricator.wikimedia.org/T385868) (owner: 10Scott French)
[10:01:54] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate cleanupUploadStash job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1144507 (https://phabricator.wikimedia.org/T385868) (owner: 10Hnowlan)
[10:02:08] <wikibugs>	 (03PS1) 10Marostegui: es2036: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148820 (https://phabricator.wikimedia.org/T394469)
[10:02:15] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1144507 (https://phabricator.wikimedia.org/T385868) (owner: 10Hnowlan)
[10:02:17] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2036.codfw.wmnet with reason: Maintenance
[10:03:14] <wikibugs>	 (03CR) 10Cathal Mooney: New device additions for codfw expansion plus policy changes (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[10:03:22] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] es2036: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1148820 (https://phabricator.wikimedia.org/T394469) (owner: 10Marostegui)
[10:03:23] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148819 (https://phabricator.wikimedia.org/T392506) (owner: 10Majavah)
[10:03:24] <jinxer-wm>	 FIRING: SystemdUnitFailed: networking.service on mc-misc2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:34] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: legacy_redirector: Enable IPv6 monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1148819 (https://phabricator.wikimedia.org/T392506) (owner: 10Majavah)
[10:04:06] <marostegui>	 taavi: can I merge your change?
[10:04:17] <taavi>	 marostegui: yes please
[10:04:21] <marostegui>	 doing!
[10:04:27] <taavi>	 thanks!
[10:07:08] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync
[10:07:23] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync
[10:08:29] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:11:00] <wikibugs>	 (03CR) 10David Caro: [C:03+2] cloud: move images to use docker-registry.svc.t.o [puppet] - 10https://gerrit.wikimedia.org/r/1148808 (owner: 10David Caro)
[10:11:10] <wikibugs>	 (03CR) 10David Caro: [C:03+2] "deployed in toolsbeta and tools" [puppet] - 10https://gerrit.wikimedia.org/r/1148808 (owner: 10David Caro)
[10:12:47] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French)
[10:12:51] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274)
[10:13:43] <wikibugs>	 (03Abandoned) 10Hnowlan: mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133079 (https://phabricator.wikimedia.org/T388140) (owner: 10Hnowlan)
[10:14:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76358 and previous config saved to /var/cache/conftool/dbconfig/20250521-101412-root.json
[10:15:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[10:15:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo)
[10:17:00] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] admin_ng/LiftWing: add edit-check namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148803 (https://phabricator.wikimedia.org/T394779) (owner: 10Gkyziridis)
[10:17:59] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "whenever you want..." [puppet] - 10https://gerrit.wikimedia.org/r/1148814 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[10:19:27] <wikibugs>	 10ops-codfw, 06DC-Ops: Power Supply - PS Redundancy - issue on ms-fe2012:9290 - https://phabricator.wikimedia.org/T394901 (10phaultfinder) 03NEW
[10:20:03] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[10:21:30] <wikibugs>	 (03PS1) 10Majavah: puppet_statsd: Uninstall now that statsd is read-only [puppet] - 10https://gerrit.wikimedia.org/r/1148825
[10:22:40] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10843403 (10MatthewVernon) @Jelto both buckets deleted.
[10:23:18] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274)
[10:23:41] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1001.eqiad.wmnet with OS bookworm
[10:24:02] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm
[10:24:12] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1148814 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[10:26:10] <vgutierrez>	 !log enabling edge uniques on cp3066 - T391411
[10:26:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:13] <stashbot>	 T391411: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411
[10:27:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[10:27:45] <logmsgbot>	 brouberol@cumin2002 reimage (PID 3290949) is awaiting input
[10:27:54] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: networking.service on mc-misc2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:28:44] <wikibugs>	 (03CR) 10Jcrespo: "Amir: let me know what you think. Once deployed, I will run puppet and restart the x3 instances and move them to x3 upstream." [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo)
[10:29:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76360 and previous config saved to /var/cache/conftool/dbconfig/20250521-102917-root.json
[10:30:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[10:37:11] <moritzm>	 !log installing expat security updates
[10:37:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:36] <wikibugs>	 (03PS1) 10Jelto: gerrit/nftables_throttling: make abusers more generic [puppet] - 10https://gerrit.wikimedia.org/r/1148826 (https://phabricator.wikimedia.org/T394519)
[10:38:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public
[10:39:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[10:39:17] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[10:40:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public
[10:40:49] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5634/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148826 (https://phabricator.wikimedia.org/T394519) (owner: 10Jelto)
[10:40:50] <wikibugs>	 (03PS1) 10Tchanders: Temp accounts: Set group requirements for IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148827 (https://phabricator.wikimedia.org/T393615)
[10:41:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:41:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all
[10:42:53] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[10:43:24] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1001.eqiad.wmnet with OS bookworm
[10:43:32] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[10:44:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P76362 and previous config saved to /var/cache/conftool/dbconfig/20250521-104422-root.json
[10:44:26] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] Temp accounts: Set group requirements for IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148827 (https://phabricator.wikimedia.org/T393615) (owner: 10Tchanders)
[10:46:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:46:34] <wikibugs>	 (03CR) 10Dreamy Jazz: [C:03+1] Temp accounts: Set group requirements for IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148827 (https://phabricator.wikimedia.org/T393615) (owner: 10Tchanders)
[10:51:21] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[10:51:32] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, 10Data-Platform-SRE (2025.05.02 - 2025.05.23): Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843485 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin2002 for host kafka-jumbo...
[10:53:30] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "lgtm! thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1148826 (https://phabricator.wikimedia.org/T394519) (owner: 10Jelto)
[10:53:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[10:53:58] <wikibugs>	 (03PS3) 10Máté Szabó: Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086)
[10:54:14] <wikibugs>	 (03PS4) 10Máté Szabó: Update IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146969 (https://phabricator.wikimedia.org/T375086)
[10:55:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all
[10:56:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw
[10:58:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw
[10:59:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76363 and previous config saved to /var/cache/conftool/dbconfig/20250521-105928-root.json
[11:00:05] <jouncebot>	 mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1100).
[11:01:07] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync
[11:01:10] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync
[11:02:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[11:08:25] <wikibugs>	 (03PS1) 10STran: Add mediawiki.ForeignApi.core as a dependency [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720)
[11:10:22] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran)
[11:14:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P76364 and previous config saved to /var/cache/conftool/dbconfig/20250521-111433-root.json
[11:15:31] <wikibugs>	 (03PS1) 10Arthur taylor: Enabled ScopedTypeaheadSearch for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669)
[11:17:16] <wikibugs>	 (03PS6) 10Slyngshede: VueJS Permissions App [software/bitu] - 10https://gerrit.wikimedia.org/r/1140498
[11:18:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[11:21:17] <wikibugs>	 (03PS2) 10Arthur taylor: Enabled ScopedTypeaheadSearch for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669)
[11:22:28] <wikibugs>	 (03PS1) 10Majavah: openstack: wmcs-enc-cli: Pass project to keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/1148835 (https://phabricator.wikimedia.org/T394775)
[11:22:31] <wikibugs>	 (03PS1) 10Majavah: openstack: wmcs-webproxy: Pass project to keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/1148836
[11:22:45] <wikibugs>	 (03PS2) 10Gmodena: EventStreamConfig: add staging page_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899)
[11:24:01] <wikibugs>	 (03PS1) 10Slyngshede: data.yaml: Tracking entry for guilherme [puppet] - 10https://gerrit.wikimedia.org/r/1148838
[11:24:52] <wikibugs>	 (03PS1) 10Brouberol: Duplicate partman/custom/kafka-jumbo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1148839 (https://phabricator.wikimedia.org/T377874)
[11:24:54] <wikibugs>	 (03PS1) 10Brouberol: partman: define a kafka-jumbo-ba recipe [puppet] - 10https://gerrit.wikimedia.org/r/1148840 (https://phabricator.wikimedia.org/T377874)
[11:25:18] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo)
[11:26:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Duplicate partman/custom/kafka-jumbo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1148839 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol)
[11:26:48] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10843676 (10cmooney) Also we can add the links to the CRs now:  |Switch|Port|CR|Port| |--------|-----|---|-----| |ssw1-e1-codfw|et-0/0/31|cr1-codfw|et-3/0/2| |ssw1-f1-codfw|et-0/0/31|cr2-codf...
[11:26:52] <wikibugs>	 (03CR) 10Arthur taylor: "ready for review. Should not be deployed until we have a confirmation about a go-live date for the change to test.wikidata.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669) (owner: 10Arthur taylor)
[11:27:07] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo)
[11:27:18] <wikibugs>	 (03PS2) 10Brouberol: Duplicate partman/custom/kafka-jumbo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1148839 (https://phabricator.wikimedia.org/T377874)
[11:27:18] <wikibugs>	 (03PS2) 10Brouberol: partman: define a kafka-jumbo-ba recipe [puppet] - 10https://gerrit.wikimedia.org/r/1148840 (https://phabricator.wikimedia.org/T377874)
[11:28:32] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:29:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76365 and previous config saved to /var/cache/conftool/dbconfig/20250521-112939-root.json
[11:33:03] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[11:33:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[11:33:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad
[11:35:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad
[11:37:20] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: push IPv6 address changes for codfw expansion link networks - cmooney@cumin1002"
[11:37:40] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: push IPv6 address changes for codfw expansion link networks - cmooney@cumin1002"
[11:37:41] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:37:51] <wikibugs>	 (03PS1) 10Cathal Mooney: Add new INCLUDE statement for 2620:0:860:139::/64 reverse [dns] - 10https://gerrit.wikimedia.org/r/1148842 (https://phabricator.wikimedia.org/T394021)
[11:42:09] <wikibugs>	 (03PS6) 10Fabfur: haproxy: use maxmind lua bindings to lookup client ISP [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219)
[11:44:25] <topranks>	 BGP alerts for the SSWs in codfw will probably land; nothing to worry about, I'm tidying it up now
[11:44:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76366 and previous config saved to /var/cache/conftool/dbconfig/20250521-114444-root.json
[11:46:39] <jinxer-wm>	 FIRING: [6x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:18b::2) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[11:50:08] <wikibugs>	 (03CR) 10Máté Szabó: [C:03+1] Add mediawiki.ForeignApi.core as a dependency [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran)
[11:51:39] <jinxer-wm>	 FIRING: [8x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:18b::2) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[11:53:40] <wikibugs>	 (03PS2) 10Clément Goubert: mw::maintenance: migrate continuousScan-commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan)
[11:54:04] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "nice!" [dns] - 10https://gerrit.wikimedia.org/r/1148842 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[11:55:01] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "Other scripts are going well, I think this can be migrated." [puppet] - 10https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan)
[11:55:22] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Add new INCLUDE statement for 2620:0:860:139::/64 reverse [dns] - 10https://gerrit.wikimedia.org/r/1148842 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[11:55:51] <logmsgbot>	 !log cmooney@dns2005 START - running authdns-update
[11:56:30] <logmsgbot>	 !log cmooney@dns2005 END - running authdns-update
[11:59:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2036 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76367 and previous config saved to /var/cache/conftool/dbconfig/20250521-115950-root.json
[12:03:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:04:29] <logmsgbot>	 brouberol@cumin2002 reimage (PID 3336134) is awaiting input
[12:15:55] <wikibugs>	 (03CR) 10Jforrester: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester)
[12:19:45] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "LGTM. Do we know when/why this started to be required?" [puppet] - 10https://gerrit.wikimedia.org/r/1148835 (https://phabricator.wikimedia.org/T394775) (owner: 10Majavah)
[12:20:17] <wikibugs>	 (03CR) 10FNegri: "Is this still related to T394775 or is it solving a different issue?" [puppet] - 10https://gerrit.wikimedia.org/r/1148836 (owner: 10Majavah)
[12:22:28] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146970 (https://phabricator.wikimedia.org/T392219) (owner: 10Fabfur)
[12:24:36] <wikibugs>	 (03CR) 10Majavah: [C:03+2] "Likely related to the new openstack authentication scope enforcement thing." [puppet] - 10https://gerrit.wikimedia.org/r/1148835 (https://phabricator.wikimedia.org/T394775) (owner: 10Majavah)
[12:25:08] <wikibugs>	 (03CR) 10Majavah: "Similar issue but in a different script." [puppet] - 10https://gerrit.wikimedia.org/r/1148836 (owner: 10Majavah)
[12:26:23] <wikibugs>	 (03PS1) 10Ayounsi: Disable pint alerting for SwitchCoreInterfaceDown [alerts] - 10https://gerrit.wikimedia.org/r/1148849 (https://phabricator.wikimedia.org/T388641)
[12:26:27] <wikibugs>	 (03CR) 10FNegri: [C:03+1] openstack: wmcs-webproxy: Pass project to keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/1148836 (owner: 10Majavah)
[12:26:37] <wikibugs>	 (03CR) 10Majavah: [C:03+2] openstack: wmcs-webproxy: Pass project to keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/1148836 (owner: 10Majavah)
[12:28:27] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: Add zarcillo k8s service on traffic server [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[12:30:21] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[12:30:24] <wikibugs>	 (03PS1) 10Hashar: Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666)
[12:30:37] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:04-1] "I think a better approach would be to edit modules/prometheus/templates/prometheus-apache-vhost.erb and add the `RewriteEngine on` directi" [puppet] - 10https://gerrit.wikimedia.org/r/1146973 (owner: 10Majavah)
[12:31:39] <jinxer-wm>	 FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:32:41] <wikibugs>	 (03PS2) 10Hashar: Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666)
[12:34:11] <logmsgbot>	 !log brouberol@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[12:34:17] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] grafana: Remove stunnel from quickdatacopy sync [puppet] - 10https://gerrit.wikimedia.org/r/1148468 (https://phabricator.wikimedia.org/T393738) (owner: 10Andrea Denisse)
[12:34:24] <wikibugs>	 10ops-eqiad, 06SRE, 10Data-Platform, 06DC-Ops, and 2 others: Q2:rack/setup/install kafka-jumbo10[16-18] - https://phabricator.wikimedia.org/T377874#10843893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin2002 for host kafka-jumbo1018.eqiad.wmnet with OS bullseye execu...
[12:36:09] <wikibugs>	 (03CR) 10Majavah: "> I think a better approach would be to edit modules/prometheus/templates/prometheus-apache-vhost.erb and add the RewriteEngine on directi" [puppet] - 10https://gerrit.wikimedia.org/r/1146973 (owner: 10Majavah)
[12:46:36] <wikibugs>	 (03PS3) 10Sbisson: Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970)
[12:46:41] <wikibugs>	 (03CR) 10Sbisson: Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson)
[12:47:35] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-05-14-112404 to 2025-05-20-173017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148857 (https://phabricator.wikimedia.org/T393631)
[12:48:25] <pmiazga>	 !log Ran fixStuckGlobalRename.php for T394905
[12:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:29] <stashbot>	 T394905: Unblock stuck global rename of 大筒木博人 - https://phabricator.wikimedia.org/T394905
[12:48:47] <logmsgbot>	 btullis@cumin1002 reimage (PID 19998) is awaiting input
[12:49:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148838 (owner: 10Slyngshede)
[12:49:36] <topranks>	 !log test new core_out bgp policy on asw1-bw27-esams (T394530)
[12:49:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:41] <stashbot>	 T394530: Homer: redefine IBGP definitions to support both Unicast & EVPN clusters - https://phabricator.wikimedia.org/T394530
[12:49:59] <wikibugs>	 (03PS1) 10Reedy: Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148859 (https://phabricator.wikimedia.org/T394814)
[12:50:07] <wikibugs>	 (03PS1) 10Reedy: Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148860 (https://phabricator.wikimedia.org/T394814)
[12:50:55] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] "Since it's not a paging alert, LGTM." [alerts] - 10https://gerrit.wikimedia.org/r/1148849 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[12:50:55] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc-misc2002.codfw.wmnet
[12:51:39] <jinxer-wm>	 FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:54:52] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] data.yaml: Tracking entry for guilherme [puppet] - 10https://gerrit.wikimedia.org/r/1148838 (owner: 10Slyngshede)
[12:56:36] <wikibugs>	 (03PS3) 10Cathal Mooney: New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021)
[12:56:39] <jinxer-wm>	 FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[12:57:17] <wikibugs>	 (03CR) 10Cathal Mooney: New device additions for codfw expansion plus policy changes (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[12:57:33] <wikibugs>	 (03CR) 10Ssingh: "Yes, you will." [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn)
[12:58:55] <Reedy>	 jouncebot: nowandnext
[12:58:56] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 1 minute(s)
[12:58:56] <jouncebot>	 In 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1300)
[12:59:24] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148859 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy)
[12:59:29] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148860 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1300).
[13:00:05] <jouncebot>	 Tran: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:01:01] <Tran>	 👋
[13:01:30] <Reedy>	 I misread that
[13:01:38] <Reedy>	 I guess I should do the window then :D
[13:02:03] <Reedy>	 Tran: Just to double check, it doesn't need to go into .1 too?
[13:02:12] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Add mediawiki.ForeignApi.core as a dependency [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran)
[13:02:43] <Tran>	 Just .2 as the patch it fixes was deployed as part of .2
[13:02:45] <Tran>	 Thank you!
[13:02:54] <Reedy>	 Ah, yeah https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/1138127 is only in .2
[13:02:56] <Reedy>	 Nice and easy then
[13:03:14] <wikibugs>	 (03CR) 10Reedy: [C:03+2] "For reference, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/1138127 only landed in .2" [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran)
[13:04:27] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS bookworm
[13:07:16] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "lgtm! the less prefix-list the better" [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[13:08:07] <wikibugs>	 (03PS1) 10Clément Goubert: P:kafka::broker: Set cpu governor to performance [puppet] - 10https://gerrit.wikimedia.org/r/1148862 (https://phabricator.wikimedia.org/T393513)
[13:08:22] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Disable pint alerting for SwitchCoreInterfaceDown [alerts] - 10https://gerrit.wikimedia.org/r/1148849 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[13:08:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:09:13] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[13:09:46] <wikibugs>	 (03Merged) 10jenkins-bot: Disable pint alerting for SwitchCoreInterfaceDown [alerts] - 10https://gerrit.wikimedia.org/r/1148849 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[13:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148859 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy)
[13:10:30] <wikibugs>	 (03Merged) 10jenkins-bot: New device additions for codfw expansion plus policy changes [homer/public] - 10https://gerrit.wikimedia.org/r/1147014 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[13:11:31] <wikibugs>	 (03Merged) 10jenkins-bot: Stop setting $wgCaptchaClass in extension.json files [extensions/ConfirmEdit] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148860 (https://phabricator.wikimedia.org/T394814) (owner: 10Reedy)
[13:11:32] <wikibugs>	 (03Merged) 10jenkins-bot: Add mediawiki.ForeignApi.core as a dependency [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148830 (https://phabricator.wikimedia.org/T387720) (owner: 10STran)
[13:11:36] <Reedy>	 There we go
[13:11:42] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] P:kafka::broker: Set cpu governor to performance [puppet] - 10https://gerrit.wikimedia.org/r/1148862 (https://phabricator.wikimedia.org/T393513) (owner: 10Clément Goubert)
[13:11:52] <Reedy>	 Tran: Do you need/want to test yours? Or just happy to let it go through?
[13:11:56] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] P:kafka::broker: Set cpu governor to performance [puppet] - 10https://gerrit.wikimedia.org/r/1148862 (https://phabricator.wikimedia.org/T393513) (owner: 10Clément Goubert)
[13:12:02] <Tran>	 I can test, please hold
[13:12:38] <Reedy>	 It's not ready to go yet ;)
[13:13:32] <jinxer-wm>	 FIRING: KubernetesCalicoDown: ml-serve1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:13:35] <Tran>	 whoops I just realized that 😅 but yes I'd like to test when it's up
[13:13:44] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1148859|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148860|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148830|Add mediawiki.ForeignApi.core as a dependency (T387720)]]
[13:13:49] <stashbot>	 T394814: Unable to have multiple captcha implementations in extension-list - https://phabricator.wikimedia.org/T394814
[13:13:49] <stashbot>	 T387720: Prefer a parameter over a configuration for importing translations in SecurePoll - https://phabricator.wikimedia.org/T387720
[13:14:16] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar)
[13:15:59] <logmsgbot>	 !log reedy@deploy1003 reedy, stran: Backport for [[gerrit:1148859|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148860|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148830|Add mediawiki.ForeignApi.core as a dependency (T387720)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:16:39] <jinxer-wm>	 FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:16:43] <Reedy>	 Tran: should be good to test now
[13:16:49] <logmsgbot>	 !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "import new switches from netbox to hiera now they are status active - cmooney@cumin1003 - T394021"
[13:16:52] <stashbot>	 T394021: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021
[13:16:55] <wikibugs>	 (03PS1) 10Btullis: Adapt dump scripts for running in containers [dumps] - 10https://gerrit.wikimedia.org/r/1148863 (https://phabricator.wikimedia.org/T394389)
[13:17:26] <wikibugs>	 (03PS1) 10Slyngshede: Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865
[13:17:37] <Tran>	 Confirmed working as expected 🎉
[13:17:43] <Reedy>	 sweet, mine looks good too
[13:17:44] <logmsgbot>	 !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "import new switches from netbox to hiera now they are status active - cmooney@cumin1003 - T394021"
[13:17:46] <logmsgbot>	 !log reedy@deploy1003 reedy, stran: Continuing with sync
[13:18:56] <wikibugs>	 (03PS2) 10Reedy: Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148398 (https://phabricator.wikimedia.org/T382148)
[13:19:29] <wikibugs>	 (03PS2) 10Slyngshede: Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865
[13:19:38] <wikibugs>	 (03PS3) 10Reedy: Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148398 (https://phabricator.wikimedia.org/T382148)
[13:19:49] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148398 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy)
[13:20:27] <wikibugs>	 (03PS3) 10Slyngshede: Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865
[13:20:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148398 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy)
[13:20:56] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: eventgate-main: Increase CPU limit to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148866 (https://phabricator.wikimedia.org/T393513)
[13:21:02] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage
[13:21:39] <jinxer-wm>	 FIRING: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:21:42] <wikibugs>	 (03PS4) 10Slyngshede: Signup: command for generating activation link [software/bitu] - 10https://gerrit.wikimedia.org/r/1148865
[13:22:29] <wikibugs>	 (03PS1) 10Cathal Mooney: Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1148867 (https://phabricator.wikimedia.org/T394021)
[13:23:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar)
[13:24:02] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1148867 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[13:24:26] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage
[13:24:28] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1148867 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[13:24:36] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148859|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148860|Stop setting $wgCaptchaClass in extension.json files (T394814)]], [[gerrit:1148830|Add mediawiki.ForeignApi.core as a dependency (T387720)]] (duration: 10m 52s)
[13:24:41] <stashbot>	 T394814: Unable to have multiple captcha implementations in extension-list - https://phabricator.wikimedia.org/T394814
[13:24:41] <stashbot>	 T387720: Prefer a parameter over a configuration for importing translations in SecurePoll - https://phabricator.wikimedia.org/T387720
[13:25:30] <wikibugs>	 (03PS1) 10Hashar: Gerrit 3.10.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148868 (https://phabricator.wikimedia.org/T390666)
[13:25:36] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1001.eqiad.wmnet
[13:25:40] <wikibugs>	 (03Merged) 10jenkins-bot: Codfw Spine EBGP: Fix typo in peer IP on ssw1-e1-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/1148867 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[13:25:57] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable edge uniques on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/1148869 (https://phabricator.wikimedia.org/T391411)
[13:26:09] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar)
[13:26:40] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1148398|Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" (T382148 T394814)]]
[13:26:44] <stashbot>	 T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148
[13:26:51] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::proxy: Remove unnecessary Prometheus term [puppet] - 10https://gerrit.wikimedia.org/r/1148870
[13:26:51] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::proxy: Listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575)
[13:27:10] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148869 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[13:28:32] <jinxer-wm>	 FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[13:29:46] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5637/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah)
[13:31:12] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Enable edge uniques on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/1148869 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[13:31:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] eventgate-main: Increase CPU limit to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148866 (https://phabricator.wikimedia.org/T393513) (owner: 10Alexandros Kosiaris)
[13:31:39] <jinxer-wm>	 RESOLVED: [9x] CoreBGPDown: Core BGP session down between ssw1-a1-codfw and ssw1-e1-codfw (2620:0:860:139::19) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[13:33:02] <wikibugs>	 (03Merged) 10jenkins-bot: eventgate-main: Increase CPU limit to 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148866 (https://phabricator.wikimedia.org/T393513) (owner: 10Alexandros Kosiaris)
[13:34:11] <wikibugs>	 (03CR) 10Vgutierrez: "same as with `is_alt_domain`, after a var.set() we should log its value with std.log()" [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall)
[13:34:22] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/1148869 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[13:36:26] <vgutierrez>	 !log enabling edge uniques on cp4045 - T391411
[13:36:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:30] <stashbot>	 T391411: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411
[13:37:09] <wikibugs>	 (03Merged) 10jenkins-bot: Merge tag 'v3.10.6' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1148851 (https://phabricator.wikimedia.org/T390666) (owner: 10Hashar)
[13:37:10] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet
[13:37:33] <akosiaris>	 !log deploy eventgate-main to pickup the CPU change as well as the change in envoy histogram buckets
[13:37:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:32] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: dse-k8s-worker1001.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-dse&var-instance=dse-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:39:57] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[13:40:33] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[13:41:10] <sukhe>	 !log updating dns-root-data on A:dnsbox
[13:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:13] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1001.eqiad.wmnet with OS bookworm
[13:41:25] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[13:41:33] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[13:42:23] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[13:42:56] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[13:43:13] <sukhe>	 !log updating dns-root-data on A:wikidough
[13:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:39] <wikibugs>	 (03PS2) 10Majavah: P:toolforge::proxy: Listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575)
[13:44:37] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Remove stunnel from quickdatacopy sync [puppet] - 10https://gerrit.wikimedia.org/r/1148468 (https://phabricator.wikimedia.org/T393738) (owner: 10Andrea Denisse)
[13:44:44] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5638/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah)
[13:45:34] <wikibugs>	 (03CR) 10Elukey: [C:03+2] conftool-data: add ml-serve1001 back to its pool after reimage [puppet] - 10https://gerrit.wikimedia.org/r/1148789 (https://phabricator.wikimedia.org/T387854) (owner: 10Elukey)
[13:45:58] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "Also checked that this works on existing IPv4-only hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah)
[13:46:30] <elukey>	 denisse: o/ ok to merge?
[13:46:49] <denisse>	 elukey: Yes, please. I can't ssh to puppetserver for some reason. :(
[13:47:05] <elukey>	 {{done}}
[13:47:10] <denisse>	 Thanks!!
[13:47:37] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough
[13:48:12] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox and not P{dns7001*} and A:dnsbox
[13:48:27] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve1001.eqiad.wmnet
[13:48:28] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve1001.eqiad.wmnet
[13:49:12] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[13:49:36] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[13:50:12] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=1; selector: name=ml-serve1001.eqiad.wmnet,dc=eqiad,cluster=maps,service=inference
[13:51:00] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable edge uniques in one server per DC and cluster [puppet] - 10https://gerrit.wikimedia.org/r/1148877 (https://phabricator.wikimedia.org/T391411)
[13:51:30] <wikibugs>	 (03CR) 10FNegri: [C:03+1] P:toolforge::proxy: Remove unnecessary Prometheus term [puppet] - 10https://gerrit.wikimedia.org/r/1148870 (owner: 10Majavah)
[13:51:34] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148877 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[13:51:50] <wikibugs>	 (03CR) 10FNegri: [C:03+1] P:exim::smarthost: Convert unsupported domain warn to reject [puppet] - 10https://gerrit.wikimedia.org/r/1137758 (https://phabricator.wikimedia.org/T366935) (owner: 10Majavah)
[13:52:35] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah)
[13:52:42] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: Remove unnecessary Prometheus term [puppet] - 10https://gerrit.wikimedia.org/r/1148870 (owner: 10Majavah)
[13:52:48] <wikibugs>	 (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge::proxy: Listen on IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/1148871 (https://phabricator.wikimedia.org/T211575) (owner: 10Majavah)
[13:52:59] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:exim::smarthost: Convert unsupported domain warn to reject [puppet] - 10https://gerrit.wikimedia.org/r/1137758 (https://phabricator.wikimedia.org/T366935) (owner: 10Majavah)
[13:53:53] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:54:16] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=1; selector: name=ml-serve1001.eqiad.wmnet
[13:56:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Move Kartotherian/staging to the new Bookworm nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565)
[13:56:52] <logmsgbot>	 !log reedy@deploy1003 reedy: Backport for [[gerrit:1148398|Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" (T382148 T394814)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:56:56] <stashbot>	 T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148
[13:56:57] <stashbot>	 T394814: Unable to have multiple captcha implementations in extension-list - https://phabricator.wikimedia.org/T394814
[13:57:04] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=ml-serve10.*.eqiad.wmnet
[13:57:12] <Reedy>	 this is taking a while
[13:57:21] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:57:59] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=ml-serve20.*.codfw.wmnet
[13:58:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1140498 (owner: 10Slyngshede)
[13:58:29] <logmsgbot>	 !log reedy@deploy1003 reedy: Continuing with sync
[13:59:27] <wikibugs>	 (03PS2) 10Muehlenhoff: Move Kartotherian/staging to the new Bookworm nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565)
[13:59:44] <wikibugs>	 (03PS1) 10Ssingh: dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default [puppet] - 10https://gerrit.wikimedia.org/r/1148883
[14:00:07] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1400)
[14:00:23] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5639/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh)
[14:00:45] <wikibugs>	 (03CR) 10Ssingh: dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh)
[14:01:00] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough
[14:02:31] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one typo inline" [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh)
[14:02:35] <sbassett>	 reedy: are you still doing backport stuff?
[14:03:17] <sbassett>	 we need to get an updated version of the 04 securepoll patch out to wmf.2 as it’s currently causing https://phabricator.wikimedia.org/T394900
[14:03:55] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Checked host names to ensure one per cluster and DC." [puppet] - 10https://gerrit.wikimedia.org/r/1148877 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[14:04:16] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1018.eqiad.wmnet with reason: host reimage
[14:05:01] <wikibugs>	 (03PS2) 10Ssingh: dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default [puppet] - 10https://gerrit.wikimedia.org/r/1148883
[14:06:18] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5640/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh)
[14:06:21] <wikibugs>	 (03CR) 10Ssingh: dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh)
[14:07:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh)
[14:07:30] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] dnsdist: set setMaxTCPQueriesPerConnection and a reasonable default [puppet] - 10https://gerrit.wikimedia.org/r/1148883 (owner: 10Ssingh)
[14:07:40] <Reedy>	 sbassett: it's doing a localisation rebuild, so taking a while
[14:07:44] <Reedy>	 14:07:42 K8s deployment progress:  57% (ok: 1420; fail: 0; left: 1058)
[14:08:03] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1018.eqiad.wmnet with reason: host reimage
[14:08:03] <Reedy>	 shouldn't be too much longer...
[14:08:21] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques in one server per DC and cluster [puppet] - 10https://gerrit.wikimedia.org/r/1148877 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez)
[14:08:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:57] <sukhe>	 !log running agent on A:wikidough
[14:09:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:12] <vgutierrez>	 !log enabling edge uniques in one server per DC and cluster (cp[1100-1101],cp[2027-2028],cp3074,cp[5017,5025],cp[6001,6009],cp[7001,7009])- T391411
[14:09:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:16] <stashbot>	 T391411: SDS 2.4.4 Edge Uniques Production Cookie Deployment - https://phabricator.wikimedia.org/T391411
[14:10:07] <sbassett>	 Reedy: ok - looks like the wikifunctions folks aren’t using their window rn, so we should be good to sec-deploy once you’re done.  Assuming that’s it?
[14:10:16] <Reedy>	 Yeah, I'm done
[14:10:21] <sbassett>	 Ok, thanks
[14:10:21] <Reedy>	 (when this is done)
[14:11:27] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough
[14:11:50] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:11:54] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:12:33] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148398|Revert^2 "extension-list: Add ConfirmEdit/hCaptcha/extension.json" (T382148 T394814)]] (duration: 45m 53s)
[14:12:38] <stashbot>	 T382148: Enable hCaptcha on test2wiki - https://phabricator.wikimedia.org/T382148
[14:12:38] <stashbot>	 T394814: Unable to have multiple captcha implementations in extension-list - https://phabricator.wikimedia.org/T394814
[14:12:43] <Reedy>	 sbassett: that's me clear now
[14:13:21] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10844388 (10Papaul) @cmooney thank you. I do not have any preference on how this is done. What works best for all is good with me.
[14:16:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10844404 (10GPSLeo) When is this expected to be solved? Because of this problem many important maintenance and monitoring tools are broken. This should have UBN priority.
[14:20:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[14:22:47] <wikibugs>	 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10844425 (10bking) Not to get too far off topic, but let me contextualize this.  >  > This is the first time I heard about that repository. Was the already exi...
[14:23:53] <wikibugs>	 (03PS3) 10Ayounsi: Icinga: remove some network devices checks [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641)
[14:24:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[14:24:42] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1018.eqiad.wmnet with OS bullseye
[14:24:53] <wikibugs>	 (03PS1) 10ZhaoFJx: arbcom_zhwiki: Change wgWhitelistRead Setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919)
[14:25:15] <wikibugs>	 (03PS3) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[14:25:20] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough
[14:25:40] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Upgrading to Java 11.0.27 - eevans@cumin1002
[14:26:49] <wikibugs>	 (03CR) 10Hoo man: [C:03+1] "Fine to deploy whenever we want." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669) (owner: 10Arthur taylor)
[14:26:49] <wikibugs>	 (03CR) 10Nikerabbit: [C:03+1] Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson)
[14:27:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[14:27:54] <wikibugs>	 (03PS1) 10CDobbins: replace X-WMF-UUID with vmod_var variable [puppet] - 10https://gerrit.wikimedia.org/r/1148889
[14:28:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri)
[14:29:07] <wikibugs>	 (03PS2) 10ZhaoFJx: arbcom_zhwiki: Change wgWhitelistRead Setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919)
[14:29:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on ms-fe2012:9290 - https://phabricator.wikimedia.org/T394901#10844461 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cord. alert cleared.
[14:30:34] <sbassett>	 !log Deployed updated security fix for T392341 (04) to 1.45-wmf.2
[14:30:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:51] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate continuousScan-commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/1140484 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan)
[14:31:54] <jhathaway>	 !incidents
[14:31:54] <sirenbot>	 6181 (ACKED)  kafka-jumbo1016/Kafka Broker Server (paged)
[14:31:54] <sirenbot>	 6182 (ACKED)  kafka-jumbo1017/Kafka Broker Server (paged)
[14:32:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[14:32:19] <jynus>	 I asked brouberol twice whether to resolve them
[14:32:41] <jhathaway>	 ok, so re-page from yesterday?
[14:32:46] <jynus>	 yeah
[14:33:04] <jynus>	 wanted the greenlight first
[14:33:16] <jynus>	 but it is what it is
[14:33:41] <jhathaway>	 :)
[14:36:19] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10844521 (10Jhancock.wm) I've connected the intel NIC via a 1000BASE-TX SFP to port 44 on the switch.  Let me know when you need it rem...
[14:36:31] <wikibugs>	 (03PS1) 10Ayounsi: Netops: remove check_bgp [puppet] - 10https://gerrit.wikimedia.org/r/1148891 (https://phabricator.wikimedia.org/T388641)
[14:37:55] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[14:37:56] <icinga-wm>	 PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:38:14] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148891 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[14:38:17] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[14:38:32] <Dreamy_Jazz>	 `1
[14:39:17] <Dreamy_Jazz>	 Fat fingers that I have....
[14:40:23] <logmsgbot>	 !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2200.codfw.wmnet,db1216.eqiad.wmnet with reason: Restart x3
[14:42:33] <wikibugs>	 (03CR) 10DLynch: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester)
[14:42:55] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox and not P{dns7001*} and A:dnsbox
[14:43:06] <moritzm>	 !log installing postgresql-15 security updates
[14:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:32] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[14:43:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host pc2018.codfw.wmnet with OS bookworm
[14:43:53] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host pc2018.codfw.wmnet with OS bookworm
[14:47:34] <wikibugs>	 (03PS1) 10Brouberol: airflow-dev: pull from the remote branch every 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148893 (https://phabricator.wikimedia.org/T393998)
[14:47:51] <wikibugs>	 (03PS1) 10ZhaoFJx: zh.arbcom: Enable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920)
[14:48:23] <wikibugs>	 (03CR) 10Jforrester: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester)
[14:49:09] <wikibugs>	 (03PS1) 10Elukey: kubernetes: add maps-test codfw as external service [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565)
[14:49:35] <wikibugs>	 (03PS2) 10ZhaoFJx: zh.arbcom: Enable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920)
[14:50:52] <wikibugs>	 (03CR) 10Elukey: Move Kartotherian/staging to the new Bookworm nodes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148881 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:51:22] <wikibugs>	 (03PS3) 10ZhaoFJx: arbcom_zhwiki: Enable local upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920)
[14:52:05] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5642/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[14:52:16] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx)
[14:52:25] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148894 (https://phabricator.wikimedia.org/T394920) (owner: 10ZhaoFJx)
[14:53:22] <wikibugs>	 (03CR) 10CDobbins: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148889 (owner: 10CDobbins)
[14:55:26] <wikibugs>	 (03CR) 10Muehlenhoff: kubernetes: add maps-test codfw as external service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148896 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey)
[14:55:54] <icinga-wm>	 RECOVERY - Host an-worker1068 is UP: PING WARNING - Packet loss = 33%, RTA = 930.22 ms
[14:58:06] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-dev: pull from the remote branch every 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148893 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol)
[14:58:06] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] mariadb: Change x2 and x3 ports to avoid conflicts with extra port [puppet] - 10https://gerrit.wikimedia.org/r/1148822 (https://phabricator.wikimedia.org/T384274) (owner: 10Jcrespo)
[14:59:18] <wikibugs>	 (03Abandoned) 10Brouberol: Duplicate partman/custom/kafka-jumbo.cfg [puppet] - 10https://gerrit.wikimedia.org/r/1148839 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol)
[14:59:24] <wikibugs>	 (03Abandoned) 10Brouberol: partman: define a kafka-jumbo-ba recipe [puppet] - 10https://gerrit.wikimedia.org/r/1148840 (https://phabricator.wikimedia.org/T377874) (owner: 10Brouberol)
[14:59:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc2018.codfw.wmnet with reason: host reimage
[15:00:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10844697 (10Marostegui) >>! In T394624#10844404, @GPSLeo wrote: > When is this expected to be solved? Because of this problem many important maintenance and monitoring tools are broken. This should have...
[15:02:39] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc2018.codfw.wmnet with reason: host reimage
[15:03:03] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1068 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[15:05:36] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Adapt dump scripts for running in containers [dumps] - 10https://gerrit.wikimedia.org/r/1148863 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis)
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:27] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] airflow-dev: pull from the remote branch every 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148893 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol)
[15:08:30] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-dev: pull from the remote branch every 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148893 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol)
[15:09:39] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] hiera: Add zarcillo k8s service on traffic server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135387 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[15:12:56] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Adapt dump scripts for running in containers [dumps] - 10https://gerrit.wikimedia.org/r/1148863 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis)
[15:15:41] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:16:30] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[15:16:47] <wikibugs>	 (03CR) 10BryanDavis: [C:03+1] "Should I cherry-pick this in Beta to prove that it works there?" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah)
[15:18:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[15:18:57] <wikibugs>	 (03CR) 10Stang: arbcom_zhwiki: Change wgWhitelistRead Setting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx)
[15:19:11] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[15:20:45] <wikibugs>	 (03CR) 10Eamedina: [C:03+1] Remove unused wgContentTranslationEnableSectionTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: 10Sbisson)
[15:21:20] <wikibugs>	 (03CR) 10Ladsgroup: openstack: wikireplica_dns: Point x3 records to new VIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148313 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah)
[15:21:58] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[15:21:59] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2018.codfw.wmnet with OS bookworm
[15:22:04] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host pc2018.codfw.wmnet with OS bookworm completed: - pc2018 (**PASS**)   - Remov...
[15:22:34] <wikibugs>	 (03PS1) 10Cathal Mooney: Switch IBGP Policy: set next-hop self on routes being originated [homer/public] - 10https://gerrit.wikimedia.org/r/1148898 (https://phabricator.wikimedia.org/T394530)
[15:22:49] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844773 (10Jhancock.wm) 05Open→03Resolved
[15:23:01] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844777 (10Jhancock.wm) @Marostegui this is completed.
[15:23:58] <wikibugs>	 (03CR) 10FNegri: [C:03+1] openstack: wikireplica_dns: Point x3 records to new VIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148313 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah)
[15:24:30] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install pc2018 - https://phabricator.wikimedia.org/T393110#10844781 (10Ladsgroup) Thank you!
[15:25:07] <jynus>	 !log forgetting 4 old instances @ orchestrator-web T384274
[15:25:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:11] <stashbot>	 T384274: Backups for x3 - https://phabricator.wikimedia.org/T384274
[15:25:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[15:26:47] <wikibugs>	 (03PS1) 10Majavah: toolforge: toolviews: Do not log secret changes [puppet] - 10https://gerrit.wikimedia.org/r/1148899
[15:27:13] <wikibugs>	 (03PS4) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[15:28:32] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:30:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri)
[15:31:57] <wikibugs>	 (03PS3) 10ZhaoFJx: arbcom_zhwiki: Change wgWhitelistRead Setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919)
[15:32:01] <wikibugs>	 (03CR) 10ZhaoFJx: arbcom_zhwiki: Change wgWhitelistRead Setting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx)
[15:33:01] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Switch IBGP Policy: set next-hop self on routes being originated [homer/public] - 10https://gerrit.wikimedia.org/r/1148898 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[15:34:16] <wikibugs>	 (03CR) 10FNegri: [C:03+1] toolforge: toolviews: Do not log secret changes [puppet] - 10https://gerrit.wikimedia.org/r/1148899 (owner: 10Majavah)
[15:37:20] <wikibugs>	 (03PS5) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[15:40:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri)
[15:40:26] <wikibugs>	 (03PS1) 10Effie Mouzeli: memcached: add option to switch to the  performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881)
[15:41:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] memcached: add option to switch to the  performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli)
[15:41:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:43:35] <wikibugs>	 (03CR) 10Stang: [C:03+1] arbcom_zhwiki: Change wgWhitelistRead Setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148888 (https://phabricator.wikimedia.org/T394919) (owner: 10ZhaoFJx)
[15:48:26] <wikibugs>	 (03CR) 10Xcollazo: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena)
[15:48:26] <wikibugs>	 (03PS2) 10Effie Mouzeli: memcached: add option to switch to the  performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881)
[15:49:14] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki::memcached enable performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148901 (https://phabricator.wikimedia.org/T371881)
[15:49:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] memcached: add option to switch to the  performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli)
[15:49:36] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki::memcached enable performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148901 (https://phabricator.wikimedia.org/T371881)
[15:50:25] <wikibugs>	 (03CR) 10Federico Ceratto: "Ok, tracking progress in https://phabricator.wikimedia.org/T394884" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[15:51:19] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena)
[15:52:20] <wikibugs>	 (03PS3) 10Effie Mouzeli: memcached: add option to switch to the  performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881)
[15:52:58] <logmsgbot>	 jhancock@cumin2002 provision (PID 3468025) is awaiting input
[15:53:09] <wikibugs>	 (03PS4) 10Effie Mouzeli: memcached: add option to switch to the  performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881)
[15:55:24] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10844927 (10Dzahn)
[15:55:53] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, and 2 others: eqiad/codfw: 6 VM request for Zuul upgrade project - https://phabricator.wikimedia.org/T393873#10844928 (10Dzahn) 05Open→03Resolved 6 VMs have been created:  2 VMs - main zuul (8GB)      zuul1001     zuul200...
[15:56:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] memcached: add option to switch to the  performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli)
[15:56:56] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync
[15:56:59] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync
[15:57:37] <wikibugs>	 (03CR) 10Majavah: [C:03+2] toolforge: toolviews: Do not log secret changes [puppet] - 10https://gerrit.wikimedia.org/r/1148899 (owner: 10Majavah)
[15:58:42] <wikibugs>	 (03CR) 10Joal: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148818 (https://phabricator.wikimedia.org/T394899) (owner: 10Gmodena)
[16:01:29] <wikibugs>	 (03PS1) 10Dzahn: site: separate zuul regex, make it clear what is doing what [puppet] - 10https://gerrit.wikimedia.org/r/1148902 (https://phabricator.wikimedia.org/T393873)
[16:02:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Supermicro: test if Intel card exhibits the same cold boot behavior - https://phabricator.wikimedia.org/T394847#10844977 (10RobH)  > I kind of like this idea, but it might complicate the reimage process.  So probably the easiest thing is: >  > * C...
[16:02:09] <wikibugs>	 (03CR) 10Dzahn: "also avoids the string "zuul3"" [puppet] - 10https://gerrit.wikimedia.org/r/1148902 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn)
[16:02:16] <wikibugs>	 (03PS5) 10Effie Mouzeli: memcached: add option to switch to the  performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881)
[16:02:37] <icinga-wm>	 RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms
[16:03:03] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[16:03:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:03:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm
[16:03:47] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10844995 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm
[16:04:19] <icinga-wm>	 PROBLEM - Host cirrussearch2079 is DOWN: PING CRITICAL - Packet loss = 100%
[16:05:43] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate all refreshLinkRecommendations jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1143529 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[16:05:55] <icinga-wm>	 RECOVERY - Host cirrussearch2079 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms
[16:07:27] <icinga-wm>	 PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:09] <wikibugs>	 (03CR) 10Dzahn: "yes, PTR records needed for sure.. I basically just forgot to amend one more time.. doing" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn)
[16:08:12] <wikibugs>	 (03PS4) 10Scott French: P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534)
[16:08:45] <icinga-wm>	 RECOVERY - Host sretest2001 is UP: PING OK - Packet loss = 0%, RTA = 30.70 ms
[16:09:27] <wikibugs>	 (03PS6) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[16:11:30] <wikibugs>	 (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French)
[16:11:48] <wikibugs>	 (03PS1) 10Ssingh: templates: lower TTLs for dyna.wm.org and upload.wm.org to 210s. [dns] - 10https://gerrit.wikimedia.org/r/1148904 (https://phabricator.wikimedia.org/T394312)
[16:11:52] <wikibugs>	 (03CR) 10Dzahn: "well.. ACTUALLY.. multiple A records for one IP is considered standard but multiple PTR records for one IP is considered "not recommended"" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn)
[16:12:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637) (owner: 10FNegri)
[16:12:40] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[16:12:46] <hashar>	 jouncebot: nowandnext
[16:12:46] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 47 minute(s)
[16:12:46] <jouncebot>	 In 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1700)
[16:13:06] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[16:16:30] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "Completely optional nit, up to you, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1148900 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli)
[16:16:41] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] mediawiki::memcached enable performance cpu governor [puppet] - 10https://gerrit.wikimedia.org/r/1148901 (https://phabricator.wikimedia.org/T371881) (owner: 10Effie Mouzeli)
[16:17:06] <wikibugs>	 (03CR) 10Dzahn: "https://serverfault.com/questions/618700/why-multiple-ptr-records-in-dns-is-not-recommended" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn)
[16:17:50] <wikibugs>	 (03PS1) 10Effie Mouzeli: WIP: profile::kubernetes::node: Add script to pull and mount latest mw [puppet] - 10https://gerrit.wikimedia.org/r/1148905 (https://phabricator.wikimedia.org/T276994)
[16:18:11] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French)
[16:18:18] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French)
[16:19:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: profile::kubernetes::node: Add script to pull and mount latest mw [puppet] - 10https://gerrit.wikimedia.org/r/1148905 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli)
[16:25:28] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Switch IBGP Policy: set next-hop self on routes being originated [homer/public] - 10https://gerrit.wikimedia.org/r/1148898 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[16:25:43] <wikibugs>	 (03PS3) 10Cathal Mooney: Network: add puppet data for new devices and networks codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1145194 (https://phabricator.wikimedia.org/T394021)
[16:26:40] <wikibugs>	 (03Merged) 10jenkins-bot: Switch IBGP Policy: set next-hop self on routes being originated [homer/public] - 10https://gerrit.wikimedia.org/r/1148898 (https://phabricator.wikimedia.org/T394530) (owner: 10Cathal Mooney)
[16:33:17] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Network: add puppet data for new devices and networks codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1145194 (https://phabricator.wikimedia.org/T394021) (owner: 10Cathal Mooney)
[16:35:02] <wikibugs>	 (03PS1) 10DLynch: Extend the mobile insert menu config so that tools can be specified [extensions/VisualEditor] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148908 (https://phabricator.wikimedia.org/T388604)
[16:35:18] <wikibugs>	 (03PS1) 10DLynch: Extend the mobile insert menu config so that tools can be specified [extensions/VisualEditor] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148909 (https://phabricator.wikimedia.org/T388604)
[16:37:01] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/VisualEditor] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148909 (https://phabricator.wikimedia.org/T388604) (owner: 10DLynch)
[16:37:10] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/VisualEditor] (wmf/1.45.0-wmf.1) - 10https://gerrit.wikimedia.org/r/1148908 (https://phabricator.wikimedia.org/T388604) (owner: 10DLynch)
[16:37:23] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: 10Jforrester)
[16:37:34] <wikibugs>	 (03CR) 10Dzahn: "recently on #dns" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn)
[16:38:02] <wikibugs>	 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops: Re-opening our DMarcian Trial - https://phabricator.wikimedia.org/T394788#10845153 (10AKanji-WMF)
[16:40:58] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Upgrading to Java 11.0.27 - eevans@cumin1002
[16:41:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:41:31] <wikibugs>	 (03CR) 10Btullis: [C:03+1] snapshot: Remove now unused Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1148751 (https://phabricator.wikimedia.org/T394647) (owner: 10Muehlenhoff)
[16:42:56] <wikibugs>	 (03CR) 10Dzahn: "yea.. so since/if I am supposed to pick a single PTR record.. that would mean I just use the existing one and this change is ok as is." [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn)
[16:45:48] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium-restart
[16:45:59] <logmsgbot>	 !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.sanitarium-restart (exit_code=97)
[16:46:14] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium-restart
[16:46:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:46:27] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium-restart (exit_code=99)
[16:46:37] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.sanitarium-restart
[16:46:50] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.sanitarium-restart (exit_code=99)
[16:47:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:52:15] <jinxer-wm>	 FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[16:55:00] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "yes, your observations are correct and at least I certainly missed the fact that you already have a PTR for the "primary" record and that's" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn)
[16:56:25] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "$ dig -x 208.80.154.151 +short" [dns] - 10https://gerrit.wikimedia.org/r/1148438 (https://phabricator.wikimedia.org/T394271) (owner: 10Dzahn)
[16:57:08] <wikibugs>	 (03PS1) 10Hnowlan: mw::maintenance: migrate all remaining growthexperiments jobs [puppet] - 10https://gerrit.wikimedia.org/r/1148914 (https://phabricator.wikimedia.org/T385782)
[16:57:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[17:00:05] <jouncebot>	 swfrench-wmf: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1700).
[17:00:29] <swfrench-wmf>	 o/ I'll get started in a couple of minutes
[17:01:15] <jinxer-wm>	 FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[17:05:02] <wikibugs>	 (03PS1) 10Andrea Denisse: grafana: Disable dashboard sync to upgrade Grafana version [puppet] - 10https://gerrit.wikimedia.org/r/1148915 (https://phabricator.wikimedia.org/T394470)
[17:06:15] <jinxer-wm>	 RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[17:08:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:09:38] <wikibugs>	 (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5643/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148915 (https://phabricator.wikimedia.org/T394470) (owner: 10Andrea Denisse)
[17:09:56] <wikibugs>	 (03PS2) 10Dzahn: lists: include nftables throttling profile [puppet] - 10https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519)
[17:10:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "start simple.. then enable it" [puppet] - 10https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) (owner: 10Dzahn)
[17:13:02] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3486418) is awaiting input
[17:20:54] <wikibugs>	 (03CR) 10Scott French: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French)
[17:20:56] <wikibugs>	 (03CR) 10Scott French: [C:03+2] P:mw:maint:update_special_pages: all updateSpecialPages shards to mw-cron [puppet] - 10https://gerrit.wikimedia.org/r/1146789 (https://phabricator.wikimedia.org/T388534) (owner: 10Scott French)
[17:27:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10845246 (10VRiley-WMF) 05Open→03In progress Taking this unit down for the memory swap.
[17:28:32] <jinxer-wm>	 FIRING: [4x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[17:29:26] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Add cirrussearch row C/remove elastic row D [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610)
[17:29:39] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[17:33:17] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[17:33:29] <logmsgbot>	 !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[17:37:14] <swfrench-wmf>	 I'm done with the infra window
[17:41:13] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw::maintenance: migrate all remaining growthexperiments jobs [puppet] - 10https://gerrit.wikimedia.org/r/1148914 (https://phabricator.wikimedia.org/T385782) (owner: 10Hnowlan)
[17:41:22] <wikibugs>	 (03PS1) 10Dzahn: role: delete requesttracker role [puppet] - 10https://gerrit.wikimedia.org/r/1148923
[17:41:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10845303 (10VRiley-WMF) 05In progress→03Resolved This is completed
[17:42:15] <wikibugs>	 (03PS2) 10Dzahn: role: delete requesttracker role [puppet] - 10https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777)
[17:45:53] <wikibugs>	 (03CR) 10Btullis: [C:03+1] cirrussearch: Add cirrussearch row C/remove elastic row D [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[17:48:26] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T394894#10845334 (10VRiley-WMF)
[17:51:28] <Dreamy_Jazz>	 jouncebot: nowandnext
[17:51:28] <jouncebot>	 For the next 0 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T1700)
[17:51:29] <jouncebot>	 In 2 hour(s) and 8 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T2000)
[17:51:44] <wikibugs>	 (03PS1) 10Dreamy Jazz: Support creating logs in emptyUserGroup.php [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148924 (https://phabricator.wikimedia.org/T394914)
[17:52:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[17:54:57] <Dreamy_Jazz>	 Anyone mind if I deploy?
[17:55:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148924 (https://phabricator.wikimedia.org/T394914) (owner: 10Dreamy Jazz)
[17:57:03] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:57:08] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:57:19] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:57:30] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:57:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:57:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:58:03] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add cirrussearch row C/remove elastic row D [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[17:58:04] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T394894#10845357 (10VRiley-WMF)
[17:58:08] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[17:58:37] <wikibugs>	 (03CR) 10Muehlenhoff: "Also needs to be dropped from profile::prometheus::ops" [puppet] - 10https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn)
[18:01:21] <wikibugs>	 (03PS7) 10FNegri: wikireplicas scripts: setup pytest, add first test [puppet] - 10https://gerrit.wikimedia.org/r/1148394 (https://phabricator.wikimedia.org/T351637)
[18:01:56] <jinxer-wm>	 FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ...
[18:02:01] <jinxer-wm>	 Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[18:02:07] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:02:23] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:02:34] <jinxer-wm>	 RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:02:44] <jinxer-wm>	 RESOLVED: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:02:56] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[18:03:08] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:03:14] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:03:30] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[18:06:56] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ...
[18:07:01] <jinxer-wm>	 Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[18:08:06] <wikibugs>	 (03Merged) 10jenkins-bot: Support creating logs in emptyUserGroup.php [core] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1148924 (https://phabricator.wikimedia.org/T394914) (owner: 10Dreamy Jazz)
[18:08:30] <logmsgbot>	 !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1148924|Support creating logs in emptyUserGroup.php (T394914)]]
[18:08:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:08:34] <stashbot>	 T394914: Update emptyUserGroup.php to optionally support creating log entries for removal - https://phabricator.wikimedia.org/T394914
[18:09:01] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission thanos-fe100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T394894#10845376 (10VRiley-WMF) 05Open→03Resolved
[18:10:50] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1148924|Support creating logs in emptyUserGroup.php (T394914)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[18:12:00] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] site: separate zuul regex, make it clear what is doing what [puppet] - 10https://gerrit.wikimedia.org/r/1148902 (https://phabricator.wikimedia.org/T393873) (owner: 10Dzahn)
[18:13:55] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: Add cirrussearch row C/remove elastic row D [puppet] - 10https://gerrit.wikimedia.org/r/1148922 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[18:14:35] <wikibugs>	 (03PS1) 10Dzahn: zuul: create basic role/profile for zuul::man and install docker.io [puppet] - 10https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873)
[18:14:49] <logmsgbot>	 !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync
[18:15:21] <mutante>	 yes, it's ok to merge multiple ;)
[18:17:54] <wikibugs>	 (03PS7) 10BCornwall: varnish: Replace date/stamp headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550)
[18:17:59] <wikibugs>	 (03CR) 10BCornwall: "I had done that originally but removed it in PS6 because the values are output in the `RespHeader`. For example:" [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall)
[18:21:48] <logmsgbot>	 !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148924|Support creating logs in emptyUserGroup.php (T394914)]] (duration: 13m 18s)
[18:21:52] <stashbot>	 T394914: Update emptyUserGroup.php to optionally support creating log entries for removal - https://phabricator.wikimedia.org/T394914
[18:22:53] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bookworm
[18:23:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10845407 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm executed with errors: - sretest2003 (...
[18:23:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10845409 (10Ladsgroup) >>! In T394624#10845303, @VRiley-WMF wrote: > This is completed  Thanks!  I started the mariadb daemons.
[18:23:22] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=no:weight=10; selector: name=elastic1060.eqiad.wmnet|name=elastic1061.eqiad.wmnet|name=elastic1062.eqiad.wmnet|name=elastic1063.eqiad.wmnet|name=elastic1064.eqiad.wmnet|name=elastic1065.eqiad.wmnet|name=elastic1066.eqiad.wmnet|name=elastic1067.eqiad.wmnet|name=elastic1103.eqiad.wmnet
[18:23:25] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s7 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:31] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:32] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[18:23:41] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:41] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s2 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:53] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on clouddb1019 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:53] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s7 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:53] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 on clouddb1019 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:53] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s2 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:55] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on clouddb1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:55] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s7 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:55] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s4 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:23:55] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s2 on an-redacteddb1001 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:26:15] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1155 HW memory errors - https://phabricator.wikimedia.org/T394624#10845425 (Ladsgroup) Start replication.
[18:30:59] <wikibugs>	 (PS2) Dzahn: zuul: create basic role/profile for zuul::man and install docker.io [puppet] - https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873)
[18:32:33] <wikibugs>	 (PS3) Dzahn: role: delete requesttracker role [puppet] - https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777)
[18:32:47] <wikibugs>	 (CR) Dzahn: "oh! thanks. done!" [puppet] - https://gerrit.wikimedia.org/r/1148923 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[18:35:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1103 to cirrussearch1103
[18:35:40] <wikibugs>	 (PS3) Dzahn: zuul: create basic role/profile for zuul::man and install docker.io [puppet] - https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873)
[18:35:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[18:36:12] <wikibugs>	 (PS4) Dzahn: zuul: create role/profile for new zuul main servers, install docker.io [puppet] - https://gerrit.wikimedia.org/r/1148930 (https://phabricator.wikimedia.org/T393873)
[18:38:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1103 to cirrussearch1103 - bking@cumin2002"
[18:39:41] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1103 to cirrussearch1103 - bking@cumin2002"
[18:39:41] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:39:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1103 on all recursors
[18:39:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1103 on all recursors
[18:39:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1103
[18:40:56] <wikibugs>	 SRE, SRE-Access-Requests, collaboration-services, Continuous-Integration-Infrastructure (Zuul upgrade): create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10845468 (Dzahn) Fair enough. I'd be fine just using contint-roots. Though decom'ing groups is also not a big...
[18:42:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1103
[18:42:52] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1103 to cirrussearch1103
[18:43:32] <jinxer-wm>	 FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[18:45:00] <wikibugs>	 SRE, SRE-Access-Requests, collaboration-services, Continuous-Integration-Infrastructure (Zuul upgrade): create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10845486 (Dzahn) ftr, the existing contint servers have all of these:   ` profile::admin::groups:   - contint-...
[18:45:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye
[18:45:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103
[18:45:20] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103
[18:52:08] <wikibugs>	 (PS1) Dzahn: zuul: add contint-roots admin group to new zuul::main role [puppet] - https://gerrit.wikimedia.org/r/1148937 (https://phabricator.wikimedia.org/T394819)
[18:53:45] <wikibugs>	 (PS1) Jdlrobson: Fixes: TypeError: Cannot read properties of undefined (reading 'contains') [core] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1148938
[18:53:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [core] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1148938 (owner: Jdlrobson)
[18:54:42] <wikibugs>	 (PS1) Jdlrobson: bookmark: Fix click event not working [extensions/ReadingLists] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1148939 (https://phabricator.wikimedia.org/T394736)
[18:54:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 21 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ReadingLists] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1148939 (https://phabricator.wikimedia.org/T394736) (owner: Jdlrobson)
[18:56:52] <wikibugs>	 SRE, SRE-Access-Requests, collaboration-services, Continuous-Integration-Infrastructure (Zuul upgrade), Patch-For-Review: create new admin group for "zuul devs" - https://phabricator.wikimedia.org/T394819#10845524 (Dzahn) @thcipriani As the existing "approval"-person for the contint-roots rol...
[19:01:19] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1146020 (https://phabricator.wikimedia.org/T389970) (owner: Sbisson)
[19:10:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1066 to cirrussearch1066
[19:10:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:13:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:15:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1066 to cirrussearch1066 - bking@cumin2002"
[19:16:22] <logmsgbot>	 jhancock@cumin2002 provision (PID 3570058) is awaiting input
[19:17:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:17:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:17:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:18:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1066 to cirrussearch1066 - bking@cumin2002"
[19:18:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:18:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1066 on all recursors
[19:18:15] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10845555 (Papaul) ` Case 2025-0520-703157 has been updated by Mathias Zuniga   UPDATE HAS BEEN ADDED:   Hello Team,  Please could you bring me the following com...
[19:18:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1066 on all recursors
[19:18:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1066
[19:18:27] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:18:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[19:19:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1066
[19:19:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:20:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1066 to cirrussearch1066
[19:21:07] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:21:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm
[19:21:46] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10845559 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm
[19:24:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1066.eqiad.wmnet with OS bullseye
[19:24:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1066
[19:24:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1066
[19:24:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm
[19:24:56] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845567 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2004.codfw.wmnet with OS bookworm
[19:28:32] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[19:35:09] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s6 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:35:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s6 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:35:53] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s6 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:41:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1066.eqiad.wmnet with reason: host reimage
[19:43:35] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1103.eqiad.wmnet with OS bullseye
[19:45:15] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1066.eqiad.wmnet with reason: host reimage
[19:50:04] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Upgrading to Java 11.0.27 - eevans@cumin1002
[19:51:26] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:51:32] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:51:32] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:52:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage
[19:52:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye
[19:52:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103
[19:52:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103
[19:56:15] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage
[19:56:45] <wikibugs>	 (PS1) Arlolra: Remove $wgParserEnableLegacyMediaDOM option [mediawiki-config] - https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054)
[19:58:40] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch1080.eqiad.wmnet|name=cirrussearch1081.eqiad.wmnet|name=cirrussearch1082.eqiad.wmnet|name=cirrussearch1083.eqiad.wmnet|name=cirrussearch1087.eqiad.wmnet|name=cirrussearch1088.eqiad.wmnet|name=cirrussearch1118.eqiad.wmnet|name=cirrussearch1119.eqiad.wmnet
[19:59:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T2000).
[20:00:05] <jouncebot>	 ZhaoFJx, Tchanders, Kemayo, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <Kemayo>	 o/
[20:00:10] <ZhaoFJx>	 Just on time
[20:00:46] <Tchanders>	 o/
[20:02:24] <Kemayo>	 My three need to all be deployed together (two backports and a config-change that makes them have an effect). I don't mind spiderpigging them myself.
[20:02:27] <wikibugs>	 (PS1) Jforrester: [wikifunctions] Don't grant new generic-enum rights to Functioneers for now [mediawiki-config] - https://gerrit.wikimedia.org/r/1148951 (https://phabricator.wikimedia.org/T391913)
[20:02:35] <Jdlrobson>	 o/
[20:03:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[20:03:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:04:23] <Tchanders>	 Kemayo: Are you going first?
[20:04:52] <Kemayo>	 Tchanders: Sure, I can. I was going to wait and see whether a deployer was going to show up first, but I don't mind just doing it.
[20:05:42] <Tchanders>	 Ah. I haven't seen a deployer at one of these for a while, but then I haven't done this time slot for a while...
[20:07:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1148909 (https://phabricator.wikimedia.org/T388604) (owner: DLynch)
[20:07:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1148908 (https://phabricator.wikimedia.org/T388604) (owner: DLynch)
[20:07:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kemayo@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: Jforrester)
[20:07:55] <Kemayo>	 Tchanders: I'll admit, I haven't done a backport myself since before spiderpig came about, so I'm not 100% sure what the etiquette is these days.
[20:08:25] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10845666 (Jclark-ctr)
[20:08:32] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[20:09:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic1065 to cirrussearch1065
[20:10:08] <James_F>	 Kemayo: "Do as many patches as reasonable in one deploy, because they each take ~15 minutes at best".
[20:11:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1066.eqiad.wmnet with OS bullseye
[20:12:47] <James_F>	 I can do the deploy, I suppose.
[20:13:15] <James_F>	 Oh, Kemayo is already on it, never mind.
[20:13:59] <Kemayo>	 It's even the deploy to unblock *you*. :D
[20:14:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[20:14:05] <logmsgbot>	 jclark@cumin1002 netbox (PID 138500) is awaiting input
[20:14:14] <James_F>	 Yes yes, hence why I was going to do the deploy for you rather than have a very late lunch.
[20:14:19] <James_F>	 But given that, bye. :-)
[20:15:00] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1006,1007 - jclark@cumin1002"
[20:15:24] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1006,1007 - jclark@cumin1002"
[20:15:24] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:18:17] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1065 to cirrussearch1065 - bking@cumin2002"
[20:18:23] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic1065 to cirrussearch1065 - bking@cumin2002"
[20:18:23] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:18:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch1065 on all recursors
[20:18:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch1065 on all recursors
[20:18:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch1065
[20:18:55] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:19:12] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[20:19:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch1065
[20:19:55] <wikibugs>	 (Merged) jenkins-bot: Extend the mobile insert menu config so that tools can be specified [extensions/VisualEditor] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1148909 (https://phabricator.wikimedia.org/T388604) (owner: DLynch)
[20:19:56] <wikibugs>	 (Merged) jenkins-bot: Extend the mobile insert menu config so that tools can be specified [extensions/VisualEditor] (wmf/1.45.0-wmf.1) - https://gerrit.wikimedia.org/r/1148908 (https://phabricator.wikimedia.org/T388604) (owner: DLynch)
[20:20:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic1065 to cirrussearch1065
[20:20:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:20:46] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1065.eqiad.wmnet with OS bullseye
[20:20:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1065
[20:20:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1065
[20:22:39] <logmsgbot>	 jclark@cumin1002 provision (PID 139211) is awaiting input
[20:23:24] <logmsgbot>	 jclark@cumin1002 provision (PID 139216) is awaiting input
[20:23:41] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3574856) is awaiting input
[20:25:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:28:47] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3573569) is awaiting input
[20:30:51] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[20:31:26] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[20:34:24] <jinxer-wm>	 RESOLVED: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (27.111.228.81) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit4&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[20:34:43] <wikibugs>	 (PS2) Jforrester: wikifunctions: Update orchestrator from 2025-05-14-112404 to 2025-05-21-192453 [deployment-charts] - https://gerrit.wikimedia.org/r/1148857 (https://phabricator.wikimedia.org/T385899)
[20:34:46] <wikibugs>	 (PS1) Jforrester: wikifunctions: Update evaluators from 2025-05-12-235119 to 2025-05-21-192515 [deployment-charts] - https://gerrit.wikimedia.org/r/1148954 (https://phabricator.wikimedia.org/T385899)
[20:36:32] <Kemayo>	 Quick update: two of my patches merged, but I'm still waiting for the third one to finish.
[20:37:59] <Kemayo>	 Actually... I wonder if something is wedged out of position. The +2 is there on the patch, but no gate-and-submit.
[20:38:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch1065.eqiad.wmnet with reason: host reimage
[20:39:45] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:39:46] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2004.codfw.wmnet with OS bookworm
[20:39:50] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2004.codfw.wmnet with OS bookworm completed: - sretest2004 (**PASS**)...
[20:40:22] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845770 (Jhancock.wm)
[20:41:13] <Tchanders>	 Kemayo: Thanks for the update. (Also it's been easy to watch along via Spiderpig so thanks RelEng)
[20:41:37] <Tchanders>	 I'll bow out of this deployment window, since it's getting late here and not looking likely we'll get round to my patches
[20:41:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch1065.eqiad.wmnet with reason: host reimage
[20:41:53] <Kemayo>	 Sorry about it taking so long
[20:42:07] <logmsgbot>	 jclark@cumin1002 provision (PID 139211) is awaiting input
[20:43:01] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845780 (Jhancock.wm) @RobH  It passed reimaging with UEFI. You do have to turn some things off that might be a security issue. We discussed it in the last dcops meeting. The issue Joh...
[20:43:05] <wikibugs>	 (CR) Jforrester: [C:+2] "…" [mediawiki-config] - https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: Jforrester)
[20:43:14] <James_F>	 Kemayo: I've re-triggered it
[20:43:52] <wikibugs>	 (Merged) jenkins-bot: VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1148422 (https://phabricator.wikimedia.org/T388604) (owner: Jforrester)
[20:44:09] <Kemayo>	 James_F: thanks! Was just commenting on the patch enough, or did it need to be by someone who had +2 on the repo?
[20:44:23] <logmsgbot>	 !log kemayo@deploy1003 Started scap sync-world: Backport for [[gerrit:1148909|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148908|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148422|VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (T388604)]]
[20:44:27] <stashbot>	 T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604
[20:44:39] <James_F>	 Kemayo: Needed to be a C+2 comment.
[20:45:00] <James_F>	 Kemayo: But anyone with spiderpig deploy access will have C+2 access.
[20:45:45] <Kemayo>	 James_F: Not sure that's true -- I certainly don't on mediawiki-config.
[20:46:37] <logmsgbot>	 !log kemayo@deploy1003 jforrester, kemayo: Backport for [[gerrit:1148909|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148908|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148422|VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (T388604)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:46:47] <James_F>	 Kemayo: Oh dear, you should file a task about that. C+2 is available for wmf-deployment https://gerrit.wikimedia.org/r/admin/repos/operations/mediawiki-config,access
[20:46:55] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845786 (RobH) So there isn't a specific team in mind for this host, it really depends on what we have for Config D use in next fiscal.  This fiscal, we purchased the following hostnam...
[20:48:20] <logmsgbot>	 !log kemayo@deploy1003 jforrester, kemayo: Continuing with sync
[20:48:52] <Tchanders>	 Kemayo: No need to apologise!
[20:50:27] <logmsgbot>	 jclark@cumin1002 provision (PID 139211) is awaiting input
[20:53:03] <wikibugs>	 (PS1) SBassett: Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA" [mediawiki-config] - https://gerrit.wikimedia.org/r/1148956
[20:53:32] <Kemayo>	 James_F: I guess that I skirted through the need to be in wmf-deployment, presumably because scap removed the actual requirement of being able to directly merge things.
[20:53:53] <James_F>	 Hmm, I thought spiderpig was just meant to be a nicer way of having the same access.
[20:55:19] <logmsgbot>	 !log kemayo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148909|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148908|Extend the mobile insert menu config so that tools can be specified (T388604)]], [[gerrit:1148422|VisualEditor: Deploy '+' mobile menu (and new tools) to Phase 1 wikis (T388604)]] (duration: 10m 56s)
[20:55:21] <Kemayo>	 I assume it's because it offloads the +2 to TrainBranchBot, so the actual user running scap or spiderpig doesn't need the access.
[20:55:23] <stashbot>	 T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604
[20:55:38] <James_F>	 Aye.
[20:55:43] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:55:44] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2003.codfw.wmnet with OS bookworm
[20:55:49] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10845818 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm completed: - sretest2003 (**WARN**)...
[20:55:56] <James_F>	 Anyway, you're done but the next window is in five minutes' time (and it's mine).
[20:55:56] <logmsgbot>	 jclark@cumin1002 provision (PID 139211) is awaiting input
[20:56:32] <Kemayo>	 o7
[20:56:34] <James_F>	 sbassett: Did you need to emergency-deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1148956 ?
[20:57:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:57:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:00:04] <logmsgbot>	 !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@2bce0c7]: Deploy Airflow artifact for T392494 and T394310.
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T2100)
[21:00:08] <stashbot>	 T392494: Add data quality metrics to mediawiki_content_current_v1 - https://phabricator.wikimedia.org/T392494
[21:00:14] <logmsgbot>	 jclark@cumin1002 provision (PID 139216) is awaiting input
[21:00:46] <icinga-wm>	 PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100%
[21:00:56] <icinga-wm>	 PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100%
[21:01:00] <logmsgbot>	 !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@2bce0c7]: Deploy Airflow artifact for T392494 and T394310. (duration: 00m 55s)
[21:02:14] <icinga-wm>	 RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms
[21:02:16] <wikibugs>	 (CR) David Martin: [C:+2] wikifunctions: Update evaluators from 2025-05-12-235119 to 2025-05-21-192515 [deployment-charts] - https://gerrit.wikimedia.org/r/1148954 (https://phabricator.wikimedia.org/T385899) (owner: Jforrester)
[21:03:59] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Update evaluators from 2025-05-12-235119 to 2025-05-21-192515 [deployment-charts] - https://gerrit.wikimedia.org/r/1148954 (https://phabricator.wikimedia.org/T385899) (owner: Jforrester)
[21:04:46] <logmsgbot>	 jhancock@cumin2002 provision (PID 3624840) is awaiting input
[21:06:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:07:56] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch1065.eqiad.wmnet with OS bullseye
[21:08:20] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:08:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: wmf_auto_restart_krb5-admin-server.service on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:09:02] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:10:24] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:10:29] <wikibugs>	 SRE: when servers are about to run out of disk monitoring should notify the owners - https://phabricator.wikimedia.org/T394955 (Dzahn) NEW
[21:10:46] <wikibugs>	 SRE: when servers are about to run out of disk monitoring should notify the owners - https://phabricator.wikimedia.org/T394955#10845869 (Dzahn)
[21:11:11] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:11:31] <wikibugs>	 SRE, serviceops-radar: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10845871 (Dzahn) >>! In T392834#10782948, @Ladsgroup wrote: > Yeah. Can you file a ticket for better monitoring?  done. T394955
[21:12:02] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:12:59] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1103.eqiad.wmnet with OS bullseye
[21:13:09] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:15:01] <wikibugs>	 (CR) David Martin: [C:+2] wikifunctions: Update orchestrator from 2025-05-14-112404 to 2025-05-21-192453 [deployment-charts] - https://gerrit.wikimedia.org/r/1148857 (https://phabricator.wikimedia.org/T385899) (owner: Jforrester)
[21:15:43] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10845884 (Jclark-ctr)
[21:15:59] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:16:21] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1007.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:16:56] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Update orchestrator from 2025-05-14-112404 to 2025-05-21-192453 [deployment-charts] - https://gerrit.wikimedia.org/r/1148857 (https://phabricator.wikimedia.org/T385899) (owner: Jforrester)
[21:18:06] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[21:18:30] <logmsgbot>	 !log dmartin@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[21:19:05] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[21:19:36] <logmsgbot>	 !log dmartin@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[21:19:51] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[21:20:25] <logmsgbot>	 !log dmartin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[21:25:12] <wikibugs>	 (PS1) Dzahn: aprepo: allow gitlab-ce and gitlab-runner versions > 17.10 < 17.11 [puppet] - https://gerrit.wikimedia.org/r/1148966 (https://phabricator.wikimedia.org/T394953)
[21:25:34] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[21:25:44] <icinga-wm>	 PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100%
[21:26:05] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845911 (Jhancock.wm) okay. i'll do this bios test and wrap up the task then. I'll keep some notes for the next dcops meeting
[21:26:12] <wikibugs>	 (CR) Dzahn: [C:+2] aprepo: allow gitlab-ce and gitlab-runner versions > 17.10 < 17.11 [puppet] - https://gerrit.wikimedia.org/r/1148966 (https://phabricator.wikimedia.org/T394953) (owner: Dzahn)
[21:29:24] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1008,1009 - jclark@cumin1002"
[21:29:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for thanos-be1008,1009 - jclark@cumin1002"
[21:29:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:33:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:36:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm
[21:36:21] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10845938 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm
[21:36:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2004.codfw.wmnet with OS bookworm
[21:36:50] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10845939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2004.codfw.wmnet with OS bookworm
[21:37:31] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:37:43] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:43:14] <icinga-wm>	 RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms
[21:43:38] <icinga-wm>	 RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[21:49:50] <wikibugs>	 (PS1) Ryan Kemper: wdqs: remove SLI/SLO for public wdqs [puppet] - https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966)
[21:50:10] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[21:51:01] <wikibugs>	 (CR) CI reject: [V:-1] wdqs: remove SLI/SLO for public wdqs [puppet] - https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[21:51:16] <wikibugs>	 (PS2) Ryan Kemper: wdqs: remove SLI/SLO for public wdqs [puppet] - https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966)
[21:51:32] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1008.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:51:56] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1009.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[21:52:25] <wikibugs>	 (PS3) Ryan Kemper: wdqs: remove SLI/SLO for public wdqs [puppet] - https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966)
[21:52:25] <wikibugs>	 (CR) CI reject: [V:-1] wdqs: remove SLI/SLO for public wdqs [puppet] - https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[21:53:34] <wikibugs>	 (CR) Ryan Kemper: [C:+2] "@volans great catch! will fix these and/or delete these cookbooks" [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[21:56:32] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[21:58:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[21:59:34] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1006.eqiad.wmnet with OS bullseye
[21:59:47] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10846017 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1006.eqiad.wmnet with OS bull...
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250521T2200)
[22:00:36] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10846019 (Jclark-ctr)
[22:02:30] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2004.codfw.wmnet with reason: host reimage
[22:02:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch1103.eqiad.wmnet with OS bullseye
[22:02:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host thanos-be1007.eqiad.wmnet with OS bullseye
[22:02:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch1103
[22:02:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch1103
[22:02:50] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.02 - 2025.05.23): Upgrade an-worker hard drives from 4TB to 8TB (group 4 - rack F3) - https://phabricator.wikimedia.org/T390171#10846026 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host thanos-be1...
[22:02:59] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Upgrading to Java 11.0.27 - eevans@cumin1002
[22:06:49] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be100[6-9] - https://phabricator.wikimedia.org/T392909#10846031 (Jclark-ctr) @MatthewVernon  these have been provisioned but look like i need to disable TLS. Will try again tomorrow
[22:06:50] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10846032 (Papaul) ` Case 2025-0520-703157 has been updated by Mathias Zuniga   UPDATE HAS BEEN ADDED:   Hi Papaul,  Thank you for your update, I have opened a t...
[22:08:32] <jinxer-wm>	 FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:08:43] <wikibugs>	 (PS1) Ryan Kemper: wdqs: add SLIs for main & scholarly [puppet] - https://gerrit.wikimedia.org/r/1148976
[22:09:22] <wikibugs>	 ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10846033 (Papaul) p:Triage→High a:Jhancock.wm
[22:09:31] <wikibugs>	 (CR) Bking: [C:+1] wdqs: remove SLI/SLO for public wdqs [puppet] - https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[22:09:49] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: remove SLI/SLO for public wdqs [puppet] - https://gerrit.wikimedia.org/r/1148974 (https://phabricator.wikimedia.org/T393966) (owner: Ryan Kemper)
[22:13:20] <wikibugs>	 (PS1) Ryan Kemper: wdqs: nuke previously absented pyrra update lag [puppet] - https://gerrit.wikimedia.org/r/1148979 (https://phabricator.wikimedia.org/T393966)
[22:16:10] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 54.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:16:26] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:19:58] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on an-redacteddb1001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:20:21] <logmsgbot>	 jhancock@cumin2002 reimage (PID 3641329) is awaiting input
[22:20:42] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2004.codfw.wmnet with OS bookworm
[22:20:46] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install sretest2004 (Config D 1P) - https://phabricator.wikimedia.org/T393986#10846049 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2004.codfw.wmnet with OS bookworm completed: - sretest2004 (**WARN**)...
[22:23:32] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:25:56] <icinga-wm>	 PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Fri 06 Jun 2025 10:25:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[22:41:01] <sbassett>	 Anybody care if I do a quick config deploy?  It’s basically reverting the recent os/cu 2fa enforcement to allow for more comms: https://gerrit.wikimedia.org/r/1148956
[22:43:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by sbassett@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1148956 (owner: SBassett)
[22:44:11] <wikibugs>	 (Merged) jenkins-bot: Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA" [mediawiki-config] - https://gerrit.wikimedia.org/r/1148956 (owner: SBassett)
[22:44:37] <logmsgbot>	 !log sbassett@deploy1003 Started scap sync-world: Backport for [[gerrit:1148956|Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA"]]
[22:46:57] <logmsgbot>	 !log sbassett@deploy1003 sbassett: Backport for [[gerrit:1148956|Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:47:54] <logmsgbot>	 !log sbassett@deploy1003 sbassett: Continuing with sync
[22:52:28] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release
[22:54:43] <logmsgbot>	 !log sbassett@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148956|Revert "OATHAuth: Mark checkuser and suppress as requiring 2FA"]] (duration: 10m 05s)
[22:56:10] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bookworm
[22:56:13] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10846097 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm executed with errors: - sretest2003 (...
[23:00:14] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release
[23:01:15] <wikibugs>	 SRE: when servers are about to run out of disk, monitoring should notify the owners - https://phabricator.wikimedia.org/T394955#10846113 (Reedy)
[23:02:48] <wikibugs>	 ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10846117 (Jhancock.wm) this server did take to uefi but does not want to reimage to bios for some reason. bios image had an issue with the drive/raid config but uefi did not. will reim...
[23:13:11] <wikibugs>	 (PS1) Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects [puppet] - https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803)
[23:14:40] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1002 is CRITICAL: 1.005e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[23:16:03] <wikibugs>	 (PS2) Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects [puppet] - https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803)
[23:18:32] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:21:45] <wikibugs>	 (CR) Dzahn: [C:+2] lists: include nftables throttling profile [puppet] - https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) (owner: Dzahn)
[23:23:00] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch1103.eqiad.wmnet with OS bullseye
[23:30:41] <wikibugs>	 (CR) Dzahn: [C:+2] "originally I just wanted to include this but not enable it yet.. but then I did." [puppet] - https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) (owner: Dzahn)
[23:31:54] <wikibugs>	 (CR) Dzahn: [C:+2] "active on lists2002 - but not active yet on lists1004 because puppet is still disabled for now" [puppet] - https://gerrit.wikimedia.org/r/1148432 (https://phabricator.wikimedia.org/T394519) (owner: Dzahn)
[23:39:24] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1148984
[23:39:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1148984 (owner: TrainBranchBot)
[23:52:19] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1148984 (owner: TrainBranchBot)