[00:04:25] FIRING: [2x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:26] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:10:09] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1138126
[00:10:09] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1138126 (owner: TrainBranchBot)
[00:10:39] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 631.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:29:14] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1138126 (owner: TrainBranchBot)
[00:46:37] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/4fbc422225e74397d6ee914983db2462e878535572f84fd2161502f1174caaef/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[01:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:38:06] SRE: Image not rendering properly on most projects - https://phabricator.wikimedia.org/T392435#10759238 (Samwilson) There is a similar error with https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Rail_Bridge_%28Humphery%29_Gayndah_%282002%29.jpg/330px-Rail_Bridge_%28Humphery%29_Gayndah_%282002%29.jpg...
[01:46:37] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:08:41] FIRING: [4x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:16:57] SRE: Image not rendering properly on most projects - https://phabricator.wikimedia.org/T392435#10759249 (Dylsss) > However with a different Accept header it works strangely : I think this was just cache splitting, it is now not working either The actual thumbnailing error from thumbor is the same for both...
[02:17:27] SRE: Image not rendering properly on most projects - https://phabricator.wikimedia.org/T392435#10759254 (Dylsss) →Duplicate dup:T381594
[02:22:39] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:42:13] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:44:11] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:51:53] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:52:51] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:05:53] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:06:51] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:35:48] FIRING: PuppetDisabled: Puppet disabled on elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=elasticsearch&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[03:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:26] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:09:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:14:52] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136975 (https://phabricator.wikimedia.org/T391311) (owner: Abijeet Patro)
[04:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[05:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:08:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:13:50] (PS3) Dzahn: gerrit: switchover to gerrit1003 [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:16:00] (PS4) Arnaudb: gerrit: switchover to gerrit1003 [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833)
[05:16:35] (CR) Arnaudb: gerrit: switchover to gerrit1003 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:31:59] (PS1) KartikMistry: Update cxserver to 2025-04-15-070132-production [deployment-charts] - https://gerrit.wikimedia.org/r/1138149 (https://phabricator.wikimedia.org/T391289)
[05:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:59:54] (CR) Abijeet Patro: [C:+1] Update cxserver to 2025-04-15-070132-production [deployment-charts] - https://gerrit.wikimedia.org/r/1138149 (https://phabricator.wikimedia.org/T391289) (owner: KartikMistry)
[06:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:05:21] RECOVERY - Disk space on an-worker1165 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1165&var-datasource=eqiad+prometheus/ops
[06:07:33] RECOVERY - Disk space on an-worker1116 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1116&var-datasource=eqiad+prometheus/ops
[06:08:41] FIRING: [4x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:11:09] RECOVERY - Disk space on an-worker1089 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1089&var-datasource=eqiad+prometheus/ops
[06:31:42] !log installing erlang security updates
[06:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:53] ops-eqiad, SRE, DC-Ops: Degraded RAID on cloudcephmon1004 - https://phabricator.wikimedia.org/T392424#10759380 (taavi)
[06:40:56] (PS1) Muehlenhoff: Fix typo in Cirrus Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1138152
[06:43:52] (PS1) Ilias Sarantopoulos: ml-services: enable multiprocessing for ptwiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1127851
[06:43:57] (PS2) Ilias Sarantopoulos: ml-services: enable multiprocessing for ptwiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1127851
[06:48:58] (CR) Kevin Bazira: [C:+1] ml-services: enable multiprocessing for ptwiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1127851 (owner: Ilias Sarantopoulos)
[06:55:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:00:05] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T0700).
[07:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:08] (CR) Hashar: [C:-1] "For the context, I am fine having the replica brought down for a short period of time. It supposedly only feed non interactive users (code" [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[07:02:15] abijeet: here?
[07:03:32] kart_, yes
[07:05:15] OK. Let's deploy your change.
[07:06:03] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136975 (https://phabricator.wikimedia.org/T391311) (owner: Abijeet Patro)
[07:06:53] (Merged) jenkins-bot: Add channel for ContentTranslation logging [mediawiki-config] - https://gerrit.wikimedia.org/r/1136975 (https://phabricator.wikimedia.org/T391311) (owner: Abijeet Patro)
[07:07:32] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1136975|Add channel for ContentTranslation logging (T391311)]]
[07:07:36] T391311: ContentTranslation: DBQueryError: Error 1062: Duplicate entry for key 'cx_corpora_unique' when saving - https://phabricator.wikimedia.org/T391311
[07:12:23] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1136975|Add channel for ContentTranslation logging (T391311)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:13:56] abijeet: possible to test on the testserver(s)?
[07:15:12] (CR) Slyngshede: [C:+1] idp-test: disable monitoring notifications, copy theme setting [puppet] - https://gerrit.wikimedia.org/r/1137329 (owner: Dzahn)
[07:15:28] (CR) Slyngshede: [C:+2] idp-test: disable monitoring notifications, copy theme setting [puppet] - https://gerrit.wikimedia.org/r/1137329 (owner: Dzahn)
[07:15:43] kart_, hmm, no, right now there is nothing logging to that channel
[07:17:42] OK. Let's deploy.
[07:17:46] !log kartik@deploy1003 abi, kartik: Continuing with sync
[07:19:31] !log installing libapache2-mod-auth-openidc security updates
[07:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:26] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136975|Add channel for ContentTranslation logging (T391311)]] (duration: 16m 53s)
[07:24:30] T391311: ContentTranslation: DBQueryError: Error 1062: Duplicate entry for key 'cx_corpora_unique' when saving - https://phabricator.wikimedia.org/T391311
[07:26:54] (CR) Jelto: [V:+1 C:+2] miscweb: remove query-service from legacy vms [puppet] - https://gerrit.wikimedia.org/r/1136724 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[07:26:57] !log elukey@ganeti2032:~$ sudo gnt-instance modify -B memory=6g,vcpus=4 ml-serve-ctrl2001.codfw.wmnet - T392289
[07:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:01] T392289: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289
[07:27:02] !log elukey@ganeti2032:~$ sudo gnt-instance modify -B memory=6g,vcpus=4 ml-serve-ctrl2002.codfw.wmnet - T392289
[07:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:44] !log elukey@ganeti1048:~$ sudo gnt-instance modify -B memory=6g,vcpus=4 ml-serve-ctrl1002.eqiad.wmnet - T392289
[07:27:47] !log elukey@ganeti1048:~$ sudo gnt-instance modify -B memory=6g,vcpus=4 ml-serve-ctrl1001.eqiad.wmnet - T392289
[07:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:02] !log reboot ml-serve-ctrl* VMs to pick up new cpu/memory settings - T392289
[07:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:29] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet
[07:28:51] SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10759557 (ops-monitoring-bot) VM ml-serve-ctrl2002.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase vcores and memory
[07:32:13] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:32:45] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:32:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:32:56] FIRING: [2x] ProbeDown: Service miscweb2003:443 has failed probes (http_commons_query_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:33:18] (PS1) Majavah: quarry: Drop obsolete files [puppet] - https://gerrit.wikimedia.org/r/1138241
[07:33:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet
[07:35:48] FIRING: PuppetDisabled: Puppet disabled on elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=elasticsearch&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[07:36:45] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet
[07:36:58] SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10759603 (ops-monitoring-bot) VM ml-serve-ctrl2001.codfw.wmnet rebooted by elukey@cumin1002 with reason: Increase vcores and memory
[07:37:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:37:56] FIRING: [8x] ProbeDown: Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:38:30] (CR) Muehlenhoff: [C:+2] Fix typo in Cirrus Cumin alias [puppet] - https://gerrit.wikimedia.org/r/1138152 (owner: Muehlenhoff)
[07:40:17] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:41:13] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:41:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet
[07:44:46] (PS1) Brouberol: deployment_server: remove stream-enrichment-poc from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1138244 (https://phabricator.wikimedia.org/T392449)
[07:45:40] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1137747 (owner: Majavah)
[07:45:58] (CR) Majavah: [C:+2] hieradata: puppet-compiler: Drop obsolete key [puppet] - https://gerrit.wikimedia.org/r/1137747 (owner: Majavah)
[07:46:53] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet
[07:47:23] SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10759636 (ops-monitoring-bot) VM ml-serve-ctrl1001.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase vcores and memory
[07:47:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:47:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:47:56] FIRING: [9x] ProbeDown: Service miscweb1003:443 has failed probes (http_commons_query_wikimedia_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:48:24] jouncebot: nowandnext
[07:48:24] For the next 0 hour(s) and 11 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T0700)
[07:48:25] In 2 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1000)
[07:49:43] (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1137731 (https://phabricator.wikimedia.org/T386689) (owner: Majavah)
[07:50:35] (Merged) jenkins-bot: Add WMCS v6 range to relevant exclusions [mediawiki-config] - https://gerrit.wikimedia.org/r/1137731 (https://phabricator.wikimedia.org/T386689) (owner: Majavah)
[07:50:47] (PS1) Brouberol: deployment_server: remove echoserver from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1138246 (https://phabricator.wikimedia.org/T392455)
[07:50:56] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1137731|Add WMCS v6 range to relevant exclusions (T386689)]]
[07:51:00] T386689: Add new WMCS IPv6 ranges to MediaWiki configuration where required - https://phabricator.wikimedia.org/T386689
[07:51:17] (PS1) Brouberol: dse-k8s: uninstall echoserver [deployment-charts] - https://gerrit.wikimedia.org/r/1138247 (https://phabricator.wikimedia.org/T392455)
[07:51:51] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet
[07:52:06] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet
[07:52:25] SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10759692 (ops-monitoring-bot) VM ml-serve-ctrl1002.eqiad.wmnet rebooted by elukey@cumin1002 with reason: Increase vcores and memory
[07:52:35] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:52:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:52:50] (PS1) Brouberol: deployment_server: remove postresql-test from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1138248 (https://phabricator.wikimedia.org/T392456)
[07:52:56] FIRING: [10x] ProbeDown: Service miscweb1003:30443 has failed probes (http_query_wikidata_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:53:18] (PS1) Brouberol: dse-k8s: uninstall postgresql-test [deployment-charts] - https://gerrit.wikimedia.org/r/1138249 (https://phabricator.wikimedia.org/T392456)
[07:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:26] !log taavi@deploy1003 taavi: Backport for [[gerrit:1137731|Add WMCS v6 range to relevant exclusions (T386689)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:55:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:56:22] !log taavi@deploy1003 taavi: Continuing with sync
[07:56:23] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:56:47] (CR) Muehlenhoff: "This component includes the external containerd and the external docker-ce on Buster/Bullseye. These are no longer needed on Bookworm and " [puppet] - https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn)
[07:56:59] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet
[07:57:33] SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10759718 (elukey) Open→Resolved a:elukey
[07:57:36] SRE, Kubernetes: Increase vcpus on K8s control plane VMs - https://phabricator.wikimedia.org/T392289#10759720 (elukey)
[07:57:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[08:02:55] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137731|Add WMCS v6 range to relevant exclusions (T386689)]] (duration: 11m 58s)
[08:02:59] T386689: Add new WMCS IPv6 ranges to MediaWiki configuration where required - https://phabricator.wikimedia.org/T386689
[08:05:26] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:07:30] (PS1) Filippo Giunchedi: pontoon: disable package upgrades on first boot [puppet] - https://gerrit.wikimedia.org/r/1138251 (https://phabricator.wikimedia.org/T390822)
[08:09:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:16] (CR) Filippo Giunchedi: [C:+2] pontoon: disable package upgrades on first boot [puppet] - https://gerrit.wikimedia.org/r/1138251 (https://phabricator.wikimedia.org/T390822) (owner: Filippo Giunchedi)
[08:18:38] !log installing openjpeg2 security updates
[08:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:04] (PS1) Jelto: microsites: fix regex_matches for query.wikidata.org [puppet] - https://gerrit.wikimedia.org/r/1138255 (https://phabricator.wikimedia.org/T350793)
[08:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:24:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host mc-misc2001.codfw.wmnet
[08:24:39] (CR) Jelto: [C:+2] microsites: fix regex_matches for query.wikidata.org [puppet] - https://gerrit.wikimedia.org/r/1138255 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[08:29:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2001.codfw.wmnet
[08:31:56] ops-eqiad, SRE, DC-Ops: Degraded RAID on cloudcephmon1004 - https://phabricator.wikimedia.org/T392424#10759788 (dcaro) First error from dmesg: ` [Tue Apr 22 12:17:29 2025] sd 0:0:1:0: attempting task abort!scmd(0x000000008c6219ca), outstanding for 61144 ms & timeout 60000 ms [Tue Apr 22 12:17:29 2025...
[08:33:46] ops-eqiad, SRE, DC-Ops: Degraded RAID on cloudcephmon1004 - https://phabricator.wikimedia.org/T392424#10759807 (dcaro)
[08:35:29] ops-eqiad, SRE, DC-Ops: Degraded RAID on cloudcephmon1004 - https://phabricator.wikimedia.org/T392424#10759816 (dcaro) ` root@cloudcephmon1004:~# sudo mdadm --detail /dev/md0 /dev/md0: Version : 1.2 Creation Time : Tue Nov 26 11:34:32 2024 Raid Level : raid10 Array Siz...
[08:37:02] ops-eqiad, DC-Ops: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458#10759818 (taavi)
[08:43:59] (PS1) Superpes15: Add throttle exemptions for some Edit-a-thons [mediawiki-config] - https://gerrit.wikimedia.org/r/1138289 (https://phabricator.wikimedia.org/T391764)
[08:47:56] FIRING: [2x] ProbeDown: Service miscweb1003:30443 has failed probes (http_query_wikidata_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:49:12] (CR) FNegri: prometheus: cloudvirt-libvirt-stats: Ignore file paths as well (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563) (owner: Majavah)
[08:49:25] (CR) Stevemunene: [C:+1] deployment_server: remove stream-enrichment-poc from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1138244 (https://phabricator.wikimedia.org/T392449) (owner: Brouberol)
[08:50:18] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (1) discovery-search elastic1067 out of eqiad D6 - https://phabricator.wikimedia.org/T391542#10759894 (brouberol) Invalid→Resolved
[08:50:21] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (4) data-platform-sre hosts out of eqiad D6 - https://phabricator.wikimedia.org/T391539#10759897 (brouberol) Invalid→Resolved
[08:50:24] (PS1) Muehlenhoff: os-reports: Fix report generation [puppet] - https://gerrit.wikimedia.org/r/1138290
[08:50:40] (CR) FNegri: [C:+1] "Nice, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1137727 (https://phabricator.wikimedia.org/T366471) (owner: Majavah)
[08:52:37] (CR) Brouberol: [C:+2] Configure a scap deployment of mediwiki-dumps-legacy [puppet] - https://gerrit.wikimedia.org/r/1130683 (https://phabricator.wikimedia.org/T389786) (owner: Btullis)
[08:52:56] RESOLVED: [2x] ProbeDown: Service miscweb1003:30443 has failed probes (http_query_wikidata_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:53:17] (CR) FNegri: hieradata: Drop old cloudinfra cumin hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1137728 (https://phabricator.wikimedia.org/T367725) (owner: Majavah)
[08:53:35] (CR) Stevemunene: [C:+1] deployment_server: remove postresql-test from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1138248 (https://phabricator.wikimedia.org/T392456) (owner: Brouberol)
[08:54:28] (CR) FNegri: "LGTM, but I don't know anything about the backup, maybe Andrew does." [puppet] - https://gerrit.wikimedia.org/r/1138241 (owner: Majavah)
[08:55:20] (CR) Stevemunene: [C:+1] dse-k8s: uninstall postgresql-test [deployment-charts] - https://gerrit.wikimedia.org/r/1138249 (https://phabricator.wikimedia.org/T392456) (owner: Brouberol)
[08:56:02] (CR) Majavah: [C:+2] P:toolforge: redis_sentinel: Fix non-breaking spaces [puppet] - https://gerrit.wikimedia.org/r/1137726 (owner: Majavah)
[08:56:09] (CR) Majavah: [V:+1 C:+2] P:toolforge: redis_sentinel: Don't try to set client name [puppet] - https://gerrit.wikimedia.org/r/1137727 (https://phabricator.wikimedia.org/T366471) (owner: Majavah)
[08:56:22] (CR) Brouberol: [C:+2] dse-k8s: uninstall postgresql-test [deployment-charts] - https://gerrit.wikimedia.org/r/1138249 (https://phabricator.wikimedia.org/T392456) (owner: Brouberol)
[08:56:48] (CR) Stevemunene: [C:+1] dse-k8s: uninstall echoserver [deployment-charts] - https://gerrit.wikimedia.org/r/1138247 (https://phabricator.wikimedia.org/T392455) (owner: Brouberol)
[08:57:20] (CR) Brouberol: [C:+2] dse-k8s: uninstall echoserver [deployment-charts] - https://gerrit.wikimedia.org/r/1138247 (https://phabricator.wikimedia.org/T392455) (owner: Brouberol)
[08:57:29] (CR) Slyngshede: [C:+2] Netbox: simplify query [alerts] - https://gerrit.wikimedia.org/r/1136989 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:57:33] (CR) Stevemunene: [C:+1] deployment_server: remove echoserver from dse-k8s-eqiad [puppet] - https://gerrit.wikimedia.org/r/1138246 (https://phabricator.wikimedia.org/T392455) (owner: Brouberol)
[08:57:34] ops-eqiad, DC-Ops: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458#10759910 (dcaro) The host is still [under warranty](https://www.dell.com/support/home/en-ie/product-support/servicetag/0-WGwrNmNaMUlyQVdYcG9BL2FaU3J4dz090/overview). ` root@cloudceph...
[08:57:36] (03CR) 10Majavah: hieradata: Drop old cloudinfra cumin hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137728 (https://phabricator.wikimedia.org/T367725) (owner: 10Majavah) [08:57:41] (03CR) 10Elukey: [C:03+1] os-reports: Fix report generation [puppet] - 10https://gerrit.wikimedia.org/r/1138290 (owner: 10Muehlenhoff) [08:58:36] (03PS2) 10Cathal Mooney: Delegate WMCS Eqiad ranges to OpenStack auth dns [dns] - 10https://gerrit.wikimedia.org/r/1113527 (https://phabricator.wikimedia.org/T380746) [08:59:08] (03PS4) 10Majavah: prometheus: cloudvirt-libvirt-stats: Ignore file paths as well [puppet] - 10https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563) [08:59:09] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! Addresses match what we've set up in terms of static routes on the cloudsw." [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [08:59:16] (03CR) 10Majavah: prometheus: cloudvirt-libvirt-stats: Ignore file paths as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563) (owner: 10Majavah) [09:00:29] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: enable IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1137793 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [09:03:18] (03CR) 10Muehlenhoff: [C:03+2] os-reports: Fix report generation [puppet] - 10https://gerrit.wikimedia.org/r/1138290 (owner: 10Muehlenhoff) [09:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:04:15] !log aborrero@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet [09:06:44] (03CR) 10FNegri: 
[C:03+1] hieradata: Drop old cloudinfra cumin hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137728 (https://phabricator.wikimedia.org/T367725) (owner: 10Majavah) [09:07:27] (03CR) 10FNegri: [C:03+1] prometheus: cloudvirt-libvirt-stats: Ignore file paths as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563) (owner: 10Majavah) [09:08:01] (03CR) 10Cathal Mooney: Delegate WMCS Eqiad ranges to OpenStack auth dns [dns] - 10https://gerrit.wikimedia.org/r/1113527 (https://phabricator.wikimedia.org/T380746) (owner: 10Cathal Mooney) [09:08:44] (03CR) 10Majavah: [C:03+2] prometheus: cloudvirt-libvirt-stats: Ignore file paths as well [puppet] - 10https://gerrit.wikimedia.org/r/1137218 (https://phabricator.wikimedia.org/T289563) (owner: 10Majavah) [09:10:34] (03CR) 10Brouberol: [C:03+2] deployment_server: remove stream-enrichment-poc from dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1138244 (https://phabricator.wikimedia.org/T392449) (owner: 10Brouberol) [09:10:39] (03CR) 10Brouberol: [C:03+2] deployment_server: remove postresql-test from dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1138248 (https://phabricator.wikimedia.org/T392456) (owner: 10Brouberol) [09:10:43] (03CR) 10Brouberol: [C:03+2] deployment_server: remove echoserver from dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1138246 (https://phabricator.wikimedia.org/T392455) (owner: 10Brouberol) [09:10:54] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1004.eqiad.wmnet [09:16:44] (03CR) 10Jaime Nuche: [C:03+1] "`deploy_user` and `scap::user` are two different users. 
The former is used to deploy a tool using scap, the latter is used by scap to inst" [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [09:17:39] !log installing tomcat9 security updates [09:17:45] (03PS1) 10Jelto: microsites: remove profile::microsites::query_service* [puppet] - 10https://gerrit.wikimedia.org/r/1138296 (https://phabricator.wikimedia.org/T350793) [09:17:47] !log aborrero@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet [09:22:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10760010 (10fnegri) 05In progress→03Resolved There was definitely an improvement, but Inlet Temp for clouddumps1001 remains about... [09:23:10] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5328/console" [puppet] - 10https://gerrit.wikimedia.org/r/1138296 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:24:12] (03CR) 10Arnaudb: [C:03+1] "looks good to me !" 
[puppet] - 10https://gerrit.wikimedia.org/r/1138296 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:24:22] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet [09:24:27] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: fix IPv6 range on VIPs [puppet] - 10https://gerrit.wikimedia.org/r/1138297 (https://phabricator.wikimedia.org/T380174) [09:25:26] (03CR) 10Cathal Mooney: [C:03+1] cloudgw: fix IPv6 range on VIPs [puppet] - 10https://gerrit.wikimedia.org/r/1138297 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [09:25:52] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: fix IPv6 range on VIPs [puppet] - 10https://gerrit.wikimedia.org/r/1138297 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [09:28:25] * elukey bbl, interview [09:28:29] uff sorry [09:29:07] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [09:33:55] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct dns record for cloudgw vip eqiad - cmooney@cumin1002" [09:33:58] (03CR) 10Jelto: [V:03+1 C:03+2] microsites: remove profile::microsites::query_service* [puppet] - 10https://gerrit.wikimedia.org/r/1138296 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:34:11] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct dns record for cloudgw vip eqiad - cmooney@cumin1002" [09:34:11] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:34:27] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache wan.cloudgw.eqiad1.wikimediacloud.org on all recursors [09:34:31] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wan.cloudgw.eqiad1.wikimediacloud.org on all recursors [09:35:31] FIRING: [2x] ProbeDown: 
Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:35:41] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache 2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.e.f.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa on all recursors [09:35:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.e.f.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa on all recursors [09:36:29] FIRING: NodeTextfileStale: Stale textfile for elastic2098:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:40:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:44:23] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: update wan VIP [puppet] - 10https://gerrit.wikimedia.org/r/1138304 (https://phabricator.wikimedia.org/T380174) [09:47:08] (03CR) 10Cathal Mooney: [C:03+1] cloudgw: update wan VIP [puppet] - 10https://gerrit.wikimedia.org/r/1138304 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [09:47:23] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: update wan VIP [puppet] - 10https://gerrit.wikimedia.org/r/1138304 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [09:49:50] 06SRE, 10SRE-swift-storage, 06Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public on Wikimedia Commons - https://phabricator.wikimedia.org/T321869#10760117 
(10PMG) Same situation here: https://commons.wikimedia.org/wiki/File:Yankees_Baseball_(1)_(10562830654).jpg When I click on thumbnail I... [09:49:58] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:39] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:52:41] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [09:53:11] 06SRE, 10SRE-swift-storage, 06Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public on Wikimedia Commons - https://phabricator.wikimedia.org/T321869#10760121 (10PMG) Similar issues were reported in T231078 [09:54:04] 10SRE-swift-storage: 404 File not found: /v1/AUTH_mw/wikipedia-commons-local-public.ab/a/ab/Chirimena_%281341689114%29.jpg - https://phabricator.wikimedia.org/T231078#10760127 (10PMG) Similar issues were reported in T321869 [09:54:14] (03PS3) 10Arturo Borrero Gonzalez: openstack: networktests: enable IPv6 tests on eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1137785 (https://phabricator.wikimedia.org/T391325) [09:56:53] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct dns record for cloudgw vip eqiad - cmooney@cumin1002" [09:56:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: correct dns record for cloudgw vip eqiad - cmooney@cumin1002" [09:56:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:57:14] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache 
2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.e.f.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa on all recursors [09:57:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.e.f.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa on all recursors [09:57:24] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache 3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.e.f.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa on all recursors [09:57:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 3.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.e.f.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa on all recursors [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:58:09] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: use rack-specific default IPv6 routes [puppet] - 10https://gerrit.wikimedia.org/r/1138306 (https://phabricator.wikimedia.org/T380174) [09:58:20] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138306 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1000) [10:02:33] (03CR) 10Cathal Mooney: [C:03+1] cloudgw: use rack-specific default IPv6 routes [puppet] - 10https://gerrit.wikimedia.org/r/1138306 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [10:04:04] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: use rack-specific default IPv6 routes [puppet] - 10https://gerrit.wikimedia.org/r/1138306 (https://phabricator.wikimedia.org/T380174) (owner: 10Arturo Borrero Gonzalez) [10:06:18] (03PS2) 
10Majavah: hieradata: Drop old cloudinfra cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1137728 (https://phabricator.wikimedia.org/T367725) [10:06:32] (03CR) 10Majavah: hieradata: Drop old cloudinfra cumin hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137728 (https://phabricator.wikimedia.org/T367725) (owner: 10Majavah) [10:06:49] !log aborrero@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet [10:08:41] FIRING: [4x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:10:09] (03PS1) 10Vgutierrez: secret: Add wmfuniq snakeoil [labs/private] - 10https://gerrit.wikimedia.org/r/1138307 (https://phabricator.wikimedia.org/T391411) [10:11:09] (03PS1) 10Jelto: gerrit/nftables_throttling: add tracking_duration parameter [puppet] - 10https://gerrit.wikimedia.org/r/1138308 (https://phabricator.wikimedia.org/T392467) [10:11:18] (03CR) 10Vgutierrez: [C:03+2] secret: Add wmfuniq snakeoil [labs/private] - 10https://gerrit.wikimedia.org/r/1138307 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [10:11:24] (03CR) 10Vgutierrez: [V:03+2 C:03+2] secret: Add wmfuniq snakeoil [labs/private] - 10https://gerrit.wikimedia.org/r/1138307 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [10:12:12] 06SRE, 06Infrastructure-Foundations, 10netops: WMCS CloudGW: Null-route aggregate ranges in cloud vrf - https://phabricator.wikimedia.org/T392094#10760162 (10cmooney) 05Open→03Resolved a:03cmooney @arturo I think this has been solved by adding this /55 route on the cloudgw: ` cmooney@cloudgw1003:~$... 
[10:13:19] (03PS1) 10Majavah: P:openstack: cumin: Cleanup cumin master code [puppet] - 10https://gerrit.wikimedia.org/r/1138309 (https://phabricator.wikimedia.org/T367725) [10:13:34] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5329/co" [puppet] - 10https://gerrit.wikimedia.org/r/1138308 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [10:13:36] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1004.eqiad.wmnet [10:14:17] (03Abandoned) 10Majavah: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [10:16:25] (03PS3) 10Cathal Mooney: Delegate WMCS Eqiad ranges to OpenStack auth dns [dns] - 10https://gerrit.wikimedia.org/r/1113527 (https://phabricator.wikimedia.org/T380746) [10:17:03] (03CR) 10CI reject: [V:04-1] Delegate WMCS Eqiad ranges to OpenStack auth dns [dns] - 10https://gerrit.wikimedia.org/r/1113527 (https://phabricator.wikimedia.org/T380746) (owner: 10Cathal Mooney) [10:39:45] (03PS4) 10Vgutierrez: varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) [10:39:56] !log migrating various minor mobileapps/PCS APIs to serve via the rest-gateway instead of restbase [10:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:41] 06SRE, 10SRE-swift-storage, 06Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public on Wikimedia Commons - https://phabricator.wikimedia.org/T321869#10760216 (10MatthewVernon) @PMG please don't report similar-but-different problems on old tickets, it makes it much more difficult for the Swift t... 
[10:40:45] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:40:48] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:40:54] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [10:41:08] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [10:41:14] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [10:41:52] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [10:41:58] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [10:42:45] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [10:46:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T391056)', diff saved to https://phabricator.wikimedia.org/P75292 and previous config saved to /var/cache/conftool/dbconfig/20250423-104627-fceratto.json [10:46:31] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:53:00] !log installing php8.2 security updates [10:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:04] (03CR) 10Jaime Nuche: "We should reword the commit message though before merging this" [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [11:00:05] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1100). 
nyaa~ [11:01:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P75293 and previous config saved to /var/cache/conftool/dbconfig/20250423-110134-fceratto.json [11:01:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:04:24] (03CR) 10Jaime Nuche: "I was thinking maybe we can use a less stringent `require` than what we did last time. The `include scap::user` at the top of the class in" [puppet] - 10https://gerrit.wikimedia.org/r/1137818 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [11:06:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:10:48] (03CR) 10Muehlenhoff: "I think we should rather assign this via modules/admin/data/data.yaml like all other sudo permissions. 
The current method makes the sudo r" [puppet] - 10https://gerrit.wikimedia.org/r/1130947 (https://phabricator.wikimedia.org/T387823) (owner: 10Hashar) [11:14:20] (03PS1) 10Brouberol: airflow-test-k8s: increase the memory/cpu limit quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138314 (https://phabricator.wikimedia.org/T392470) [11:16:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P75294 and previous config saved to /var/cache/conftool/dbconfig/20250423-111641-fceratto.json [11:27:53] !log installing libxml2 security updates [11:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:11] (03CR) 10Ladsgroup: "I'd argue it's more confusing now since the config is split between labswiki and wikitech which actually made me miss an important config " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [11:31:31] (03CR) 10Ladsgroup: [C:03+1] Use `sul` dblist in InitialiseSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [11:31:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T391056)', diff saved to https://phabricator.wikimedia.org/P75295 and previous config saved to /var/cache/conftool/dbconfig/20250423-113148-fceratto.json [11:31:53] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:31:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: Maintenance [11:32:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T391056)', diff saved to https://phabricator.wikimedia.org/P75296 and previous config saved to /var/cache/conftool/dbconfig/20250423-113200-fceratto.json [11:35:48] FIRING: PuppetDisabled: Puppet disabled on elastic2098:9100 - 
https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=elasticsearch&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [11:36:33] (03PS1) 10Jgiannelos: rest gateway: Fix mobile-html-offline-resources URL pattern [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138317 [11:37:54] (03PS1) 10Hnowlan: rest-gateway: fix pathing for offline resources with revision [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) [11:39:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:50] (03CR) 10Jgiannelos: rest-gateway: fix pathing for offline resources with revision (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [11:40:09] (03PS7) 10Cyndywikime: Growth: Remove unused PHP config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128828 (https://phabricator.wikimedia.org/T388787) [11:40:45] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:40:51] (03CR) 10Jgiannelos: rest-gateway: fix pathing for offline resources with revision (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [11:42:14] (03CR) 10Jgiannelos: rest-gateway: fix pathing for offline resources with revision (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) 
[11:42:45] (03CR) 10Jgiannelos: rest-gateway: fix pathing for offline resources with revision (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [11:44:03] (03Abandoned) 10Jgiannelos: rest gateway: Fix mobile-html-offline-resources URL pattern [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138317 (owner: 10Jgiannelos) [11:46:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:48:06] (03CR) 10Stevemunene: [C:03+1] airflow-test-k8s: increase the memory/cpu limit quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138314 (https://phabricator.wikimedia.org/T392470) (owner: 10Brouberol) [11:50:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T391056)', diff saved to https://phabricator.wikimedia.org/P75297 and previous config saved to /var/cache/conftool/dbconfig/20250423-115054-fceratto.json [11:50:59] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:16] (03PS1) 10Filippo Giunchedi: envoyproxy: support loading stats_config [puppet] - 10https://gerrit.wikimedia.org/r/1138327 (https://phabricator.wikimedia.org/T391333) [11:57:59] 06SRE, 10SRE-swift-storage, 06Commons: File not found: /v1/AUTH_mw/wikipedia-commons-local-public on Wikimedia Commons (after delete and restore) - 
https://phabricator.wikimedia.org/T321869#10760407 (10Krinkle) [11:58:05] (03CR) 10CI reject: [V:04-1] envoyproxy: support loading stats_config [puppet] - 10https://gerrit.wikimedia.org/r/1138327 (https://phabricator.wikimedia.org/T391333) (owner: 10Filippo Giunchedi) [12:00:59] (03PS2) 10Hnowlan: rest-gateway: fix pathing for offline resources with revision [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) [12:01:13] (03CR) 10Hnowlan: rest-gateway: fix pathing for offline resources with revision (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [12:04:18] (03CR) 10Jgiannelos: [C:03+1] rest-gateway: fix pathing for offline resources with revision [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [12:04:42] (03CR) 10Hnowlan: [C:03+2] rest-gateway: fix pathing for offline resources with revision [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [12:06:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P75298 and previous config saved to /var/cache/conftool/dbconfig/20250423-120602-fceratto.json [12:06:14] (03Merged) 10jenkins-bot: rest-gateway: fix pathing for offline resources with revision [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138318 (https://phabricator.wikimedia.org/T385033) (owner: 10Hnowlan) [12:06:45] (03CR) 10Nikerabbit: [C:03+1] Update cxserver to 2025-04-15-070132-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138149 (https://phabricator.wikimedia.org/T391289) (owner: 10KartikMistry) [12:07:41] (03PS2) 10Filippo Giunchedi: envoyproxy: support loading stats_config [puppet] - 10https://gerrit.wikimedia.org/r/1138327 
(https://phabricator.wikimedia.org/T391333) [12:07:41] (03PS1) 10Filippo Giunchedi: envoyproxy: tweak default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/1138329 (https://phabricator.wikimedia.org/T391333) [12:08:12] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:08:18] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:09:29] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:09:37] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:09:47] (03CR) 10CI reject: [V:04-1] envoyproxy: support loading stats_config [puppet] - 10https://gerrit.wikimedia.org/r/1138327 (https://phabricator.wikimedia.org/T391333) (owner: 10Filippo Giunchedi) [12:10:22] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:10:35] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:10:53] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:11:17] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:11:51] (PS1) Hashar: gerrit: convert robots.txt to a flat file [puppet] - https://gerrit.wikimedia.org/r/1138330
[12:11:52] (PS1) Hashar: gerrit: prevent crawling of some URLs [puppet] - https://gerrit.wikimedia.org/r/1138331
[12:11:52] (CR) Alexandros Kosiaris: [C:+1] envoyproxy: tweak default histogram buckets [puppet] - https://gerrit.wikimedia.org/r/1138329 (https://phabricator.wikimedia.org/T391333) (owner: Filippo Giunchedi)
[12:11:56] SRE, Cloud-Services, serviceops: Move cloudweb to Ganeti VMs and repurpose the servers as wikikube nodes - https://phabricator.wikimedia.org/T392478 (MoritzMuehlenhoff) NEW The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.or...
[12:12:51] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:13:09] ops-eqiad, SRE, DC-Ops, cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10760475 (fnegri)
[12:13:13] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:14:09] (CR) Filippo Giunchedi: "I'm confused ATM by what's wrong with tox/CI in here: https://integration.wikimedia.org/ci/job/operations-puppet-tests-bullseye/9063/conso" [puppet] - https://gerrit.wikimedia.org/r/1138327 (https://phabricator.wikimedia.org/T391333) (owner: Filippo Giunchedi)
[12:14:43] (CR) Alexandros Kosiaris: [C:+1] "Totally unclear to me why rake fails here. LGTM otherwise."
[puppet] - https://gerrit.wikimedia.org/r/1138327 (https://phabricator.wikimedia.org/T391333) (owner: Filippo Giunchedi)
[12:14:57] SRE, cloud-services-team, Horizon, serviceops, Striker: Move cloudweb to Ganeti VMs and repurpose the servers as wikikube nodes - https://phabricator.wikimedia.org/T392478#10760479 (taavi)
[12:15:51] Deploying Cxserver (staging)
[12:16:01] (CR) KartikMistry: [C:+2] Update cxserver to 2025-04-15-070132-production [deployment-charts] - https://gerrit.wikimedia.org/r/1138149 (https://phabricator.wikimedia.org/T391289) (owner: KartikMistry)
[12:16:12] (CR) Arnaudb: [C:+1] gerrit: convert robots.txt to a flat file [puppet] - https://gerrit.wikimedia.org/r/1138330 (owner: Hashar)
[12:16:22] (CR) Arnaudb: [C:+1] gerrit: prevent crawling of some URLs [puppet] - https://gerrit.wikimedia.org/r/1138331 (owner: Hashar)
[12:17:07] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public
[12:17:18] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM."
[dns] - https://gerrit.wikimedia.org/r/1113527 (https://phabricator.wikimedia.org/T380746) (owner: Cathal Mooney)
[12:17:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc7 T391454', diff saved to https://phabricator.wikimedia.org/P75301 and previous config saved to /var/cache/conftool/dbconfig/20250423-121722-marostegui.json
[12:17:26] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[12:17:50] (Merged) jenkins-bot: Update cxserver to 2025-04-15-070132-production [deployment-charts] - https://gerrit.wikimedia.org/r/1138149 (https://phabricator.wikimedia.org/T391289) (owner: KartikMistry)
[12:17:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance
[12:18:28] (PS1) Marostegui: mariadb: Upgrade pc7 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1138332 (https://phabricator.wikimedia.org/T391454)
[12:18:43] (CR) Cathal Mooney: [C:+2] Delegate WMCS Eqiad ranges to OpenStack auth dns [dns] - https://gerrit.wikimedia.org/r/1113527 (https://phabricator.wikimedia.org/T380746) (owner: Cathal Mooney)
[12:19:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public
[12:19:10] !log cmooney@dns2005 START - running authdns-update
[12:20:24] akosiaris: Bumping chat no longer shows diff? re: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1138149 diff only shows docker image change.
[12:20:36] akoopal: chart*
[12:21:05] ah, it took some time. Nevermind.
[12:21:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P75302 and previous config saved to /var/cache/conftool/dbconfig/20250423-122110-fceratto.json
[12:21:14] (PS3) Filippo Giunchedi: envoyproxy: support loading stats_config [puppet] - https://gerrit.wikimedia.org/r/1138327 (https://phabricator.wikimedia.org/T391333)
[12:21:14] (PS2) Filippo Giunchedi: envoyproxy: tweak default histogram buckets [puppet] - https://gerrit.wikimedia.org/r/1138329 (https://phabricator.wikimedia.org/T391333)
[12:21:14] (PS1) Filippo Giunchedi: envoyproxy: update tox python versions [puppet] - https://gerrit.wikimedia.org/r/1138334
[12:21:17] !log cmooney@dns2005 END - running authdns-update
[12:23:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[12:24:15] (CR) Marostegui: [C:+2] mariadb: Upgrade pc7 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1138332 (https://phabricator.wikimedia.org/T391454) (owner: Marostegui)
[12:25:02] (CR) Filippo Giunchedi: "We weren't pythoning enough in tox" [puppet] - https://gerrit.wikimedia.org/r/1138334 (owner: Filippo Giunchedi)
[12:26:02] (PS1) KartikMistry: cxserver: Fix missing ' in the config [deployment-charts] - https://gerrit.wikimedia.org/r/1138335
[12:26:23] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all
[12:26:29] (CR) Filippo Giunchedi: [C:+2] envoyproxy: update tox python versions [puppet] - https://gerrit.wikimedia.org/r/1138334 (owner: Filippo Giunchedi)
[12:27:27] !log gerrit: removed obsolete 1024px-Sea_and_sky_light.cache.jpg file from all
servers. File was replaced by 2006-12-28_10h26_33.jpg # T392479
[12:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:30] T392479: Background image of login page is served with a short TTL - https://phabricator.wikimedia.org/T392479
[12:28:13] (PS2) KartikMistry: cxserver: Fix missing ' in the config [deployment-charts] - https://gerrit.wikimedia.org/r/1138335
[12:28:42] (PS1) Arturo Borrero Gonzalez: cloudgw: enable NAT for additional VXLAN subnets [puppet] - https://gerrit.wikimedia.org/r/1138336 (https://phabricator.wikimedia.org/T380174)
[12:28:56] (CR) Nikerabbit: Catalog ContentTranslation tables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: Nik Gkountas)
[12:29:02] (CR) Filippo Giunchedi: [C:+2] envoyproxy: support loading stats_config [puppet] - https://gerrit.wikimedia.org/r/1138327 (https://phabricator.wikimedia.org/T391333) (owner: Filippo Giunchedi)
[12:29:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc7 T391454', diff saved to https://phabricator.wikimedia.org/P75303 and previous config saved to /var/cache/conftool/dbconfig/20250423-122924-marostegui.json
[12:29:29] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[12:29:35] (CR) Cathal Mooney: [C:+1] cloudgw: enable NAT for additional VXLAN subnets [puppet] - https://gerrit.wikimedia.org/r/1138336 (https://phabricator.wikimedia.org/T380174) (owner: Arturo Borrero Gonzalez)
[12:29:54] (CR) Arturo Borrero Gonzalez: [C:+2] cloudgw: enable NAT for additional VXLAN subnets [puppet] - https://gerrit.wikimedia.org/r/1138336 (https://phabricator.wikimedia.org/T380174) (owner: Arturo Borrero Gonzalez)
[12:30:00] (PS3) Muehlenhoff: Setup the new KDC with nftables [puppet] - https://gerrit.wikimedia.org/r/1133406 (https://phabricator.wikimedia.org/T390863)
[12:57:56] !log jmm@cumin2002 END
(PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw
[12:58:43] (CR) Slyngshede: [C:+2] Release version 0.1.11 [software/bitu] - https://gerrit.wikimedia.org/r/1135741 (owner: Slyngshede)
[13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1300).
[13:00:05] MatmaRex and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:18] I can’t deploy today, sorry
[13:00:25] hi
[13:00:28] oh oh oh I can!
[13:00:51] (PS1) Brouberol: mediawiki-dumps-legacy: increase the memory/cpu limit quotas [deployment-charts] - https://gerrit.wikimedia.org/r/1138348 (https://phabricator.wikimedia.org/T392470)
[13:00:52] my config patches are no-ops, nothing much to test
[13:01:26] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:01:45] (Merged) jenkins-bot: Release version 0.1.11 [software/bitu] - https://gerrit.wikimedia.org/r/1135741 (owner: Slyngshede)
[13:02:17] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1138096 (owner: D3r1ck01)
[13:02:18] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135851 (owner: Bartosz Dziewoński)
[13:02:28] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad
[13:02:33] first time using spiderpig :D
[13:03:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb1002.eqiad.wmnet
[13:03:36] (Merged) jenkins-bot: SUL3: Remove unused CentralAuthSharedDomainPrefix config setting [mediawiki-config] - https://gerrit.wikimedia.org/r/1138096 (owner: D3r1ck01)
[13:03:39] (Merged) jenkins-bot: Simplify CentralAuthEnableSul3 config setting value [mediawiki-config] - https://gerrit.wikimedia.org/r/1135851 (owner: Bartosz Dziewoński)
[13:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:55] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1138096|SUL3: Remove unused CentralAuthSharedDomainPrefix config setting]], [[gerrit:1135851|Simplify CentralAuthEnableSul3 config setting value]]
[13:04:22] Hi TheresNoTime Mine doesn't require testing too :)
[13:04:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad
[13:04:32] Superpes: ack :)
[13:04:44] Thanks :3
[13:05:12] (PS2)
Superpes15: Add throttle exemptions for some Edit-a-thons [mediawiki-config] - https://gerrit.wikimedia.org/r/1138289 (https://phabricator.wikimedia.org/T391764)
[13:05:28] !log aborrero@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet
[13:06:30] (PS1) Cyndywikime: Regenerate speed-test snapshot without GENewcomerTasksGuidanceEnabled [mediawiki-config] - https://gerrit.wikimedia.org/r/1138350 (https://phabricator.wikimedia.org/T379568)
[13:07:52] (CR) Brouberol: [C:+2] mediawiki-dumps-legacy: increase the memory/cpu limit quotas [deployment-charts] - https://gerrit.wikimedia.org/r/1138348 (https://phabricator.wikimedia.org/T392470) (owner: Brouberol)
[13:08:06] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:08:30] !log samtar@deploy1003 d3r1ck01, matmarex, samtar: Backport for [[gerrit:1138096|SUL3: Remove unused CentralAuthSharedDomainPrefix config setting]], [[gerrit:1135851|Simplify CentralAuthEnableSul3 config setting value]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:08:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb1002.eqiad.wmnet
[13:08:43] !log samtar@deploy1003 d3r1ck01, matmarex, samtar: Continuing with sync
[13:09:23] (CR) Cyndywikime: "This patch is now ready for review :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1138350 (https://phabricator.wikimedia.org/T379568) (owner: Cyndywikime)
[13:10:12] (CR) Arturo Borrero Gonzalez: [C:+2] openstack: networktests: enable IPv6 tests on eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1137785 (https://phabricator.wikimedia.org/T391325) (owner: Arturo Borrero Gonzalez)
[13:10:32] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet
[13:11:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance
db1186', diff saved to https://phabricator.wikimedia.org/P75307 and previous config saved to /var/cache/conftool/dbconfig/20250423-131117-fceratto.json
[13:12:57] (PS1) Majavah: P:wmcs::google_api_proxy: Update network ACLs [puppet] - https://gerrit.wikimedia.org/r/1138351
[13:14:40] (CR) Majavah: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5333/console" [puppet] - https://gerrit.wikimedia.org/r/1138351 (owner: Majavah)
[13:15:23] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138096|SUL3: Remove unused CentralAuthSharedDomainPrefix config setting]], [[gerrit:1135851|Simplify CentralAuthEnableSul3 config setting value]] (duration: 11m 28s)
[13:15:44] MatmaRex: done ^
[13:16:04] thanks
[13:16:08] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1138289 (https://phabricator.wikimedia.org/T391764) (owner: Superpes15)
[13:16:54] (Merged) jenkins-bot: Add throttle exemptions for some Edit-a-thons [mediawiki-config] - https://gerrit.wikimedia.org/r/1138289 (https://phabricator.wikimedia.org/T391764) (owner: Superpes15)
[13:17:07] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1138289|Add throttle exemptions for some Edit-a-thons (T391764 T391999)]]
[13:18:12] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138096|SUL3: Remove unused CentralAuthSharedDomainPrefix config setting]], [[gerrit:1135851|Simplify CentralAuthEnableSul3 config setting value]] (duration: 11m 28s)
[13:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:36] (PS1) Kamila Součková: mw-cron: set php.version [deployment-charts] - https://gerrit.wikimedia.org/r/1138352 (https://phabricator.wikimedia.org/T392441)
[13:21:47] (PS1) Arturo Borrero Gonzalez: openstack: eqiad1: networktests: fix
typos [puppet] - https://gerrit.wikimedia.org/r/1138353 (https://phabricator.wikimedia.org/T380174)
[13:21:58] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2098 to cirrussearch2098
[13:21:58] !log samtar@deploy1003 superpes, samtar: Backport for [[gerrit:1138289|Add throttle exemptions for some Edit-a-thons (T391764 T391999)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:22:04] !log samtar@deploy1003 superpes, samtar: Continuing with sync
[13:22:05] T391764: Lift IP cap on 4 days in June and July 2025 for Editation for jawiki - https://phabricator.wikimedia.org/T391764
[13:22:06] T391999: Lift IP cap on 2025-04-29, 05-06, 05-13, 05-20, 06-03, 06-10, 06-17 for edit-a-thon for eswiki, commons and wikidata. - https://phabricator.wikimedia.org/T391999
[13:22:19] !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:23:41] !log installing Linux 6.1.133 on Bookworm hosts
[13:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:52] (CR) Hnowlan: [C:+1] mw-cron: set php.version [deployment-charts] - https://gerrit.wikimedia.org/r/1138352 (https://phabricator.wikimedia.org/T392441) (owner: Kamila Součková)
[13:25:29] (CR) Kamila Součková: [C:+2] mw-cron: set php.version [deployment-charts] - https://gerrit.wikimedia.org/r/1138352 (https://phabricator.wikimedia.org/T392441) (owner: Kamila Součková)
[13:25:49] (CR) Arturo Borrero Gonzalez: [C:+2] openstack: eqiad1: networktests: fix typos [puppet] - https://gerrit.wikimedia.org/r/1138353 (https://phabricator.wikimedia.org/T380174) (owner: Arturo Borrero Gonzalez)
[13:26:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P75308 and previous config saved to /var/cache/conftool/dbconfig/20250423-132624-fceratto.json
[13:26:28] fceratto@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[13:26:38] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2098 to cirrussearch2098 - bking@cumin2002"
[13:26:51] (Merged) jenkins-bot: mw-cron: set php.version [deployment-charts] - https://gerrit.wikimedia.org/r/1138352 (https://phabricator.wikimedia.org/T392441) (owner: Kamila Součková)
[13:28:11] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2098 to cirrussearch2098 - bking@cumin2002"
[13:28:11] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:28:12] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2098
[13:28:49] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138289|Add throttle exemptions for some Edit-a-thons (T391764 T391999)]] (duration: 11m 42s)
[13:28:53] T391764: Lift IP cap on 4 days in June and July 2025 for Editation for jawiki - https://phabricator.wikimedia.org/T391764
[13:28:54] !log kamila@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[13:28:54] T391999: Lift IP cap on 2025-04-29, 05-06, 05-13, 05-20, 06-03, 06-10, 06-17 for edit-a-thon for eswiki, commons and wikidata.
- https://phabricator.wikimedia.org/T391999
[13:29:03] Superpes: done ^
[13:29:19] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2098
[13:29:20] !log kamila@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[13:29:34] !log kamila@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[13:29:53] !log kamila@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[13:30:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2098 to cirrussearch2098
[13:31:05] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2098.codfw.wmnet on all recursors
[13:31:09] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2098.codfw.wmnet on all recursors
[13:31:28] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2098.codfw.wmnet with OS bullseye
[13:31:40] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2098
[13:31:47] !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:34:47] !log T392462 Ran fixStuckGlobalRename.php for two users
[13:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:52] T392462: Unblock stuck global renames of Renamed user d3a1e0becdf319b376c52028d0ac3cf1 and Débora - https://phabricator.wikimedia.org/T392462
[13:35:09] (CR) MVernon: [C:+1] "The mountpoints on the new system look good (and on the old system look unchanged), so I think this is worth a shot. Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) (owner: Elukey)
[13:35:11] Thanks TheresNoTime
[13:35:13] :3
[13:36:02] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2098 - bking@cumin2002"
[13:36:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2098 - bking@cumin2002"
[13:36:08] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:36:09] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2098.codfw.wmnet 217.32.192.10.in-addr.arpa 7.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:36:12] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2098.codfw.wmnet 217.32.192.10.in-addr.arpa 7.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:36:13] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2098
[13:36:34] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2098
[13:36:34] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2098
[13:37:16] (PS1) Majavah: hieradata: Set eqiad1 domain_id_internal_reverse_v6 [puppet] - https://gerrit.wikimedia.org/r/1138356 (https://phabricator.wikimedia.org/T380174)
[13:37:59] (PS7) Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[13:38:22] (CR) CI reject: [V:-1] [toolforge] persist target logs in /var/log/pods in journald [puppet] -
https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[13:38:34] (CR) Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[13:38:40] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2099 to cirrussearch2099
[13:39:03] !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:39:59] (PS8) Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081)
[13:41:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T391056)', diff saved to https://phabricator.wikimedia.org/P75309 and previous config saved to /var/cache/conftool/dbconfig/20250423-134131-fceratto.json
[13:41:36] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[13:41:36] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1195.eqiad.wmnet with reason: Maintenance
[13:41:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T391056)', diff saved to https://phabricator.wikimedia.org/P75310 and previous config saved to /var/cache/conftool/dbconfig/20250423-134142-fceratto.json
[13:41:54] (CR) Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[13:43:11] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2099 to cirrussearch2099 - bking@cumin2002"
[13:43:30] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2099 to cirrussearch2099 - bking@cumin2002"
[13:43:31] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:43:32] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2099
[13:43:44] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2099
[13:43:58] (CR) Elukey: [V:+1 C:+2] profile::swift::storage: allow non-scsi id matches for object partitions [puppet] - https://gerrit.wikimedia.org/r/1137243 (https://phabricator.wikimedia.org/T391854) (owner: Elukey)
[13:44:07] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5335/co" [puppet] - https://gerrit.wikimedia.org/r/1138356 (https://phabricator.wikimedia.org/T380174) (owner: Majavah)
[13:44:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2099 to cirrussearch2099
[13:44:27] (CR) Majavah: [V:+1 C:+2] hieradata: Set eqiad1 domain_id_internal_reverse_v6 [puppet] - https://gerrit.wikimedia.org/r/1138356 (https://phabricator.wikimedia.org/T380174) (owner: Majavah)
[13:45:19] (PS1) Jforrester: wikifunctions: Update evaluators from 2025-04-09-214434 to 2025-04-16-213143 [deployment-charts] - https://gerrit.wikimedia.org/r/1138361 (https://phabricator.wikimedia.org/T391731)
[13:45:21] (PS1) Jforrester: wikifunctions: Update orchestrator from 2025-04-16-192052 to 2025-04-17-170156 [deployment-charts] - https://gerrit.wikimedia.org/r/1138362 (https://phabricator.wikimedia.org/T391731)
[13:46:07] TheresNoTime: can I add one more wmf-config patch to the current window, or is it too late?
[13:46:35] ori: go for it :) can you add it to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1300 ?
[13:46:49] thanks!
doing.
[13:47:43] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:47:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118208 (https://phabricator.wikimedia.org/T391516) (owner: Ori)
[13:48:20] ops-codfw, DC-Ops, SRE Observability: kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488 (MoritzMuehlenhoff) NEW
[13:48:51] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1118208 (https://phabricator.wikimedia.org/T391516) (owner: Ori)
[13:49:40] (Merged) jenkins-bot: Remove temporary '-php8' and '-k8s' suffixes from ArcLamp pipeline [mediawiki-config] - https://gerrit.wikimedia.org/r/1118208 (https://phabricator.wikimedia.org/T391516) (owner: Ori)
[13:49:54] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1118208|Remove temporary '-php8' and '-k8s' suffixes from ArcLamp pipeline (T391516)]]
[13:49:58] T391516: https://performance.wikimedia.org/php-profiling/ leads to 404 for all listed sources - https://phabricator.wikimedia.org/T391516
[13:53:13] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2099.codfw.wmnet on all recursors
[13:53:16] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2099.codfw.wmnet on all recursors
[13:53:21] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2098.codfw.wmnet with reason: host reimage
[13:53:37] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2099.codfw.wmnet with OS bullseye
[13:53:49] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2099
[13:54:26] !log samtar@deploy1003
ori, samtar: Backport for [[gerrit:1118208|Remove temporary '-php8' and '-k8s' suffixes from ArcLamp pipeline (T391516)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:54:28] !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:54:30] ori: will you be able to test this?
[13:54:44] (PS1) Effie Mouzeli: switch mwdebug1001 to php8.1 [puppet] - https://gerrit.wikimedia.org/r/1138363 (https://phabricator.wikimedia.org/T391452)
[13:55:50] (CR) Effie Mouzeli: [C:+2] "Last host to be reimaged, thus self +2ing" [puppet] - https://gerrit.wikimedia.org/r/1138363 (https://phabricator.wikimedia.org/T391452) (owner: Effie Mouzeli)
[13:56:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2098.codfw.wmnet with reason: host reimage
[13:56:24] TheresNoTime: partially. I verified that there are no errors on page loads on mwdebug1001, and the code in question is evaluated on every page load. I can't force an excimer profile to be generated.
[13:56:29] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[13:56:35] ack
[13:56:36] TheresNoTime: in other words, should be safe to sync everywhere.
[13:56:40] !log samtar@deploy1003 ori, samtar: Continuing with sync
[13:57:50] !log jiji@cumin1002 conftool action : set/pooled=inactive; selector: name=mwdebug1001.eqiad.wmnet
[13:58:05] (PS1) Muehlenhoff: Update insetup alias [puppet] - https://gerrit.wikimedia.org/r/1138364
[13:58:11] jouncebot: now
[13:58:11] For the next 0 hour(s) and 1 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1300)
[13:58:33] (deploying one last patch, shouldn't be too long)
[13:58:36] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2099 - bking@cumin2002"
[13:58:42] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2099 - bking@cumin2002"
[13:58:42] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:58:43] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2099.codfw.wmnet 218.32.192.10.in-addr.arpa 8.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:58:46] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2099.codfw.wmnet 218.32.192.10.in-addr.arpa 8.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[13:58:47] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2099
[13:59:01] (PS2) Jforrester: wikifunctions: Update orchestrator from 2025-04-16-192052 to 2025-04-23-134615 [deployment-charts] - https://gerrit.wikimedia.org/r/1138362 (https://phabricator.wikimedia.org/T391731)
[13:59:12] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2099
[13:59:12] !log bking@cumin2002 END
(PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2099
[14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1400)
[14:00:30] ori: sorry for the driveby comment, just point out that mwdebug1001 is on php7.4, while the rest are on php8.1. I will be reimaging 1001 after TheresNoTime is done
[14:01:21] (CR) Muehlenhoff: [C:+2] Update insetup alias [puppet] - https://gerrit.wikimedia.org/r/1138364 (owner: Muehlenhoff)
[14:01:30] effie: ack. Profiles are coming in now, so we're good.
[14:02:03] TheresNoTime: I'll do our scheduled service deploy but it shouldn't have any effect on the MW side, if that's OK?
[14:02:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T391056)', diff saved to https://phabricator.wikimedia.org/P75312 and previous config saved to /var/cache/conftool/dbconfig/20250423-140221-fceratto.json
[14:02:26] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:02:59] James_F: ack, go ahead, deployment is pretty much done :D
[14:03:03] Cool.
[14:03:21] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update evaluators from 2025-04-09-214434 to 2025-04-16-213143 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138361 (https://phabricator.wikimedia.org/T391731) (owner: 10Jforrester) [14:03:30] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1118208|Remove temporary '-php8' and '-k8s' suffixes from ArcLamp pipeline (T391516)]] (duration: 13m 36s) [14:03:34] T391516: https://performance.wikimedia.org/php-profiling/ leads to 404 for all listed sources - https://phabricator.wikimedia.org/T391516 [14:03:57] ori: done ^ [14:04:12] (03PS4) 10Ori: Turn down the PHP8- and Kubernetes-specific ArcLamp listeners [puppet] - 10https://gerrit.wikimedia.org/r/1118209 (https://phabricator.wikimedia.org/T391516) [14:04:34] effie: fyi deployments all done :) [14:04:49] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-04-09-214434 to 2025-04-16-213143 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138361 (https://phabricator.wikimedia.org/T391731) (owner: 10Jforrester) [14:04:49] TheresNoTime: thank you! [14:04:57] cheers TheresNoTime ! [14:04:58] tx [14:05:33] (03CR) 10Ori: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118209 (https://phabricator.wikimedia.org/T391516) (owner: 10Ori) [14:05:43] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:05:52] (03CR) 10Ori: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118209 (https://phabricator.wikimedia.org/T391516) (owner: 10Ori) [14:06:15] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:06:29] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:07:01] TheresNoTime: so how was spiderpig? 
👀 [14:07:04] (03PS1) 10Jforrester: API: Don't try to read fetchAllZLanguageCodes() in client-mode Action APIs either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1138368 (https://phabricator.wikimedia.org/T392014) [14:07:06] (I might try it out tomorrow…) [14:07:09] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:07:13] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:07:53] Lucas_WMDE: you should! It's very good :D I wrote up some quick notes at https://wikitech.wikimedia.org/wiki/User:TheresNoTime/SpiderPig but tl;dr will make deployments even easier by far [14:07:56] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:08:08] (03CR) 10Jforrester: [C:03+2] wikifunctions: Update orchestrator from 2025-04-16-192052 to 2025-04-23-134615 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138362 (https://phabricator.wikimedia.org/T391731) (owner: 10Jforrester) [14:08:41] FIRING: [4x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:40] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-04-16-192052 to 2025-04-23-134615 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138362 (https://phabricator.wikimedia.org/T391731) (owner: 10Jforrester) [14:10:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1032', diff saved to https://phabricator.wikimedia.org/P75313 and previous config saved to /var/cache/conftool/dbconfig/20250423-141000-marostegui.json [14:10:19] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:10:47] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:11:12] !log 
jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:11:31] (03CR) 10Ori: [C:03+2] Turn down the PHP8- and Kubernetes-specific ArcLamp listeners [puppet] - 10https://gerrit.wikimedia.org/r/1118209 (https://phabricator.wikimedia.org/T391516) (owner: 10Ori) [14:11:34] (03PS1) 10Marostegui: es1032: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1138369 (https://phabricator.wikimedia.org/T391921) [14:12:02] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:12:05] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:12:38] (03PS1) 10Majavah: hieradata: Enable wmcs_nova_fixed_ptr in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1138371 (https://phabricator.wikimedia.org/T380174) [14:12:53] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:13:50] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1137813 (https://phabricator.wikimedia.org/T392370) (owner: 10Jforrester) [14:13:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1138368 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [14:14:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1032.eqiad.wmnet with reason: Maintenance [14:14:22] (03CR) 10Marostegui: [C:03+2] es1032: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1138369 (https://phabricator.wikimedia.org/T391921) (owner: 10Marostegui) [14:14:28] (03PS4) 10Dreamy Jazz: Remove wgCheckUserCentralIndexRangesToExclude definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134203 (https://phabricator.wikimedia.org/T389055) [14:14:30] !log 
jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mwdebug1001.eqiad.wmnet with OS bullseye [14:15:11] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1138309 (https://phabricator.wikimedia.org/T367725) (owner: 10Majavah) [14:15:24] (03Merged) 10jenkins-bot: ZString: Don't explode if we're handed an array with odd contents [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1137813 (https://phabricator.wikimedia.org/T392370) (owner: 10Jforrester) [14:15:52] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2099.codfw.wmnet with reason: host reimage [14:16:17] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1138371 (https://phabricator.wikimedia.org/T380174) (owner: 10Majavah) [14:16:28] (03CR) 10Majavah: [C:03+2] hieradata: Enable wmcs_nova_fixed_ptr in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1138371 (https://phabricator.wikimedia.org/T380174) (owner: 10Majavah) [14:17:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P75314 and previous config saved to /var/cache/conftool/dbconfig/20250423-141728-fceratto.json [14:18:04] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: cumin: Cleanup cumin master code [puppet] - 10https://gerrit.wikimedia.org/r/1138309 (https://phabricator.wikimedia.org/T367725) (owner: 10Majavah) [14:18:42] (03Merged) 10jenkins-bot: API: Don't try to read fetchAllZLanguageCodes() in client-mode Action APIs either [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1138368 (https://phabricator.wikimedia.org/T392014) (owner: 10Jforrester) [14:18:58] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1137813|ZString: Don't explode if we're handed an array with odd contents (T392370)]], [[gerrit:1138368|API: Don't try to read fetchAllZLanguageCodes() in client-mode 
Action APIs either (T392014)]] [14:19:04] T392370: PHP Warning: Undefined array key 0 - https://phabricator.wikimedia.org/T392370 [14:19:04] T392014: Error related to initiatlizing RESTAPI/FetchHandler.php - https://phabricator.wikimedia.org/T392014 [14:19:15] (03PS1) 10Majavah: openstack: designate: Remove nova_fixed_multi code [puppet] - 10https://gerrit.wikimedia.org/r/1138373 (https://phabricator.wikimedia.org/T378192) [14:19:24] 10ops-codfw, 06DC-Ops, 10SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10760954 (10lmata) [14:19:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2099.codfw.wmnet with reason: host reimage [14:20:33] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1138373 (https://phabricator.wikimedia.org/T378192) (owner: 10Majavah) [14:20:41] (03PS2) 10Andrew Bogott: openstack: designate: Remove nova_fixed_multi code [puppet] - 10https://gerrit.wikimedia.org/r/1138373 (https://phabricator.wikimedia.org/T378192) (owner: 10Majavah) [14:20:43] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138373 (https://phabricator.wikimedia.org/T378192) (owner: 10Majavah) [14:23:29] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1137813|ZString: Don't explode if we're handed an array with odd contents (T392370)]], [[gerrit:1138368|API: Don't try to read fetchAllZLanguageCodes() in client-mode Action APIs either (T392014)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:23:34] (03CR) 10Andrew Bogott: [C:03+1] openstack: designate: Remove nova_fixed_multi code [puppet] - 10https://gerrit.wikimedia.org/r/1138373 (https://phabricator.wikimedia.org/T378192) (owner: 10Majavah) [14:23:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P75315 and previous config saved to /var/cache/conftool/dbconfig/20250423-142350-root.json [14:23:51] !log jforrester@deploy1003 jforrester: Continuing with sync [14:24:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2098.codfw.wmnet with OS bullseye [14:25:26] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:26:17] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:13] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:26] (03CR) 10CDanis: [C:03+1] "thanks Ori, I'm just back from sabbatical today :)" [puppet] - 10https://gerrit.wikimedia.org/r/1118209 (https://phabricator.wikimedia.org/T391516) (owner: 10Ori) [14:28:19] (03CR) 10David Caro: "Manually tested in toolsbeta worker-nfs-9, to see the fields populated by the logs you have to filter like:" [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [14:28:51] (03PS1) 10Muehlenhoff: Make krb1002 a KDC [puppet] - 10https://gerrit.wikimedia.org/r/1138377 (https://phabricator.wikimedia.org/T390863) [14:30:28] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137813|ZString: Don't explode if we're handed an array with odd contents (T392370)]], [[gerrit:1138368|API: Don't try to read fetchAllZLanguageCodes() in client-mode Action APIs either (T392014)]] (duration: 11m 29s) [14:30:33] T392370: PHP Warning: Undefined array key 0 - https://phabricator.wikimedia.org/T392370 
[14:30:33] T392014: Error related to initiatlizing RESTAPI/FetchHandler.php - https://phabricator.wikimedia.org/T392014 [14:32:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P75316 and previous config saved to /var/cache/conftool/dbconfig/20250423-143235-fceratto.json [14:32:57] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2101 to cirrussearch2101 [14:33:20] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:33:35] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mwdebug1001.eqiad.wmnet with reason: host reimage [14:33:51] jouncebot: nowandnext [14:33:51] For the next 0 hour(s) and 26 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1400) [14:33:51] In 2 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1700) [14:34:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10761002 (10elukey) @Jclark-ctr hi! When you have a moment let's do the fake hot swap test, it should be sufficient to just pull any of... [14:34:39] 10ops-codfw, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10761004 (10Jhancock.wm) tried reseating the cable at both ends and it didn't ping. tried replacing the cable and no pings. @MoritzMuehlenhoff Are you o... [14:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:46] James_F : Are you done with your deploys? 
If so I'd like to make a config change deployment [14:35:56] Dreamy_Jazz: Please go ahead, we're done, yes. [14:36:02] Thanks! [14:36:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134203 (https://phabricator.wikimedia.org/T389055) (owner: 10Dreamy Jazz) [14:37:07] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwdebug1001.eqiad.wmnet with reason: host reimage [14:37:40] (03Merged) 10jenkins-bot: Remove wgCheckUserCentralIndexRangesToExclude definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134203 (https://phabricator.wikimedia.org/T389055) (owner: 10Dreamy Jazz) [14:37:52] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1134203|Remove wgCheckUserCentralIndexRangesToExclude definition (T389055)]] [14:37:56] T389055: Special:GlobalContributions: Display edits made by bot accounts - https://phabricator.wikimedia.org/T389055 [14:38:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75317 and previous config saved to /var/cache/conftool/dbconfig/20250423-143856-root.json [14:38:59] bking@cumin2002 rename (PID 2014040) is awaiting input [14:40:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2099.codfw.wmnet with OS bullseye [14:42:13] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1134203|Remove wgCheckUserCentralIndexRangesToExclude definition (T389055)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:42:21] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [14:43:46] (03CR) 10Elukey: [C:03+1] "LGTM, just to be sure, you want profile::kerberos::kadminserver::enable_replication to be enabled right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1138377 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [14:47:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T391056)', diff saved to https://phabricator.wikimedia.org/P75318 and previous config saved to /var/cache/conftool/dbconfig/20250423-144741-fceratto.json [14:47:46] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:47:53] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2101 to cirrussearch2101 - bking@cumin2002" [14:47:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1196.eqiad.wmnet with reason: Maintenance [14:48:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:48:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T391056)', diff saved to https://phabricator.wikimedia.org/P75319 and previous config saved to /var/cache/conftool/dbconfig/20250423-144811-fceratto.json [14:48:53] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134203|Remove wgCheckUserCentralIndexRangesToExclude definition (T389055)]] (duration: 11m 00s) [14:48:57] T389055: Special:GlobalContributions: Display edits made by bot accounts - https://phabricator.wikimedia.org/T389055 [14:49:11] I'm done with my deploys. 
[14:50:58] bking@cumin2002 rename (PID 2014040) is awaiting input [14:51:24] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2101 to cirrussearch2101 - bking@cumin2002" [14:51:24] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:51:25] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2101 [14:52:08] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2101 [14:52:41] ops-eqiad, Data-Persistence, DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492 (RobH) NEW [14:52:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2101 to cirrussearch2101 [14:53:06] ops-eqiad, Data-Persistence, DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10761094 (RobH) [14:53:39] ops-eqiad, Data-Persistence, DC-Ops: Q4:rack/setup/install pc1018 - https://phabricator.wikimedia.org/T392492#10761096 (RobH) a:Marostegui Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. T...
[14:54:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75320 and previous config saved to /var/cache/conftool/dbconfig/20250423-145401-root.json [14:54:23] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2101.codfw.wmnet on all recursors [14:54:26] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2101.codfw.wmnet on all recursors [14:54:31] ops-codfw, Data-Persistence, DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493 (RobH) NEW [14:54:41] (CR) BCornwall: [C:+2] acmechief: Add pywikipedia.org to the cert list [puppet] - https://gerrit.wikimedia.org/r/1137481 (https://phabricator.wikimedia.org/T388809) (owner: BCornwall) [14:54:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2101.codfw.wmnet with OS bullseye [14:54:56] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2101 [14:55:02] !log bking@cumin2002 START - Cookbook sre.dns.netbox [14:55:05] ops-codfw, Data-Persistence, DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10761120 (RobH) a:Marostegui Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. T...
[14:55:18] ops-codfw, Data-Persistence, DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10761125 (RobH) [14:55:54] (CR) Jforrester: "check experimental" [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1111944 (https://phabricator.wikimedia.org/T383597) (owner: Hashar) [15:00:06] bking@cumin2002 rename (PID 2038479) is awaiting input [15:00:54] bking@cumin2002 reimage (PID 2035727) is awaiting input [15:04:57] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:06:22] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1138389 [15:07:01] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:07:05] (PS1) Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner to 17.9 [puppet] - https://gerrit.wikimedia.org/r/1138390 (https://phabricator.wikimedia.org/T392495) [15:07:36] (CR) Muehlenhoff: "It's not fully enabled; profile::kerberos::replication reads the kerberos_kdc_servers Hiera variable and initially krb1002 isn't in there" [puppet] - https://gerrit.wikimedia.org/r/1138377 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff) [15:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T391056)', diff saved to https://phabricator.wikimedia.org/P75321 and previous config saved to /var/cache/conftool/dbconfig/20250423-150839-fceratto.json [15:08:45] T391056: Drop afl_patrolled_by from abuse_filter_log in
production - https://phabricator.wikimedia.org/T391056 [15:09:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75322 and previous config saved to /var/cache/conftool/dbconfig/20250423-150907-root.json [15:09:37] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwdebug1001.eqiad.wmnet with OS bullseye [15:12:01] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:12:47] I'd like to dry-run a script for Flow board migrations on gomwiki – any objections? [15:12:57] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:13:49] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:14:06] (03PS1) 10Gerrit maintenance bot: Add rki to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1138392 (https://phabricator.wikimedia.org/T392490) [15:15:10] FIRING: BFDdown: BFD session down between cr1-eqiad and fe80::ee38:7300:ce8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:17:55] RECOVERY - Restbase root url on restbase1041 is OK: HTTP OK: HTTP/1.1 200 - 18480 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/RESTBase [15:19:47] (03PS1) 10Muehlenhoff: Add a Cumin alias to select UEFI-enabled servers [puppet] - 10https://gerrit.wikimedia.org/r/1138397 (https://phabricator.wikimedia.org/T389217) [15:20:01] (03PS2) 10Muehlenhoff: Add a Cumin alias to select UEFI-enabled servers [puppet] - 
10https://gerrit.wikimedia.org/r/1138397 (https://phabricator.wikimedia.org/T389217) [15:20:10] RESOLVED: BFDdown: BFD session down between cr1-eqiad and fe80::ee38:7300:ce8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:22:45] 10ops-codfw, 06SRE, 06DC-Ops, 10SRE Observability (FY2024/2025-Q4): kafka-logging2005 is down since six days - https://phabricator.wikimedia.org/T392488#10761337 (10herron) Hi @Jhancock.wm I'll be helping out with this as a kafka-logging service owner, yes please proceed! [15:22:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 116246928 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:23:17] (03CR) 10Jelto: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner to 17.9 [puppet] - 10https://gerrit.wikimedia.org/r/1138390 (https://phabricator.wikimedia.org/T392495) (owner: 10Jelto) [15:23:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P75323 and previous config saved to /var/cache/conftool/dbconfig/20250423-152347-fceratto.json [15:23:49] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2102 to cirrussearch2102 [15:23:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 136512 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:24:12] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:24:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75324 and previous config saved to /var/cache/conftool/dbconfig/20250423-152412-root.json [15:25:04] 
!log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2101 - bking@cumin2002" [15:25:39] 10ops-codfw, 06Data-Persistence, 06DBA, 06DC-Ops: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10761371 (10Marostegui) [15:25:49] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2101 - bking@cumin2002" [15:25:50] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:50] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2101.codfw.wmnet 220.32.192.10.in-addr.arpa 0.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:25:53] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2101.codfw.wmnet 220.32.192.10.in-addr.arpa 0.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:25:54] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2101 [15:26:31] (03CR) 10Elukey: [C:03+1] "Yes yes I meant if it was ok to have everything deployed/installed as it was live, I think it is fine but I wanted to check with you. 
Plea" [puppet] - 10https://gerrit.wikimedia.org/r/1138377 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [15:26:37] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2101 [15:26:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2101 [15:28:32] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2102 to cirrussearch2102 - bking@cumin2002" [15:28:38] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2102 to cirrussearch2102 - bking@cumin2002" [15:28:38] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:39] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2102 [15:28:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [15:29:16] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2102 [15:29:50] (03PS1) 10Marostegui: mariadb: Add db1258 insetup [puppet] - 10https://gerrit.wikimedia.org/r/1138398 (https://phabricator.wikimedia.org/T392493) [15:29:56] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2102 to cirrussearch2102 [15:30:35] (03CR) 10Marostegui: [C:03+2] mariadb: Add db1258 insetup [puppet] - 10https://gerrit.wikimedia.org/r/1138398 (https://phabricator.wikimedia.org/T392493) (owner: 10Marostegui) [15:30:40] aight i've given it 15 minutes, going 
ahead... [15:32:04] jouncebot: nowandnext [15:32:04] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [15:32:04] In 1 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1700) [15:32:23] oooh, handy [15:33:20] (PS1) Marostegui: installserver: Add db1258 [puppet] - https://gerrit.wikimedia.org/r/1138399 (https://phabricator.wikimedia.org/T392493) [15:33:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [15:35:30] (CR) Marostegui: [C:+2] installserver: Add db1258 [puppet] - https://gerrit.wikimedia.org/r/1138399 (https://phabricator.wikimedia.org/T392493) (owner: Marostegui) [15:36:39] ops-codfw, SRE, Data-Persistence, DBA, and 2 others: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10761443 (Marostegui) Patches merged.
[15:36:47] (PS1) Bking: elasticsearch: filter LVS config based on cluster membership [puppet] - https://gerrit.wikimedia.org/r/1138400 (https://phabricator.wikimedia.org/T387569) [15:37:02] ops-codfw, SRE, Data-Persistence, DBA, and 2 others: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10761446 (Marostegui) [15:38:10] ops-codfw, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q4:rack/setup/install db1258 - https://phabricator.wikimedia.org/T392493#10761453 (Marostegui) [15:38:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P75326 and previous config saved to /var/cache/conftool/dbconfig/20250423-153854-fceratto.json [15:39:08] (CR) Ladsgroup: [C:+2] Add rki to langlist helper [dns] - https://gerrit.wikimedia.org/r/1138392 (https://phabricator.wikimedia.org/T392490) (owner: Gerrit maintenance bot) [15:39:15] !log ladsgroup@dns1004 START - running authdns-update [15:39:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75327 and previous config saved to /var/cache/conftool/dbconfig/20250423-153918-root.json [15:41:22] (PS3) Samtar: InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) [15:41:47] !log ladsgroup@dns1004 END - running authdns-update [15:42:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:43:21] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2101.codfw.wmnet with reason: host reimage [15:43:29] !log bking@cumin2002 START -
Cookbook sre.hosts.reimage for host cirrussearch2102.codfw.wmnet with OS bullseye [15:43:41] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2102 [15:44:00] !log bking@cumin2002 START - Cookbook sre.dns.netbox [15:45:54] right then, I'd like to run this script again less dry [15:47:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2101.codfw.wmnet with reason: host reimage [15:48:04] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2102 - bking@cumin2002" [15:48:10] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2102 - bking@cumin2002" [15:48:10] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:10] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2102.codfw.wmnet 221.32.192.10.in-addr.arpa 1.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:48:14] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2102.codfw.wmnet 221.32.192.10.in-addr.arpa 1.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:48:15] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2102 [15:50:06] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2102 [15:50:06] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2102 [15:50:51] (03CR) 10Hnowlan: [C:03+2] mw::maintenance::growthexperiments: migrate updateMetrics job to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135916 (https://phabricator.wikimedia.org/T385782) (owner: 
10Hnowlan) [15:52:10] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [15:53:41] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:54:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T391056)', diff saved to https://phabricator.wikimedia.org/P75331 and previous config saved to /var/cache/conftool/dbconfig/20250423-155401-fceratto.json [15:54:06] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:54:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1206.eqiad.wmnet with reason: Maintenance [15:54:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T391056)', diff saved to https://phabricator.wikimedia.org/P75332 and previous config saved to /var/cache/conftool/dbconfig/20250423-155423-fceratto.json [15:54:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75333 and previous config saved to /var/cache/conftool/dbconfig/20250423-155423-root.json [15:58:30] !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [15:58:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:10] !log dancy@deploy1003 Installing scap version "4.154.0" for 2 host(s) [15:59:30] (03PS1) 10Majavah: P:toolforge: 
legacy_redirector: Add dhparam [puppet] - 10https://gerrit.wikimedia.org/r/1138402 [16:00:09] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [16:00:24] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [16:00:32] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [16:00:35] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [16:00:51] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [16:00:57] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1138402 (owner: 10Majavah) [16:01:04] (03PS1) 10Dreamy Jazz: Enable temporary-account-viewer group on all WMF production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138403 (https://phabricator.wikimedia.org/T390942) [16:01:10] jouncebot: nowandnext [16:01:10] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [16:01:10] In 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1700) [16:01:15] !log dancy@deploy1003 Installation of scap version "4.154.0" completed for 2 hosts [16:01:29] dancy: Mind if I deploy now? [16:01:35] Go for it. I'm done. 
[16:01:38] Thanks [16:02:17] (03CR) 10Majavah: [V:03+1 C:03+2] P:toolforge: legacy_redirector: Add dhparam [puppet] - 10https://gerrit.wikimedia.org/r/1138402 (owner: 10Majavah) [16:04:26] (03PS2) 10Dreamy Jazz: Enable temporary-account-viewer group on all WMF production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138403 (https://phabricator.wikimedia.org/T390942) [16:04:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138403 (https://phabricator.wikimedia.org/T390942) (owner: 10Dreamy Jazz) [16:04:36] !log restarting pybal on lvs201[34] [16:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:43] (03PS3) 10Dreamy Jazz: Enable temporary-account-viewer group on all WMF production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138403 (https://phabricator.wikimedia.org/T390942) [16:05:55] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138403 (https://phabricator.wikimedia.org/T390942) (owner: 10Dreamy Jazz) [16:06:46] (03Merged) 10jenkins-bot: Enable temporary-account-viewer group on all WMF production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138403 (https://phabricator.wikimedia.org/T390942) (owner: 10Dreamy Jazz) [16:06:57] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2102.codfw.wmnet with reason: host reimage [16:06:59] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1138403|Enable temporary-account-viewer group on all WMF production wikis (T390942 T387205)]] [16:07:04] T390942: Enable IP viewer temporary account group on all projects - https://phabricator.wikimedia.org/T390942 [16:07:05] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - 
https://phabricator.wikimedia.org/T387205 [16:08:08] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:08:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [16:09:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75334 and previous config saved to /var/cache/conftool/dbconfig/20250423-160928-root.json [16:09:43] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [16:09:49] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2102.codfw.wmnet with reason: host reimage [16:10:29] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10761653 (10VRiley-WMF) I got an update from Dell, and they said they would be replacing some of the parts on this. They have listed the Mainboard, Cables and power supplies as being possible issues and lo... 
[16:10:41] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1138403|Enable temporary-account-viewer group on all WMF production wikis (T390942 T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:10:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2101.codfw.wmnet with OS bullseye [16:10:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T391056)', diff saved to https://phabricator.wikimedia.org/P75335 and previous config saved to /var/cache/conftool/dbconfig/20250423-161051-fceratto.json [16:10:55] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:12:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10761663 (10VRiley-WMF) Hey, I just wanted to check in with this ticket @akosiaris Are these ready to be decomissioned? Wanted to make sure because not everythin... [16:13:28] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [16:13:55] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10761669 (10VRiley-WMF) Okay, thanks. These servers were having trouble imagine and I was trying to look into if they have been added into the preseed. [16:14:10] Deployment of the above failed. [16:14:51] The first error is "The connection to the server kubemaster.svc.codfw.wmnet:6442 was refused - did you specify the right host or port?" [16:15:59] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 26347888 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:16:08] (03CR) 10Brouberol: [C:03+2] Turn off Gobblin test jobs (all at once). 
[puppet] - 10https://gerrit.wikimedia.org/r/1137067 (https://phabricator.wikimedia.org/T390249) (owner: 10Aleksandar Mastilovic) [16:16:15] Dreamy_Jazz: I recommend retrying [16:16:21] Sure. I will do that now. [16:16:38] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1138403|Enable temporary-account-viewer group on all WMF production wikis (T390942 T387205)]] [16:16:43] T390942: Enable IP viewer temporary account group on all projects - https://phabricator.wikimedia.org/T390942 [16:16:44] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [16:16:52] The other errors appeared to be the canary not having the correct k8s config file [16:16:59] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:17:01] (03PS1) 10Daimona Eaytoy: Enable the CampaignEvents extension on 43 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) [16:17:09] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:17:13] Which I guess would be expected if there was a temporary connection issue [16:17:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10761747 (10VRiley-WMF) [16:17:32] (03PS1) 10Bking: cirrussearch: add more newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1138406 (https://phabricator.wikimedia.org/T388610) [16:17:49] inflatador: ^^ those pybal alerts triggered on Monday after some maintenance work you did on a search server [16:17:57] (03CR) 10CI reject: [V:04-1] Enable the CampaignEvents extension on 43 more wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) (owner: 10Daimona Eaytoy) [16:18:05] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10761749 (10Marostegui) Thank you!! I hope this will fix the issue for good! [16:18:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade Replica to GitLab 17.9 [16:18:36] vgutierrez checking it out now. My guess is that conftool and the list of actual hosts are out of whack. Hoping https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138406 will fix this [16:19:07] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [16:19:26] inflatador: I've cleaned up those alerts with a pybal restart FYI [16:20:26] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:20:28] vgutierrez ACK, if there is a way to be less disruptive LMK. I guess I could/should remove the hosts from conftool before reimaging? [16:21:07] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1138403|Enable temporary-account-viewer group on all WMF production wikis (T390942 T387205)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:21:13] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [16:21:42] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:22:29] Looks better this time round (has proceeded to deploy to the canaries), so looks like a temporary connection issue. 
[16:22:58] (03PS2) 10Daimona Eaytoy: Enable the CampaignEvents extension on 43 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) [16:23:22] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1184 [16:23:31] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1184 [16:23:58] (03CR) 10CI reject: [V:04-1] Enable the CampaignEvents extension on 43 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) (owner: 10Daimona Eaytoy) [16:24:05] inflatador: yep.. especially if those hosts won't ever come back up again (cause those are being reimaged) [16:24:11] s/reimaged/renamed/ [16:24:26] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:24:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75336 and previous config saved to /var/cache/conftool/dbconfig/20250423-162434-root.json [16:24:53] (03PS1) 10C. Scott Ananian: Turn on ParsoidFragmentInput; remove unneeded ParsoidFragmentSupport config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138408 (https://phabricator.wikimedia.org/T268144) [16:25:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:25:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 23 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138408 (https://phabricator.wikimedia.org/T268144) (owner: 10C. 
Scott Ananian) [16:25:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P75337 and previous config saved to /var/cache/conftool/dbconfig/20250423-162558-fceratto.json [16:27:31] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:27:49] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138403|Enable temporary-account-viewer group on all WMF production wikis (T390942 T387205)]] (duration: 11m 11s) [16:27:54] T390942: Enable IP viewer temporary account group on all projects - https://phabricator.wikimedia.org/T390942 [16:27:54] T387205: IP reveal groups: Rename 'checkuser-temporary-account-viewer' to not include the phrase 'checkuser' - https://phabricator.wikimedia.org/T387205 [16:28:17] (03CR) 10JHathaway: [C:03+1] "just a couple of suggestions, otherwise looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1137055 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:28:19] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:28:33] Done with my deploys [16:28:41] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:30:30] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2102.codfw.wmnet with OS bullseye [16:31:33] (03PS2) 10Bking: cirrussearch: add more newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1138406 (https://phabricator.wikimedia.org/T388610) [16:32:56] vriley@cumin1002 provision (PID 2057838) is awaiting input [16:34:25] FIRING: [8x] SystemdUnitFailed: 
dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:35:58] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::google_api_proxy: Update network ACLs [puppet] - 10https://gerrit.wikimedia.org/r/1138351 (owner: 10Majavah) [16:37:27] (03CR) 10JHathaway: [C:03+1] Add a Cumin alias to select UEFI-enabled servers [puppet] - 10https://gerrit.wikimedia.org/r/1138397 (https://phabricator.wikimedia.org/T389217) (owner: 10Muehlenhoff) [16:38:21] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:39:19] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:41:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P75338 and previous config saved to /var/cache/conftool/dbconfig/20250423-164105-fceratto.json [16:42:41] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2113 to cirrussearch2113 [16:43:03] !log bking@cumin2002 START - Cookbook sre.dns.netbox [16:43:36] (03PS3) 10Bking: cirrussearch: add more newly-reimaged hosts to conftool [puppet] - 10https://gerrit.wikimedia.org/r/1138406 (https://phabricator.wikimedia.org/T388610) [16:46:22] (03PS1) 10Majavah: P:toolforge::mailrelay: Pull WMCS IP space from network module [puppet] - 10https://gerrit.wikimedia.org/r/1138411 [16:47:20] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2113 to cirrussearch2113 - bking@cumin2002" [16:47:55] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5337/co" [puppet] - 10https://gerrit.wikimedia.org/r/1138411 (owner: 10Majavah) [16:49:44] (03CR) 10Subramanya Sastry: [C:03+1] Turn on ParsoidFragmentInput; remove unneeded ParsoidFragmentSupport config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138408 (https://phabricator.wikimedia.org/T268144) (owner: 10C. Scott Ananian) [16:50:28] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2113 to cirrussearch2113 - bking@cumin2002" [16:50:28] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:50:29] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2113 [16:50:44] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2113 [16:51:14] (03PS3) 10Daimona Eaytoy: Enable the CampaignEvents extension on 43 more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) [16:51:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2113 to cirrussearch2113 [16:52:12] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2113.codfw.wmnet with OS bullseye [16:52:16] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2113 [16:52:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2113 [16:56:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T391056)', diff saved to https://phabricator.wikimedia.org/P75340 and previous config saved to /var/cache/conftool/dbconfig/20250423-165611-fceratto.json [16:56:16] T391056: Drop afl_patrolled_by from abuse_filter_log in production - 
https://phabricator.wikimedia.org/T391056 [16:56:28] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1207.eqiad.wmnet with reason: Maintenance [16:56:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T391056)', diff saved to https://phabricator.wikimedia.org/P75341 and previous config saved to /var/cache/conftool/dbconfig/20250423-165634-fceratto.json [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T1700) [17:03:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:04:52] !log Remove varnish libvmod-re2 libvmod-netmapper libvmod-querysort libvarnishapi2 varnish-modules varnishkafka from bookworm-wikimedia [17:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138405 (https://phabricator.wikimedia.org/T392240) (owner: 10Daimona Eaytoy) [17:14:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T391056)', diff saved to https://phabricator.wikimedia.org/P75342 and previous config saved to /var/cache/conftool/dbconfig/20250423-171404-fceratto.json [17:14:09] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:20:41] PROBLEM - ganeti-noded running on ganeti1037 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [17:21:41] RECOVERY 
- ganeti-noded running on ganeti1037 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [17:21:45] !log Remove libvarnishapi-dev from bookworm-wikimedia [17:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P75343 and previous config saved to /var/cache/conftool/dbconfig/20250423-172912-fceratto.json [17:32:58] (03CR) 10Bking: [C:03+2] "self-merging to avoid conftool errors" [puppet] - 10https://gerrit.wikimedia.org/r/1138406 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:38:28] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:39:10] !log bking@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch2071.codfw.wmnet|cirrussearch2098.codfw.wmnet|cirrussearch2099.codfw.wmnet|cirrussearch2101.codfw.wmnet|cirrussearch2102.codfw.wmnet|cirrussearch2113.codfw.wmnet [17:40:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:42:05] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:44:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P75344 and previous config saved to /var/cache/conftool/dbconfig/20250423-174419-fceratto.json [17:49:58] (03CR) 10Brouberol: [C:03+1] cirrussearch: prepare for eqiad 
migration [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:59:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T391056)', diff saved to https://phabricator.wikimedia.org/P75345 and previous config saved to /var/cache/conftool/dbconfig/20250423-175926-fceratto.json [17:59:31] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:59:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1218.eqiad.wmnet with reason: Maintenance [17:59:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T391056)', diff saved to https://phabricator.wikimedia.org/P75346 and previous config saved to /var/cache/conftool/dbconfig/20250423-175948-fceratto.json [18:00:43] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:02:43] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:04:00] !log import libvmod-wmfuniq 0.1.0 into bullseye-wikimedia (T392059) [18:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:04] T392059: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059 [18:09:03] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:09:36] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2113.codfw.wmnet with OS bullseye [18:09:59] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:14:03] (03CR) 10Dzahn: [C:03+1] gerrit: convert robots.txt to a flat file [puppet] - 10https://gerrit.wikimedia.org/r/1138330 (owner: 10Hashar) [18:15:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T391056)', diff saved to https://phabricator.wikimedia.org/P75347 and previous config saved to /var/cache/conftool/dbconfig/20250423-181533-fceratto.json [18:15:38] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:20:49] (03CR) 10Dzahn: [C:03+1] gerrit/nftables_throttling: add tracking_duration parameter [puppet] - 10https://gerrit.wikimedia.org/r/1138308 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [18:22:43] (03CR) 10Dzahn: gerrit: switchover to gerrit1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [18:22:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [18:30:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P75348 and previous config saved to /var/cache/conftool/dbconfig/20250423-183040-fceratto.json [18:32:44] 
RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [18:32:59] (03PS1) 10Hashar: python3: add python3-venv to devel image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1138442 [18:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:08] (03PS1) 10Jforrester: Fix: PHP Warning: Undefined array key "request" [extensions/WikiLambda] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1138443 (https://phabricator.wikimedia.org/T392026) [18:42:13] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:42:39] FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Arelion (2001:2035:0:15b5::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Arelion - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:42:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:xe-0/1/1 (Transit: Arelion (IC-331928) {#1177}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [18:43:18] !log remove libvmod-wmfuniq-0.1.0 and wmfuniq-keygen-0.1.0 from bullseye-wikimedia (T392059) [18:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:22] T392059: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059 [18:43:55] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-e1-codfw.mgmt.codfw.wmnet [18:43:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-e1-codfw.mgmt.codfw.wmnet [18:45:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P75349 and previous config saved to /var/cache/conftool/dbconfig/20250423-184547-fceratto.json [18:46:32] (03CR) 10Ebernhardson: [C:03+1] deployment-prep: cleanup deployment-elastic values [puppet] - 10https://gerrit.wikimedia.org/r/1134151 (https://phabricator.wikimedia.org/T389971) (owner: 10DCausse) [18:47:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Arelion (2001:2035:0:15b5::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [18:50:41] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-e1-codfw.mgmt.codfw.wmnet [18:50:43] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:51:27] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release [18:51:49] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: security release [18:53:14] 06SRE-OnFire, 10Cassandra, 10Sustainability (Incident Followup): Alert when disk 
space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10762507 (10Eevans) p:05High→03Medium [18:54:01] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:54:09] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:55:39] (03PS1) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) [18:56:19] pt1979@cumin2002 provision (PID 2280826) is awaiting input [18:56:41] !log import libvmod-wmfuniq-0.1.0~deb11u1 and wmfuniq-keygen-0.1.0~deb11u1 into bullseye-wikimedia (T392059) [18:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:45] T392059: Provide debian packages for libvmod-wmfuniq - https://phabricator.wikimedia.org/T392059 [18:56:53] !log import libvmod-wmfuniq-0.1.0~deb12u1 and wmfuniq-keygen-0.1.0~deb12u1 into bookworm-wikimedia (T392059) [18:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:13] (03CR) 10Bking: [C:03+2] cirrussearch: prepare for eqiad migration [puppet] - 10https://gerrit.wikimedia.org/r/1137069 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:58:21] (03CR) 10Dzahn: [C:03+1] microsites: fix regex_matches for query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1138255 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [18:59:07] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-e1-codfw - pt1979@cumin2002" [18:59:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record 
for ssw1-e1-codfw - pt1979@cumin2002" [18:59:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:00:46] 06SRE-OnFire, 10Cassandra, 10Sustainability (Incident Followup): Alert when disk space utilization on sessionstore nodes is trending high - https://phabricator.wikimedia.org/T390630#10762528 (10Eevans) [19:00:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T391056)', diff saved to https://phabricator.wikimedia.org/P75350 and previous config saved to /var/cache/conftool/dbconfig/20250423-190054-fceratto.json [19:00:58] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:01:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1219.eqiad.wmnet with reason: Maintenance [19:01:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T391056)', diff saved to https://phabricator.wikimedia.org/P75351 and previous config saved to /var/cache/conftool/dbconfig/20250423-190116-fceratto.json [19:03:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:04:05] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:06:27] (03PS1) 10Bking: cirrussearch: fix typo in regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1138449 (https://phabricator.wikimedia.org/T388610) [19:07:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:xe-0/1/1 (Transit: Arelion (IC-331928) {#1177}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:08:10] (03CR) 10Bking: [C:03+2] "self-merging to unblock failed reimage" [puppet] - 10https://gerrit.wikimedia.org/r/1138449 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:08:18] (03CR) 10BCornwall: [C:03+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1137464 (owner: 10Ncmonitor) [19:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:09:16] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release [19:11:01] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:11:05] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:11:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:xe-0/1/1 (Transit: Arelion (IC-331928) {#1177}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:16:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - 
cr3-eqsin:xe-0/1/1 (Transit: Arelion (IC-331928) {#1177}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:17:45] (03CR) 10Andrew Bogott: invisible-unicorn: Return 404 if caller tries to access a nonexistent proxy (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1137803 (owner: 10Andrew Bogott) [19:18:05] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:18:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T391056)', diff saved to https://phabricator.wikimedia.org/P75352 and previous config saved to /var/cache/conftool/dbconfig/20250423-191825-fceratto.json [19:18:30] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:20:01] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:21:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:xe-0/1/1 (Transit: Arelion (IC-331928) {#1177}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:23:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [19:25:16] (03PS2) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) [19:28:05] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:28:41] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:28:56] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [19:29:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:xe-0/1/1 (Transit: Arelion (IC-331928) {#1177}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:30:05] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:20] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@7312379]: Release DAGs for T391283. [19:30:24] T391283: Create Airflow pipeline to produce wmf_content.mediawiki_content_current_v1 - https://phabricator.wikimedia.org/T391283 [19:31:14] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@7312379]: Release DAGs for T391283. (duration: 00m 54s) [19:33:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P75353 and previous config saved to /var/cache/conftool/dbconfig/20250423-193332-fceratto.json [19:33:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [19:34:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:xe-0/1/1 (Transit: Arelion (IC-331928) {#1177}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:35:57] (03CR) 10AOkoth: [C:03+2] miscweb: os-report: use puppetdb from external_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131952 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [19:36:17] (03PS3) 10Jelto: miscweb: os-report: use puppetdb from external_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131952 (https://phabricator.wikimedia.org/T350794) [19:36:33] (03PS3) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1138446 
(https://phabricator.wikimedia.org/T388610) [19:39:14] (03CR) 10AOkoth: [V:03+2 C:03+2] miscweb: os-report: use puppetdb from external_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131952 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [19:43:21] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1002 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [19:43:41] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:21] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief1002 is OK: PROCS OK: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [19:45:04] (03PS1) 10AOkoth: miscweb: change os-reports runtime owner [deployment-charts] - 10https://gerrit.wikimedia.org/r/1138459 (https://phabricator.wikimedia.org/T350794) [19:48:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P75354 and previous config saved to /var/cache/conftool/dbconfig/20250423-194839-fceratto.json [19:48:41] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:49:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2113-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [19:50:36] i'm excited to try spiderpig for the first time in my backport in ~10min [19:50:39] (03PS4) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) [19:51:24] generating my one time password using scap spiderpig-otp though I initially got "Password: XXXXXX (Expires in 1 seconds)" ... which didn't exactly give me much time to type it in. [19:53:41] FIRING: [6x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:49] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:58:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T2000) [20:00:05] cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:51] i'm here, and i'd like to try to use spiderpig to do this deploy myself [20:01:06] but it would be nice if there was someone with Experience around in case things go pear-shaped [20:01:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:01:13] I'm around. [20:01:41] cscott ^ [20:03:42] ok, i'm the only one in the window so I guess I'll get started? i'll ping @dancy if i run into problems with spiderpig. [20:03:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T391056)', diff saved to https://phabricator.wikimedia.org/P75355 and previous config saved to /var/cache/conftool/dbconfig/20250423-200346-fceratto.json [20:03:51] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:03:51] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1232.eqiad.wmnet with reason: Maintenance [20:03:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T391056)', diff saved to https://phabricator.wikimedia.org/P75356 and previous config saved to /var/cache/conftool/dbconfig/20250423-200358-fceratto.json [20:05:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138408 (https://phabricator.wikimedia.org/T268144) (owner: 10C. Scott Ananian) [20:07:04] (03Merged) 10jenkins-bot: Turn on ParsoidFragmentInput; remove unneeded ParsoidFragmentSupport config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138408 (https://phabricator.wikimedia.org/T268144) (owner: 10C. 
Scott Ananian) [20:07:17] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1138408|Turn on ParsoidFragmentInput; remove unneeded ParsoidFragmentSupport config (T268144)]] [20:07:21] T268144: Add setFunctionHook equivalent support to Parsoid Extension API - https://phabricator.wikimedia.org/T268144 [20:08:54] !log import libvmod-netmapper-1.10-1 into bullseye-wikimedia (T392533) [20:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:58] T392533: Update libvmod-netmapper to 1.10 - https://phabricator.wikimedia.org/T392533 [20:11:59] !log cscott@deploy1003 cscott: Backport for [[gerrit:1138408|Turn on ParsoidFragmentInput; remove unneeded ParsoidFragmentSupport config (T268144)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:14:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2113-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:15:55] !log cscott@deploy1003 cscott: Continuing with sync [20:16:12] tests look good, spiderpig looks good! [20:20:26] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:10] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@4a7644d]: Deploy hotfix for T391283. 
[20:21:15] T391283: Create Airflow pipeline to produce wmf_content.mediawiki_content_current_v1 - https://phabricator.wikimedia.org/T391283 [20:21:29] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:22:15] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@4a7644d]: Deploy hotfix for T391283. (duration: 01m 04s) [20:22:27] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:22:36] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138408|Turn on ParsoidFragmentInput; remove unneeded ParsoidFragmentSupport config (T268144)]] (duration: 15m 19s) [20:22:40] T268144: Add setFunctionHook equivalent support to Parsoid Extension API - https://phabricator.wikimedia.org/T268144 [20:22:45] all done! [20:23:01] that was very smooth, props to the spiderpig team [20:23:10] Thank you! 
[20:28:01] (03PS5) 10Bking: cirrussearch: Add new master-eligibles, YAML space change [puppet] - 10https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) [20:32:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Arelion (2001:2035:0:15b5::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:33:08] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2096'] [20:34:24] wikibugs: speak [20:34:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:36] (03PS1) 10Dzahn: gerrit: add nftables rule to allow Istanbul Hackathon hotel network [puppet] - 10https://gerrit.wikimedia.org/r/1138468 (https://phabricator.wikimedia.org/T382309) [20:34:43] (03CR) 10CI reject: [V:04-1] gerrit: add nftables rule to allow Istanbul Hackathon hotel network [puppet] - 10https://gerrit.wikimedia.org/r/1138468 (https://phabricator.wikimedia.org/T382309) (owner: 10Dzahn) [20:35:40] (03PS2) 10Dzahn: gerrit: add nftables rule to allow Istanbul Hackathon hotel network [puppet] - 10https://gerrit.wikimedia.org/r/1138468 (https://phabricator.wikimedia.org/T382309) [20:35:49] PROBLEM - Disk space on analytics1072 is CRITICAL: DISK CRITICAL - free space: / 2094 MB (3% inode=95%): /tmp 2094 MB (3% inode=95%): /var/tmp 2094 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=analytics1072&var-datasource=eqiad+prometheus/ops [20:38:45] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cirrussearch2096'] [20:40:52] 
pt1979@cumin2002 provision (PID 2280826) is awaiting input [20:41:32] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2096'] [20:42:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T391056)', diff saved to https://phabricator.wikimedia.org/P75357 and previous config saved to /var/cache/conftool/dbconfig/20250423-204236-fceratto.json [20:42:41] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:43:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:45:15] (03CR) 10BCornwall: [C:04-1] "Looks great, thanks! I'm concerned about the regex (maybe I'm just not getting it) and the file paths." 
[puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [20:45:21] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on aphlict2001.codfw.wmnet with reason: Bookworm Re-image [20:46:56] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cirrussearch2096'] [20:48:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:49:24] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2096'] [20:53:01] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:53:09] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:53:10] !log dancy@deploy1003 Installing scap version "4.155.0" for 186 host(s) [20:57:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P75358 and previous config saved to /var/cache/conftool/dbconfig/20250423-205743-fceratto.json [20:57:45] !log dancy@deploy1003 Installation of scap version "4.155.0" completed for 186 hosts [20:58:26] dancy: OK for me to do a quick config-only deploy? [20:58:37] Yep I'm done. [20:58:50] Cool. 
[20:59:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2096'] [21:00:45] (03PS2) 10Jforrester: [wikifunctionswiki] Enable Parsoid in wikitext articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137330 [21:00:45] (03PS2) 10Jforrester: tests: Add a Wikifunctions-related test suite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137333 [21:00:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137330 (owner: 10Jforrester) [21:00:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137333 (owner: 10Jforrester) [21:01:15] (03CR) 10Jforrester: [wikifunctionswiki] Enable Parsoid in wikitext articles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137330 (owner: 10Jforrester) [21:01:43] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cirrussearch2096'] [21:03:28] (03Merged) 10jenkins-bot: [wikifunctionswiki] Enable Parsoid in wikitext articles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137330 (owner: 10Jforrester) [21:03:31] (03Merged) 10jenkins-bot: tests: Add a Wikifunctions-related test suite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137333 (owner: 10Jforrester) [21:03:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [21:03:44] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1137330|[wikifunctionswiki] Enable Parsoid in wikitext articles]], [[gerrit:1137333|tests: Add a 
Wikifunctions-related test suite]] [21:08:25] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1137330|[wikifunctionswiki] Enable Parsoid in wikitext articles]], [[gerrit:1137333|tests: Add a Wikifunctions-related test suite]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:41] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [21:08:45] !log jforrester@deploy1003 jforrester: Continuing with sync [21:10:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cirrussearch2096'] [21:11:55] (03CR) 10BCornwall: [C:04-1] varnish: Add basic edge uniques handling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [21:12:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P75359 and previous config saved to /var/cache/conftool/dbconfig/20250423-211249-fceratto.json [21:15:17] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1137330|[wikifunctionswiki] Enable Parsoid in wikitext articles]], [[gerrit:1137333|tests: Add a Wikifunctions-related test suite]] (duration: 11m 33s) [21:20:54] (03Abandoned) 10Ryan Kemper: rolling-operation: (proof of concept) manually output commands [cookbooks] - 10https://gerrit.wikimedia.org/r/1137824 (owner: 10Ryan Kemper) [21:26:22] (03PS6) 10Bking: cirrussearch: Add new master-eligibles, YAML space change [puppet] - 10https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) [21:27:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1232 (T391056)', diff saved to https://phabricator.wikimedia.org/P75360 and previous config saved to /var/cache/conftool/dbconfig/20250423-212756-fceratto.json [21:27:57] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2096.codfw.wmnet with OS bullseye [21:28:05] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:28:06] 10ops-eqiad, 06SRE, 06DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10763085 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cirrussearch2096.codfw.w... [21:28:09] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2096 [21:28:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1234.eqiad.wmnet with reason: Maintenance [21:28:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T391056)', diff saved to https://phabricator.wikimedia.org/P75361 and previous config saved to /var/cache/conftool/dbconfig/20250423-212818-fceratto.json [21:28:21] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:32:17] PROBLEM - Hadoop NodeManager on an-worker1195 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:32:31] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2096 - bking@cumin2002" [21:32:37] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2096 - bking@cumin2002" 
[21:32:37] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:32:37] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2096.codfw.wmnet 233.16.192.10.in-addr.arpa 3.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:32:41] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2096.codfw.wmnet 233.16.192.10.in-addr.arpa 3.3.2.0.6.1.0.0.2.9.1.0.0.1.0.0.2.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [21:32:42] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2096 [21:32:53] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2096 [21:32:53] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2096 [21:34:26] (03PS7) 10Ryan Kemper: cirrussearch: Change whitespace from 4 to 2 [puppet] - 10https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:35:31] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:35:46] (CR) Bking: [C:+1] cirrussearch: Change whitespace from 4 to 2 [puppet] - https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [21:36:11] (PS3) Andrew Bogott: cloudlb2004-dev: replace cloudlb2001-dev [puppet] - https://gerrit.wikimedia.org/r/1079987 (https://phabricator.wikimedia.org/T377126) (owner: Arturo Borrero Gonzalez) [21:36:15] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [21:36:31] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:36:35] !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1028.eqiad.wmnet [21:37:13] (CR) Andrew Bogott: [C:+2] cloudlb2004-dev: replace cloudlb2001-dev [puppet] - https://gerrit.wikimedia.org/r/1079987 (https://phabricator.wikimedia.org/T377126) (owner: Arturo Borrero Gonzalez) [21:37:14] (CR) Andrew Bogott: [V:+2 C:+2] cloudlb2004-dev: replace cloudlb2001-dev [puppet] - https://gerrit.wikimedia.org/r/1079987 (https://phabricator.wikimedia.org/T377126) (owner: Arturo Borrero Gonzalez) [21:37:43] !log eevans@cumin1002 conftool action : set/weight=10; selector: cluster=restbase,dc=eqiad,name=restbase1043.eqiad.wmnet [21:38:09] (CR) Andrew Bogott: [C:+2] cloudlb2004-dev: replace cloudlb2001-dev [puppet] - https://gerrit.wikimedia.org/r/1079987 (https://phabricator.wikimedia.org/T377126) (owner: Arturo Borrero Gonzalez) [21:38:31] (CR) Ryan Kemper: [C:+2] cirrussearch: Change whitespace from 4 to 2 [puppet] - https://gerrit.wikimedia.org/r/1138446 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [21:40:41] FIRING: [6x] ConfdResourceFailed: confd resource
_srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:42:49] (PS1) Ryan Kemper: cirrus: migrate elastic2061->cirrussearch2061 [puppet] - https://gerrit.wikimedia.org/r/1138479 (https://phabricator.wikimedia.org/T388610) [21:43:43] (PS1) Eevans: restbase: add/remove new/old hosts to/from conftool [puppet] - https://gerrit.wikimedia.org/r/1138480 (https://phabricator.wikimedia.org/T389423) [21:46:20] (PS2) Ryan Kemper: cirrus: migrate elastic2061->cirrussearch2061 [puppet] - https://gerrit.wikimedia.org/r/1138479 (https://phabricator.wikimedia.org/T388610) [21:47:08] (CR) Bking: [C:+1] cirrus: migrate elastic2061->cirrussearch2061 [puppet] - https://gerrit.wikimedia.org/r/1138479 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper) [21:48:11] (CR) Ryan Kemper: [C:+2] cirrus: migrate elastic2061->cirrussearch2061 [puppet] - https://gerrit.wikimedia.org/r/1138479 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper) [21:48:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T391056)', diff saved to https://phabricator.wikimedia.org/P75362 and previous config saved to /var/cache/conftool/dbconfig/20250423-214814-fceratto.json [21:48:19] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:48:31] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:48:59] (PS1) Andrew Bogott: acme_chief: replace cloudlb2001-dev with cloudlb2004-dev [puppet] - https://gerrit.wikimedia.org/r/1138481 (https://phabricator.wikimedia.org/T377126) [21:49:56] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2061 to cirrussearch2061 [21:50:06] (CR) Andrew Bogott: [C:+2] acme_chief: replace cloudlb2001-dev with cloudlb2004-dev [puppet] - https://gerrit.wikimedia.org/r/1138481 (https://phabricator.wikimedia.org/T377126) (owner: Andrew Bogott) [21:50:19] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:50:27] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:50:30] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2096.codfw.wmnet with reason: host reimage [21:54:16] RECOVERY - Hadoop NodeManager on an-worker1195 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:54:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2096.codfw.wmnet with reason: host reimage [21:54:33] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2061 to cirrussearch2061 - bking@cumin2002" [21:54:51] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2061 to cirrussearch2061 - bking@cumin2002" [21:54:51] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:54:52] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host
cirrussearch2061 [21:55:10] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2061 [21:55:50] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2061 to cirrussearch2061 [21:57:47] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2061.codfw.wmnet with OS bullseye [21:57:59] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2061 [21:58:06] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:58:31] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:59:00] (PS1) Dzahn: firewall: temp add rule to allow Istanbul Hackathon [puppet] - https://gerrit.wikimedia.org/r/1138483 (https://phabricator.wikimedia.org/T382309) [21:59:29] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250423T2200) [22:01:20] (PS1) Andrew Bogott: site.pp: add entries for new codfw1dev cloudrabbit servers [puppet] - https://gerrit.wikimedia.org/r/1138484 (https://phabricator.wikimedia.org/T392539) [22:02:06] (CR) Andrew Bogott: [C:+2] site.pp: add entries for new codfw1dev cloudrabbit servers [puppet] - https://gerrit.wikimedia.org/r/1138484 (https://phabricator.wikimedia.org/T392539) (owner: Andrew Bogott) [22:02:35] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2061 - bking@cumin2002" [22:02:40] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by
cookbooks.sre.dns.netbox: Update records for host cirrussearch2061 - bking@cumin2002" [22:02:41] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:02:41] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2061.codfw.wmnet 143.0.192.10.in-addr.arpa 3.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:02:45] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2061.codfw.wmnet 143.0.192.10.in-addr.arpa 3.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:02:45] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2061 [22:03:00] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2061 [22:03:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2061 [22:03:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P75363 and previous config saved to /var/cache/conftool/dbconfig/20250423-220321-fceratto.json [22:11:25] PROBLEM - Bird Internet Routing Daemon on cloudlb2004-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [22:14:51] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2096.codfw.wmnet with OS bullseye [22:14:58] ops-eqiad, SRE, DC-Ops: cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10763198 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cirrussearch2096.codfw.wmnet...
[22:15:19] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudlb2004-dev is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [22:18:03] (PS1) Ryan Kemper: cirrus: add to-be-renamed masters [puppet] - https://gerrit.wikimedia.org/r/1138489 (https://phabricator.wikimedia.org/T388610) [22:18:15] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2061.codfw.wmnet with reason: host reimage [22:18:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P75364 and previous config saved to /var/cache/conftool/dbconfig/20250423-221828-fceratto.json [22:19:11] PROBLEM - haproxy alive on cloudlb2004-dev is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy [22:21:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2061.codfw.wmnet with reason: host reimage [22:23:04] PROBLEM - haproxy process on cloudlb2004-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [22:33:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T391056)', diff saved to https://phabricator.wikimedia.org/P75365 and previous config saved to /var/cache/conftool/dbconfig/20250423-223336-fceratto.json [22:33:41] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:33:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1235.eqiad.wmnet with reason: Maintenance [22:33:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T391056)', diff saved to
https://phabricator.wikimedia.org/P75366 and previous config saved to /var/cache/conftool/dbconfig/20250423-223359-fceratto.json [22:35:26] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:44:49] pt1979@cumin2002 provision (PID 2280826) is awaiting input [22:46:01] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:46:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2061.codfw.wmnet with OS bullseye [22:52:00] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-e1-codfw - pt1979@cumin2002" [22:52:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-e1-codfw - pt1979@cumin2002" [22:52:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:52:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-e1-codfw.mgmt.codfw.wmnet [22:53:20] PROBLEM - Gitlab HTTPS SSL Expiry on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [22:53:41] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device ssw1-e1-codfw.mgmt.codfw.wmnet [22:53:44] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [22:54:02] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 2353 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [22:54:20] RECOVERY - Gitlab HTTPS SSL 
Expiry on gitlab.wikimedia.org is OK: OK - Certificate gitlab.wikimedia.org will expire on Tue 08 Jul 2025 12:50:22 PM GMT +0000. https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [22:54:39] ^ just noticed this, seems recovered now [22:54:44] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: security release [22:54:56] FIRING: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:02] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 115525 bytes in 1.168 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [22:58:12] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-e1-codfw - pt1979@cumin2002" [22:58:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for ssw1-e1-codfw - pt1979@cumin2002" [22:58:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:59:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:00:26] FIRING: [3x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:04:59] the gitlab alerts above were triggered by a version upgrade. the cookbook should downtime them [23:08:08] (PS1) RLazarus: admin_ng: Read RoleBinding usernames from services hiera [deployment-charts] - https://gerrit.wikimedia.org/r/1138494 (https://phabricator.wikimedia.org/T378429) [23:21:30] (CR) Dzahn: "ok," [puppet] - https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: Dzahn) [23:21:58] (PS1) Dzahn: Revert "aptrepo: add jenkins to bookworm section in distributions-wikimedia" [puppet] - https://gerrit.wikimedia.org/r/1138498 [23:22:18] (CR) Dzahn: [C:+2] Revert "aptrepo: add jenkins to bookworm section in distributions-wikimedia" [puppet] - https://gerrit.wikimedia.org/r/1138498 (owner: Dzahn) [23:36:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device ssw1-e1-codfw.mgmt.codfw.wmnet [23:37:05] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-e1-codfw.mgmt.codfw.wmnet [23:37:08] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:37:42] PROBLEM - ganeti-noded running on ganeti1023 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [23:38:42] RECOVERY - ganeti-noded running on ganeti1023 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [23:40:25] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1138499 [23:40:25] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1138499 (owner: TrainBranchBot) [23:41:18] !log pt1979@cumin2002 START - Cookbook
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e1-codfw - pt1979@cumin2002" [23:41:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add management record for lsw1-e1-codfw - pt1979@cumin2002" [23:41:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:43:40] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:44:36] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:45:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T391056)', diff saved to https://phabricator.wikimedia.org/P75367 and previous config saved to /var/cache/conftool/dbconfig/20250423-234521-fceratto.json [23:45:26] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:52:28] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1138499 (owner: TrainBranchBot) [23:53:41] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:49] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:56:51] !log pt1979@cumin2002 START -
Cookbook sre.network.provision for device ssw1-e1-codfw.mgmt.codfw.wmnet [23:56:54] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:58:41] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:59:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)