[00:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:34:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:35:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:35:40] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:39:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198718 [00:39:57] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198718 (owner: 10TrainBranchBot) [00:40:38] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 8.111 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:46:40] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:53:38] RECOVERY - Wikitech-static main page has content on 
wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 9.369 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:54:55] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1198718 (owner: 10TrainBranchBot) [00:55:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:56:40] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:59:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:00:30] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.358 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [01:00:38] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:08:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1198720 [01:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1198720 (owner: 10TrainBranchBot) 
[01:15:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:18:41] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 18m 03s) [01:19:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:29:53] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1198720 (owner: 10TrainBranchBot) [01:34:03] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:39:03] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:52:40] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [01:54:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:54:38] RECOVERY - Wikitech-static main page has 
content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 8.939 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [01:57:40] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [01:59:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:01:38] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30033 bytes in 8.687 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [02:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:59:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:04:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:20:27] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:et-0/0/30 (Core: ssw1-d8-eqiad:ethernet-1/30 {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:50:27] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [04:16:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:25:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:27:06] PROBLEM - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1203 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 
1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [04:27:08] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1203 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T408359 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [04:27:15] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359 (10ops-monitoring-bot) 03NEW [04:29:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:35:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:07:16] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: 
CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:09:03] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:04] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:10:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:14:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:30:27] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:43:57] (03Abandoned) 10Hashar: gerrit: add daemons ssh host key to known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1175114 (https://phabricator.wikimedia.org/T398401) (owner: 10Hashar) [06:04:03] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:07:16] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T0700).
[07:00:05] sfaci and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:18] o/
[07:00:29] o/
[07:01:09] I can deploy
[07:02:31] sfaci: is it OK for you if I deploy your 2 patches and mine in a single deploy?
[07:02:46] Of course!
[07:02:50] ok
[07:02:54] thanks!
[07:05:00] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198404 (owner: Clare Ming)
[07:05:01] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198413 (owner: Clare Ming)
[07:05:01] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [extensions/CirrusSearch] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198529 (https://phabricator.wikimedia.org/T404858) (owner: DCausse)
[07:14:37] (Merged) jenkins-bot: ext.xLab: Implement UnenrolledExperiment#setStream() [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198404 (owner: Clare Ming)
[07:14:37] (Merged) jenkins-bot: ext.xLab: Implement OverriddenExperiment#setStream() [extensions/MetricsPlatform] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198413 (owner: Clare Ming)
[07:14:38] (Merged) jenkins-bot: CompletionSuggester: fix index id format check [extensions/CirrusSearch] (wmf/1.45.0-wmf.24) - https://gerrit.wikimedia.org/r/1198529 (https://phabricator.wikimedia.org/T404858) (owner: DCausse)
[07:15:04] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1198404|ext.xLab: Implement UnenrolledExperiment#setStream()]], [[gerrit:1198413|ext.xLab: Implement OverriddenExperiment#setStream()]], [[gerrit:1198529|CompletionSuggester: fix index id format check (T404858)]]
[07:15:09] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858
[07:20:27] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-e1-eqiad:et-0/0/30 (Core: ssw1-d8-eqiad:ethernet-1/30 {#}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down -
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:27:07] (CR) Elukey: [C:+2] profile::pyrra: add two Xlab SLOs under the data-platform namespace [puppet] - https://gerrit.wikimedia.org/r/1198011 (https://phabricator.wikimedia.org/T398869) (owner: Elukey)
[07:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:32:55] (PS4) JHathaway: EFI: install grub on all EFI partitions [puppet] - https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949)
[07:36:12] SRE-swift-storage, Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11310626 (elukey) @MatthewVernon thanks a lot for the tests! So I see two issues in this task: 1) Debian install doesn't duplicate...
[07:38:39] (CR) Elukey: [C:+1] EFI: install grub on all EFI partitions [puppet] - https://gerrit.wikimedia.org/r/1082288 (https://phabricator.wikimedia.org/T376949) (owner: JHathaway)
[07:39:58] !log dcausse@deploy2002 cjming, dcausse: Backport for [[gerrit:1198404|ext.xLab: Implement UnenrolledExperiment#setStream()]], [[gerrit:1198413|ext.xLab: Implement OverriddenExperiment#setStream()]], [[gerrit:1198529|CompletionSuggester: fix index id format check (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:40:03] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858
[07:40:25] sfaci: should be ready for testing, please let me know if everything's OK on your side
[07:40:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:45:54] (PS1) Muehlenhoff: Record LDAP access for jmoore111 [puppet] - https://gerrit.wikimedia.org/r/1198903
[07:46:02] @dcausse Tested! It's working fine!
[07:46:07] Thank you very much!
[07:46:19] sfaci: thanks, continuing with the sync then
[07:46:42] !log dcausse@deploy2002 cjming, dcausse: Continuing with sync
[07:46:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:46:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:47:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:47:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:48:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:48:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:49:13] (CR) Muehlenhoff: [C:+2] Record LDAP access for jmoore111 [puppet] - https://gerrit.wikimedia.org/r/1198903 (owner: Muehlenhoff)
[07:50:27] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:51:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE
helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:52:46] (PS1) Elukey: Revert "profile::pyrra: add two Xlab SLOs under the data-platform namespace" [puppet] - https://gerrit.wikimedia.org/r/1198904
[07:53:53] jouncebot: next
[07:53:53] In 2 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T1000)
[07:55:20] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:55:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply
[07:57:51] I'll need to extend the backport window by 15mins to ship a last patch when this deploy is done
[07:59:14] (CR) Elukey: [C:+2] Revert "profile::pyrra: add two Xlab SLOs under the data-platform namespace" [puppet] - https://gerrit.wikimedia.org/r/1198904 (owner: Elukey)
[08:01:18] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198404|ext.xLab: Implement UnenrolledExperiment#setStream()]], [[gerrit:1198413|ext.xLab: Implement OverriddenExperiment#setStream()]], [[gerrit:1198529|CompletionSuggester: fix index id format check (T404858)]] (duration: 46m 13s)
[08:01:23] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858
[08:01:36] sfaci: should be live
[08:05:15] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198295 (owner: DCausse)
[08:06:05] (Merged) jenkins-bot: Revert^2 "cirrus: enable completion search with defaultsort A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198295 (owner: DCausse)
[08:06:11] (CR) Slyngshede: [C:+1] admin: add a-pizzata to analytics-admins, deployment [puppet] - https://gerrit.wikimedia.org/r/1198531 (https://phabricator.wikimedia.org/T407228) (owner: Kamila Součková)
[08:06:19]
06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11310661 (10SLyngshede-WMF) [08:06:27] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1198295|Revert^2 "cirrus: enable completion search with defaultsort A/B test"]] [08:10:42] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1198295|Revert^2 "cirrus: enable completion search with defaultsort A/B test"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:12:25] (03PS1) 10Zabe: Correctly check if value is not false [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198907 [08:16:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:40] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:22:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 5.598 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:25:27] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-drmrs:xe-0/1/5 (DISABLED) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-drmrs:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:26:09] !log dcausse@deploy2002 dcausse: Continuing with sync [08:26:40] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access 
to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11310677 (10DPogorzelski-WMF) Hello, would it be possible to have it approved? Thanks! [08:32:22] (03PS10) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) [08:33:14] (03CR) 10CI reject: [V:04-1] hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [08:34:20] (03CR) 10Pmiazga: api-gateway: make cookie name configurable for testing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [08:34:40] (03PS1) 10Muehlenhoff: imposm-initial-import: Read the permissions from a file [puppet] - 10https://gerrit.wikimedia.org/r/1198908 [08:36:28] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198295|Revert^2 "cirrus: enable completion search with defaultsort A/B test"]] (duration: 30m 01s) [08:36:50] (03CR) 10Elukey: [C:03+1] imposm-initial-import: Read the permissions from a file [puppet] - 10https://gerrit.wikimedia.org/r/1198908 (owner: 10Muehlenhoff) [08:37:34] !log closing UTC morning backport window [08:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:09] (03PS1) 10Elukey: profile::thanos: add two recording rules for Xlab's SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1198911 (https://phabricator.wikimedia.org/T398869) [08:44:16] (03PS2) 10Muehlenhoff: imposm-initial-import: Read the permissions from a file [puppet] - 10https://gerrit.wikimedia.org/r/1198908 (https://phabricator.wikimedia.org/T381565) [08:44:59] (03PS11) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 
(https://phabricator.wikimedia.org/T405586) [08:45:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198908 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:46:27] (03PS1) 10Slyngshede: zarcillo: decommission idp hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198912 (https://phabricator.wikimedia.org/T406455) [08:47:37] (03CR) 10Elukey: [C:03+1] imposm-initial-import: Read the permissions from a file [puppet] - 10https://gerrit.wikimedia.org/r/1198908 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [08:48:17] (03PS1) 10Slyngshede: jaeger: decommision old IDP hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198913 (https://phabricator.wikimedia.org/T406455) [08:48:59] (03PS4) 10Cathal Mooney: Nokia DHCP-Relay: use IRB interface IP [homer/public] - 10https://gerrit.wikimedia.org/r/1198506 (https://phabricator.wikimedia.org/T402577) [08:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:57:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [08:58:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [08:59:08] (03CR) 10Cathal Mooney: [C:03+2] team-netops: add checks against Nokia OSPF status [alerts] - 10https://gerrit.wikimedia.org/r/1198095 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [08:59:41] (03CR) 10Elukey: [C:03+1] "The python code LGTM, I vaguely got the problem but I can't really judge on the implementation :D" [homer/public] - 10https://gerrit.wikimedia.org/r/1198506 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:00:50] (03Merged) 10jenkins-bot: 
team-netops: add checks against Nokia OSPF status [alerts] - 10https://gerrit.wikimedia.org/r/1198095 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [09:01:23] (03CR) 10Cathal Mooney: [C:03+2] Nokia DHCP-Relay: use IRB interface IP [homer/public] - 10https://gerrit.wikimedia.org/r/1198506 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:02:39] (03Merged) 10jenkins-bot: Nokia DHCP-Relay: use IRB interface IP [homer/public] - 10https://gerrit.wikimedia.org/r/1198506 (https://phabricator.wikimedia.org/T402577) (owner: 10Cathal Mooney) [09:06:25] (03PS4) 10RLazarus: deployment_server: Add --priority to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196989 (https://phabricator.wikimedia.org/T406212) [09:08:16] (03CR) 10Clément Goubert: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1196989 (https://phabricator.wikimedia.org/T406212) (owner: 10RLazarus) [09:08:24] (03PS4) 10RLazarus: deployment_server: Add --dangerously_fast to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196990 (https://phabricator.wikimedia.org/T406212) [09:11:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. I recommend going forward to add a comment with the hostname to each record, just going with the IPs makes these really easy t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198912 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [09:12:38] (03CR) 10DCausse: [C:03+1] dumps: Sync cirrus index dumps from hdfs [puppet] - 10https://gerrit.wikimedia.org/r/1184585 (https://phabricator.wikimedia.org/T366248) (owner: 10Ebernhardson) [09:15:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:59] (03CR) 10Clément Goubert: [C:03+1] "LGTM, thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/1196990 (https://phabricator.wikimedia.org/T406212) (owner: RLazarus)
[09:19:31] (CR) Slyngshede: "The two new IDP hosts are commented, it's just the rest of the IPs that aren't" [deployment-charts] - https://gerrit.wikimedia.org/r/1198912 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede)
[09:23:08] (CR) Federico Ceratto: [C:+2] zarcillo: decommission idp hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1198912 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede)
[09:23:32] SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/26-Q1): Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11310822 (fgiunchedi) Open→Resolved This is complete, both hosts are in service with full weight an...
[09:24:19] SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/26-Q1): cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11310826 (fgiunchedi)
[09:24:25] SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/26-Q1): cloudcephosd10[48-52] service implementation - https://phabricator.wikimedia.org/T395910#11310828 (fgiunchedi) In progress→Resolved Completed
[09:25:04] (Merged) jenkins-bot: zarcillo: decommission idp hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1198912 (https://phabricator.wikimedia.org/T406455) (owner: Slyngshede)
[09:25:26] !log ammarpad@deploy2002 mwscript-k8s job started: extensions/Translate/scripts/moveTranslatableBundle.php --wiki=metawiki 'Lingua Libre/SignIt' SignIt Ammarpad --reason 'requested at [[:phab:T408314]]' # T408314
[09:25:30] T408314: Request to move translatable page on meta from "Lingua Libre/SignIt" to "SignIt" - https://phabricator.wikimedia.org/T408314
[09:31:40] (CR) Muehlenhoff: [C:+1] "There's also some mystery entries (e.g.
208.80.153.12), maybe these are former IDP hosts which slipped through in the past?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198912 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [09:33:09] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198913 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [09:33:31] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11310845 (10fgiunchedi) We have successfully put in service cloudcephosd1050 and cloudcephosd1051 in {T405478} with single-nic, I haven't seen any problem whatsoever with... [09:34:00] !log cmooney@cumin1003 START - Cookbook sre.hosts.remove-downtime for ssw1-d8-eqiad [09:34:01] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ssw1-d8-eqiad [09:34:22] (03CR) 10Muehlenhoff: [C:03+2] imposm-initial-import: Read the permissions from a file [puppet] - 10https://gerrit.wikimedia.org/r/1198908 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:37:31] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[09:39:52] !log slyngshede@cumin1003 START - Cookbook sre.hosts.decommission for hosts idp1004.wikimedia.org [09:40:09] !log slyngshede@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts idp1004.wikimedia.org [09:43:36] (03PS1) 10Slyngshede: P:idp decommision Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/1198921 (https://phabricator.wikimedia.org/T406455) [09:46:34] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1198921 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [09:49:29] (03CR) 10Clément Goubert: [C:03+2] "I don't remember if we'd landed on needing an announcement for these (I think we did), but that would only apply for when we route `/api/r" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197731 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [09:49:42] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11310972 (10cmooney) >>! In T399180#11310845, @fgiunchedi wrote: > @taavi @Andrew @cmooney what do you think of the above? The plan sounds good. We need to audit and ma... 
[09:50:30] (03PS1) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198922 [09:51:35] (03Merged) 10jenkins-bot: Update /page/ lint routes to use the new rest.php endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197731 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [09:56:03] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:56:06] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:56:10] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [09:56:15] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [09:56:40] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:57:16] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:57:23] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:57:44] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:59:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:56] (03CR) 10Clément Goubert: [C:03+1] Move rest_v1-wikimedia.json under the wwwportal directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198405 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T1000) [10:00:52] (03CR) 10Clément Goubert: [C:03+1] "Depends on Ia9ceb76e36e3d699fb2821eddc351b7e6113a4ea" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1198406 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [10:03:17] 06SRE, 06cloud-services-team: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware - https://phabricator.wikimedia.org/T407586#11311121 (10fgiunchedi) p:05Triage→03High [10:04:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:05:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:05:48] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378 (10cmooney) 03NEW p:05Triage→03Medium [10:08:29] (03CR) 10Tiziano Fogli: [C:03+1] profile::thanos: add two recording rules for Xlab's SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1198911 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [10:15:46] (03PS1) 10Federico Ceratto: zarcillo: remove obsoleted IDP egress ipaddrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198924 (https://phabricator.wikimedia.org/T384810) [10:15:46] (03CR) 10Federico Ceratto: "(Just cleanup, no real changes expected)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198924 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [10:16:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[10:19:55] (03PS1) 10Clément Goubert: api-gateway: Rename limit_group to policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198925 (https://phabricator.wikimedia.org/T408192) [10:19:55] (03CR) 10Clément Goubert: "No-op for `api-gateway`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198925 (https://phabricator.wikimedia.org/T408192) (owner: 10Clément Goubert) [10:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:22:22] (03PS1) 10Cathal Mooney: OSPF alert: fix error in grouping of labels [alerts] - 10https://gerrit.wikimedia.org/r/1198926 (https://phabricator.wikimedia.org/T408378) [10:22:44] (03CR) 10Slyngshede: [C:03+2] P:idp decommision Bookworm hosts [puppet] - 10https://gerrit.wikimedia.org/r/1198921 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [10:24:55] !log slyngshede@cumin1003 START - Cookbook sre.hosts.decommission for hosts idp1004.wikimedia.org [10:29:57] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [10:33:18] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1004.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1003" [10:34:42] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp1004.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1003" [10:34:42] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:34:43] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp1004.wikimedia.org [10:38:59] (03PS1) 10Elukey: WIP: sre.hosts.reboot-single: add the 
powercycle option [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [10:42:17] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway group0 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198929 (https://phabricator.wikimedia.org/T408223) [10:42:19] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198930 (https://phabricator.wikimedia.org/T408223) [10:42:21] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198931 (https://phabricator.wikimedia.org/T408223) [10:42:23] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198932 (https://phabricator.wikimedia.org/T408223) [10:42:25] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198933 (https://phabricator.wikimedia.org/T408223) [10:42:27] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198934 (https://phabricator.wikimedia.org/T408223) [10:42:29] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198935 (https://phabricator.wikimedia.org/T408223) [10:42:31] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198936 (https://phabricator.wikimedia.org/T408223) [10:42:33] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198937 (https://phabricator.wikimedia.org/T408223) [10:42:38] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198938 (https://phabricator.wikimedia.org/T408223) [10:42:42] (03PS1) 10Clément Goubert: trafficserver: 
action api to rest-gateway enwiki 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198939 (https://phabricator.wikimedia.org/T408223) [10:42:46] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway enwiki 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198940 (https://phabricator.wikimedia.org/T408223) [10:42:50] (03PS1) 10Clément Goubert: trafficserver: action api to rest-gateway cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223) [10:42:54] (03PS2) 10Elukey: WIP: sre.hosts.reboot-single: add the powercycle option [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [10:46:17] (03CR) 10Tiziano Fogli: [C:03+1] OSPF alert: fix error in grouping of labels [alerts] - 10https://gerrit.wikimedia.org/r/1198926 (https://phabricator.wikimedia.org/T408378) (owner: 10Cathal Mooney) [10:46:39] (03PS3) 10Elukey: WIP: sre.hosts.reboot-single: add the powercycle option [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [10:48:32] !log slyngshede@cumin1003 START - Cookbook sre.hosts.decommission for hosts idp2004.wikimedia.org [10:48:33] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11311294 (10JMoore-WMF) I just checked- and I still **c... [10:50:08] (03CR) 10Elukey: "It is probably better a separate cookbook, too many differences with reboot single." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [10:53:33] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [10:54:29] (03PS1) 10Slyngshede: P:idp-test decommission idp-test1004 [puppet] - 10https://gerrit.wikimedia.org/r/1198946 (https://phabricator.wikimedia.org/T406455) [10:55:23] (03CR) 10Brouberol: [C:03+2] Configure production shell access and posix groups for jmoore111 [puppet] - 10https://gerrit.wikimedia.org/r/1198504 (https://phabricator.wikimedia.org/T408164) (owner: 10Btullis) [10:59:14] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2004.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1003" [10:59:56] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp2004.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1003" [10:59:56] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:59:57] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp2004.wikimedia.org [11:00:36] jouncebot: nowandnext [11:00:36] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [11:00:36] In 1 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T1300) [11:01:44] (03CR) 10Zabe: [C:03+2] Correctly check if value is not false [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198907 (owner: 10Zabe) [11:03:19] (03Merged) 10jenkins-bot: Correctly check if value is not false [extensions/WikimediaMaintenance] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198907 (owner: 10Zabe) [11:04:51] !log zabe@deploy2002 Started scap sync-world: Backport for 
[[gerrit:1198907|Correctly check if value is not false]] [11:08:05] (03CR) 10Cathal Mooney: [C:03+2] OSPF alert: fix error in grouping of labels [alerts] - 10https://gerrit.wikimedia.org/r/1198926 (https://phabricator.wikimedia.org/T408378) (owner: 10Cathal Mooney) [11:08:38] !log zabe@deploy2002 zabe: Backport for [[gerrit:1198907|Correctly check if value is not false]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:09:08] !log zabe@deploy2002 zabe: Continuing with sync [11:10:00] (03Merged) 10jenkins-bot: OSPF alert: fix error in grouping of labels [alerts] - 10https://gerrit.wikimedia.org/r/1198926 (https://phabricator.wikimedia.org/T408378) (owner: 10Cathal Mooney) [11:17:00] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198907|Correctly check if value is not false]] (duration: 12m 08s) [11:24:39] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work, 13Patch-For-Review: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11311375 (10JMoore-WMF) access confirmed to airflow, su... 
[11:24:51] (03CR) 10Zabe: [C:03+2] Using Hadoop for MostTranscludedPages on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196112 (https://phabricator.wikimedia.org/T309738) (owner: 10Zabe) [11:25:42] (03Merged) 10jenkins-bot: Using Hadoop for MostTranscludedPages on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196112 (https://phabricator.wikimedia.org/T309738) (owner: 10Zabe) [11:26:14] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1196112|Using Hadoop for MostTranscludedPages on testwiki (T309738)]] [11:26:19] T309738: Move MediaWiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738 [11:30:21] !log zabe@deploy2002 zabe: Backport for [[gerrit:1196112|Using Hadoop for MostTranscludedPages on testwiki (T309738)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [11:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:31:12] !log zabe@deploy2002 zabe: Continuing with sync [11:37:16] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196112|Using Hadoop for MostTranscludedPages on testwiki (T309738)]] (duration: 11m 02s) [11:37:25] T309738: Move MediaWiki QueryPages computation to Hadoop - https://phabricator.wikimedia.org/T309738 [11:39:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [11:39:23] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [11:43:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [11:43:50] !log 
brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [11:45:24] (03PS12) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) [11:45:25] (03PS1) 10Kosta Harlan: hCaptcha: Define HCaptchaSiteKey in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198949 (https://phabricator.wikimedia.org/T405586) [11:45:30] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11311453 (10Jclark-ctr) a:03Jclark-ctr Updated Idrac Firmware to 7.20.60.50 [11:46:39] (03CR) 10Muehlenhoff: [C:03+1] "I think that's a good workaround. There is now actually a new feature in systemd which allows to fix this properly, but it got only added " [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [11:47:06] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Nokia OSPF alerts not working - https://phabricator.wikimedia.org/T408378#11311458 (10cmooney) Ok well I fixed the obvious error but the alerts still aren't firing :( [11:53:23] (03CR) 10Kamila Součková: [C:03+2] admin: add a-pizzata to analytics-admins, deployment [puppet] - 10https://gerrit.wikimedia.org/r/1198531 (https://phabricator.wikimedia.org/T407228) (owner: 10Kamila Součková) [11:56:51] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1198946 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [12:01:35] (03CR) 10Slyngshede: [C:03+2] P:idp-test decommission idp-test1004 [puppet] - 10https://gerrit.wikimedia.org/r/1198946 (https://phabricator.wikimedia.org/T406455) (owner: 10Slyngshede) [12:03:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.835s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:07:02] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11311497 (10cmooney) >>! In T396065#11307254, @VRiley-WMF wrote: > Okay, I believe it should be up. > > ssw1-d8-eqiad cable has been moved from port 30 to 31. > > ssw1-d8-eqiad had the cable... [12:08:56] !log slyngshede@cumin1003 START - Cookbook sre.hosts.decommission for hosts idp-test1004.wikimedia.org [12:10:14] (03PS1) 10Muehlenhoff: Remove Cumin aliases for legacy mediawiki servers [puppet] - 10https://gerrit.wikimedia.org/r/1198952 [12:10:56] (03PS14) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [12:10:57] (03PS1) 10Clément Goubert: api-gateway: csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198953 (https://phabricator.wikimedia.org/T406490) [12:12:02] (03PS1) 10Cathal Mooney: ssw1-f1-eqiad: add BGP peerings to ssw1-d1/d8 [homer/public] - 10https://gerrit.wikimedia.org/r/1198954 (https://phabricator.wikimedia.org/T396065) [12:12:08] (03CR) 10CI reject: [V:04-1] ssw1-f1-eqiad: add BGP peerings to ssw1-d1/d8 [homer/public] - 10https://gerrit.wikimedia.org/r/1198954 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [12:12:53] (03PS15) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) [12:12:53] (03PS2) 10Clément Goubert: api-gateway: csp header handling [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1198953 (https://phabricator.wikimedia.org/T406490) [12:12:56] (03Abandoned) 10Cathal Mooney: ssw1-f1-eqiad: add BGP peerings to ssw1-d1/d8 [homer/public] - 10https://gerrit.wikimedia.org/r/1198954 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [12:13:46] !log slyngshede@cumin1003 START - Cookbook sre.dns.netbox [12:14:12] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11311539 (10Jclark-ctr) Service Request 217773984 [12:16:33] !log installing Java 21 security updates [12:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:16] !log slyngshede@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test1004.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1003" [12:17:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [12:17:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [12:17:54] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: idp-test1004.wikimedia.org decommissioned, removing all IPs except the asset tag one - slyngshede@cumin1003" [12:17:54] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:17:55] !log slyngshede@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts idp-test1004.wikimedia.org [12:18:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.084s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:23:16] (03PS1) 10Cathal Mooney: ssw1-f1-eqiad: add peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1198958 (https://phabricator.wikimedia.org/T396065) [12:23:22] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [12:23:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [12:25:09] (03CR) 10Cathal Mooney: [C:03+2] ssw1-f1-eqiad: add peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1198958 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [12:25:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.137s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:26:31] (03Merged) 10jenkins-bot: ssw1-f1-eqiad: add peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1198958 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [12:29:47] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool es2026 T408385', diff saved to https://phabricator.wikimedia.org/P84307 and previous config saved to /var/cache/conftool/dbconfig/20251027-122946-fceratto.json [12:29:57] T408385: decommission es2026 - https://phabricator.wikimedia.org/T408385 [12:30:47] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[12:31:06] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:31:34] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:32:22] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:35:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.28s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:36:19] (03PS1) 10Federico Ceratto: instances.yaml: remove es2026 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1198962 (https://phabricator.wikimedia.org/T408385) [12:38:59] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:39:30] (03CR) 10Pmiazga: api-gateway: Use metadata to flip csp header handling (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [12:39:49] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:40:18] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:40:32] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:40:52] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:40:55] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[12:42:43] (03PS4) 10Michael Große: beta: Enable ReviseTone Structured Task on enwiki,frwiki,arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) [12:42:56] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11311601 (10Raine) 05Open→03Resolved Done, let me know if something isn't working :-) [12:51:06] (03CR) 10MVernon: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1198962 (https://phabricator.wikimedia.org/T408385) (owner: 10Federico Ceratto) [12:52:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [12:52:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [12:53:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.261s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:54:20] 06SRE, 10Hiddenparma, 06Traffic: FY 25/26 WE 5.4.5: Enforce global rate-limits - https://phabricator.wikimedia.org/T406545#11311628 (10ssingh) [12:54:24] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2026 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1198962 (https://phabricator.wikimedia.org/T408385) (owner: 10Federico Ceratto) [12:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire 
[12:56:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [12:56:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [12:58:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.244s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:00:05] Urbanecm and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:02:17] !log fceratto@cumin1003 dbctl commit (dc=all): 'Remove es2026 from dbctl T408385', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20251027-130212-fceratto.json [13:02:37] T408385: decommission es2026 - https://phabricator.wikimedia.org/T408385 [13:04:36] (03PS3) 10Brouberol: postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) [13:04:36] (03PS1) 10Brouberol: cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) [13:04:38] (03PS1) 10Brouberol: cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198975 (https://phabricator.wikimedia.org/T406578) [13:04:47] (03CR) 10CI reject: 
[V:04-1] postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [13:04:49] (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [13:04:51] (03CR) 10CI reject: [V:04-1] cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198975 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [13:07:40] (03PS1) 10Brouberol: Definition of a ferretdb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) [13:07:44] (03PS1) 10Brouberol: ferretdb-growthbook: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) [13:08:49] (03PS2) 10Brouberol: cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) [13:08:49] (03PS2) 10Brouberol: cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198975 (https://phabricator.wikimedia.org/T406578) [13:08:49] (03PS4) 10Brouberol: postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) [13:09:18] (03PS2) 10Brouberol: Definition of a ferretdb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) [13:09:18] (03PS2) 10Brouberol: ferretdb-growthbook: define helmfile and values 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) [13:11:42] (03CR) 10CI reject: [V:04-1] ferretdb-growthbook: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [13:13:47] (03PS3) 10Brouberol: ferretdb-growthbook: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) [13:20:35] (03CR) 10Clément Goubert: api-gateway: Use metadata to flip csp header handling (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [13:26:17] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [13:26:20] (03CR) 10Daniel Kinzler: api-gateway: make cookie name configurable for testing (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198385 (https://phabricator.wikimedia.org/T408128) (owner: 10Daniel Kinzler) [13:28:16] (03Merged) 10jenkins-bot: api-gateway: Use metadata to flip csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198310 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [13:29:11] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:36:35] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts failoid2002.codfw.wmnet [13:38:12] (03PS1) 10Clément Goubert: Revert "api-gateway: Use metadata to flip csp header handling" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198981 [13:40:42] (03CR) 10Clément Goubert: [C:03+2] Revert "api-gateway: Use metadata to flip csp header handling" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198981 (owner: 
10Clément Goubert) [13:41:34] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:42:49] (03Merged) 10jenkins-bot: Revert "api-gateway: Use metadata to flip csp header handling" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198981 (owner: 10Clément Goubert) [13:43:05] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:43:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11311818 (10Jclark-ctr) [13:46:02] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad racks for variance from Netbox - https://phabricator.wikimedia.org/T407851#11311826 (10Jclark-ctr) 05Open→03Resolved [13:47:17] jmm@cumin2002 decommission (PID 1250413) is awaiting input [13:48:42] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: failoid2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:48:55] (03PS17) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [13:48:55] (03CR) 10Jelto: [V:03+1] "tested on `tcp-proxy-test.devtools.eqiad1.wikimedia.cloud`, see details in T365259#11311832" [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [13:49:19] (03PS1) 10Zabe: BETA: Stop using Hadoop data for Mostlinkedtemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198983 [13:49:53] (03PS3) 10Clément Goubert: api-gateway: csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198953 (https://phabricator.wikimedia.org/T406490) [13:50:31] (03CR) 10Zabe: [C:03+2] BETA: Stop using Hadoop data for Mostlinkedtemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198983 (owner: 10Zabe) [13:51:20] (03Merged) 
10jenkins-bot: BETA: Stop using Hadoop data for Mostlinkedtemplates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198983 (owner: 10Zabe) [13:51:47] jmm@cumin2002 decommission (PID 1250413) is awaiting input [13:52:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: failoid2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:52:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:52:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts failoid2002.codfw.wmnet [13:52:17] 06SRE, 06Infrastructure-Foundations: Move failoid to trixie - https://phabricator.wikimedia.org/T402406#11311841 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `failoid2002.codfw.wmnet` - failoid2002.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager... 
[13:54:30] (03PS1) 10Stevemunene: DNS: Add druid-public-coordinator records [dns] - 10https://gerrit.wikimedia.org/r/1198984 (https://phabricator.wikimedia.org/T406222) [13:55:02] (03PS2) 10Stevemunene: DNS: Add druid-public-coordinator records [dns] - 10https://gerrit.wikimedia.org/r/1198984 (https://phabricator.wikimedia.org/T406222) [13:56:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11311845 (10bking) 05Open→03In progress [13:56:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11311847 (10bking) 05Open→03In progress [13:56:58] (03Abandoned) 10Stevemunene: DNS: Add druid-public-coordinator records [dns] - 10https://gerrit.wikimedia.org/r/1198984 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [13:57:52] (03CR) 10Clément Goubert: [C:03+2] api-gateway: csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198953 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [14:00:03] (03Merged) 10jenkins-bot: api-gateway: csp header handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198953 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [14:00:21] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11311875 (10bking) Hello DC Ops, I've added the hosts and partman recipe to Puppet as requested. Please note that the part... 
[14:00:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo200[1-3] - https://phabricator.wikimedia.org/T405964#11311880 (10bking) 05In progress→03Stalled [14:01:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Q2:rack/setup/install ganeti-jumbo100[1-3] - https://phabricator.wikimedia.org/T405966#11311886 (10bking) 05In progress→03Stalled Hello DC Ops, I've added the hosts and partman recipe to Puppet as requeste... [14:01:54] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11311894 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:02:37] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:02:57] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:03:32] !log zabe@deploy2002 mwscript-k8s job started: extensions/WikimediaMaintenance/T389026.php --wiki=dewikivoyage # T389026 [14:03:45] T389026: Rethink rev_sha1 field - https://phabricator.wikimedia.org/T389026 [14:04:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:04:22] (03PS1) 10Andrew Bogott: Preseed: one more grub experiment [puppet] - 10https://gerrit.wikimedia.org/r/1198986 (https://phabricator.wikimedia.org/T407586) [14:04:56] !log zabe@deploy2002 mwscript-k8s job started: extensions/WikimediaMaintenance/T389026.php --wiki=itwikivoyage # T389026 [14:05:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:05:58] (03PS2) 10Stevemunene: DNS: Add druid-public-coordinator record [dns] - 10https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222) [14:06:33] (03CR) 10Stevemunene: DNS: Add druid-public-coordinator record (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:07:35] (03CR) 10Andrew Bogott: [C:03+2] Preseed: one more grub experiment [puppet] - 10https://gerrit.wikimedia.org/r/1198986 (https://phabricator.wikimedia.org/T407586) (owner: 10Andrew Bogott) [14:11:31] (03PS2) 10Clément Goubert: rest-gateway: Fix csp_enabled configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198987 (https://phabricator.wikimedia.org/T406490) [14:14:17] (03CR) 10Fabfur: [C:04-1] "beware of typos! :)" [dns] - 10https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:14:46] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:16:32] (03CR) 10Brouberol: [C:03+1] "As @fabfur mentioned, there are typos in the DNS records" [dns] - 10https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:16:36] (03CR) 10Fabfur: [C:03+1] LVS: etcd data for druid-public-coordinator (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:17:00] (03PS3) 10Stevemunene: DNS: Add druid-public-coordinator record [dns] - 10https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222) [14:17:39] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix csp_enabled configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198987 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [14:18:36] (03CR) 10Brouberol: LVS: etcd data for druid-public-coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:19:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:43] (03Merged) 10jenkins-bot: rest-gateway: Fix csp_enabled configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198987 (https://phabricator.wikimedia.org/T406490) (owner: 10Clément Goubert) [14:19:55] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:20:04] (03CR) 10Fabfur: [C:03+1] "ok for me, aside from a small nit" [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:20:22] !log andrew@cumin2002 START - Cookbook 
sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [14:20:24] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:20:27] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:20:36] stevemunene: I would coordinate w/ sukhe before merging the patches above [14:20:43] as he's working on eqiad currently [14:20:57] (03CR) 10Brouberol: [C:03+1] "LGTM, although I'm not an expert in the configuration of pybal services. Don't let that stop you." [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:21:29] Ack, thanks for the reviews fabfur :) [14:21:39] stevemunene: yeah, we have some reboots in progress [14:21:43] which patch is this? [14:23:19] (03CR) 10Daniel Kinzler: [C:03+1] "Thanks! I couldn't find any remaining references to "group" referring to rate limits." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1198925 (https://phabricator.wikimedia.org/T408192) (owner: 10Clément Goubert) [14:23:30] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:23:58] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:24:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:04] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:24:20] yea this is fine [14:24:29] I don't want to downtime this [14:24:32] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:24:43] https://usercontent.irccloud-cdn.com/file/Op4h0J6f/image.png [14:24:57] :) [14:25:23] sukhe: https://gerrit.wikimedia.org/r/c/operations/dns/+/1198500, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198498, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198499 [14:25:27] (03CR) 10Zabe: "The three weeks in the announcement are over, I think we can merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) (owner: 10Zabe) [14:25:38] me remembering fabfur is on on-call https://www.youtube.com/watch?v=ZZ5LpwO-An4 [14:25:49] (03CR) 10Stevemunene: DNS: Add druid-public-coordinator record (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:26:24] fabfur: thanks [14:26:28] Hi sukhe, adding a new service to eqiad, the whole stack. But still pending some reviews. 
I shall ping you later on [14:26:38] stevemunene: ok thanks, please check with us once [14:27:01] (03CR) 10Fabfur: [C:03+1] "ok for me!" [dns] - 10https://gerrit.wikimedia.org/r/1198500 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:27:07] Sure will ;) [14:28:20] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:29:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:35] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:29:52] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T1430) [14:33:04] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1016 is CRITICAL: CRITICAL: Service pybal.service is not active. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [14:33:43] _fine_ I will downtime [14:34:16] :D [14:34:20] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1016.eqiad.wmnet with reason: reboot [14:35:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:33] (03PS2) 10Clément Goubert: api-gateway: Rename limit_group to policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198925 (https://phabricator.wikimedia.org/T408192) [14:38:10] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [14:40:45] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407833#11312085 (10LSobanski) a:03cmooney [14:41:14] andrew@cumin2002 reimage (PID 1262710) is awaiting input [14:41:39] !log dancy@deploy2002 Installing scap version "4.216.0" for 165 host(s) [14:41:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198626 (https://phabricator.wikimedia.org/T408284) (owner: 10Bunnypranav) [14:42:48] (03PS1) 10Andrew Bogott: preseed: further attempt single-drive grub install [puppet] - 10https://gerrit.wikimedia.org/r/1198996 (https://phabricator.wikimedia.org/T407586) [14:43:28] (03CR) 10Andrew Bogott: [C:03+2] preseed: further attempt single-drive grub install [puppet] - 10https://gerrit.wikimedia.org/r/1198996 (https://phabricator.wikimedia.org/T407586) (owner: 10Andrew Bogott) [14:44:40] !log 
sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host lvs1016.eqiad.wmnet [14:45:39] !log dancy@deploy2002 Installation of scap version "4.216.0" completed for 165 hosts [14:46:20] (03CR) 10Ahmon Dancy: scap: remove testservers 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [14:46:56] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1016.eqiad.wmnet [14:46:58] PROBLEM - Host lvs1016 is DOWN: PING CRITICAL - Packet loss = 100% [14:47:02] RECOVERY - Host lvs1016 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [14:47:03] huh [14:47:04] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:47:58] PROBLEM - pybal on lvs1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:49:02] (03CR) 10Dr0ptp4kt: [C:03+1] profile::thanos: add two recording rules for Xlab's SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1198911 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [14:49:04] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:49:58] RECOVERY - pybal on lvs1016 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:50:43] (03CR) 10CDanis: [C:03+1] P:cache::haproxy: start preparing for known-client DSL [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [14:52:57] 06SRE, 06Data-Engineering (Q2 FY25/26 October 1st - December 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#11312146 (10Ottomata) Hm! 
Before we do this, should we take a bigger picture look around what we intend to do with re... [14:53:04] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1016 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [14:53:20] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 8 connections established with conf1007.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [14:53:30] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [14:54:28] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Rename limit_group to policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198925 (https://phabricator.wikimedia.org/T408192) (owner: 10Clément Goubert) [14:56:13] (03Merged) 10jenkins-bot: api-gateway: Rename limit_group to policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198925 (https://phabricator.wikimedia.org/T408192) (owner: 10Clément Goubert) [14:56:46] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:56:51] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:57:29] (03CR) 10Zabe: "Actually, we can give T407814 a bit of time as long as we are not yet at the point to stop writing to this column" [puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) (owner: 10Zabe) [14:57:45] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:57:58] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:58:04] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:58:20] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:58:29] !log cgoubert@deploy2002 helmfile [codfw] 
START helmfile.d/services/rest-gateway: apply [14:58:39] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:58:43] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:59:08] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:59:57] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:00:20] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:00:31] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [15:01:20] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [15:09:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:17] (03CR) 10Elukey: [C:03+2] profile::thanos: add two recording rules for Xlab's SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1198911 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [15:12:59] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - Status - issue on cp2031:9290 - https://phabricator.wikimedia.org/T408230#11312237 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:14:52] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts failoid1002.eqiad.wmnet [15:19:50] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:23:22] !log disable-puppet on A:cp hosts for haproxy config change [15:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:31] (03CR) 10Scott French: [C:03+2] P:cache::haproxy: start preparing for known-client DSL [puppet] - 10https://gerrit.wikimedia.org/r/1193275 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [15:25:38] (03PS4) 10Elukey: Add the 
sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [15:25:45] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: failoid1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:27:09] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle-single for host sretest2010 [15:27:11] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.powercycle-single (exit_code=99) for host sretest2010 [15:28:50] jmm@cumin2002 decommission (PID 1269530) is awaiting input [15:29:13] (03CR) 10Stevemunene: [C:03+1] cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [15:29:23] (03CR) 10Stevemunene: [C:03+1] cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198975 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [15:29:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: failoid1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [15:29:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts failoid1002.eqiad.wmnet [15:29:48] 06SRE, 06Infrastructure-Foundations: Move failoid to trixie - https://phabricator.wikimedia.org/T402406#11312356 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `failoid1002.eqiad.wmnet` - failoid1002.eqiad.wmnet (**PASS**) - Downtimed host on Icinga/Alertmanager... [15:30:04] jan_drewniak: It is that lovely time of the day again! 
You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T1530). [15:30:27] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:30:28] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:31:30] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [15:31:53] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [15:33:59] (03PS5) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [15:34:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:22] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host sretest2010 [15:36:50] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2010 [15:38:23] !log rolling run-puppet-agent on A:cp hosts for haproxy config change [15:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:38] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1018.eqiad.wmnet with reason: reboot [15:42:48] 06SRE, 
06Infrastructure-Foundations: Move failoid to trixie - https://phabricator.wikimedia.org/T402406#11312536 (10MoritzMuehlenhoff) 05Open→03Resolved All done, the new VMs are in place (failoid1003/2003) and the old Bullseye nodes were decommissioned. [15:44:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.075s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:44:26] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host sretest2010 [15:45:18] (03CR) 10Stevemunene: [C:03+1] postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [15:46:41] (03CR) 10Scott French: scap: remove testservers 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198019 (https://phabricator.wikimedia.org/T397498) (owner: 10Effie Mouzeli) [15:46:49] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2010 [15:48:52] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11312553 (10elukey) Something is definitely wrong, since I waited for an hour and `-> reset /system1/pwrmgtsvc1` still hanged (not sure if the host rebooted in the meatime). Power... 
[15:49:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.075s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:49:43] (03PS1) 10Muehlenhoff: osm_master: Remove support for pre Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1199005 (https://phabricator.wikimedia.org/T381565) [15:54:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: eqiad row C/D Machine Learning host migrations - https://phabricator.wikimedia.org/T405647#11312569 (10klausman) >>! In T405647#11298600, @RobH wrote: > Please note this migration has shifted from Oct 15th start date to Nov 1 start date. I'll be avail... [15:54:30] (03PS6) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [15:54:34] (03CR) 10Fabfur: [C:03+1] "lgtm! Great job, also for documenting all rules!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [15:55:00] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host sretest2010 [15:56:15] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) (owner: 10Kamila Součková) [15:56:46] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2010 [15:57:01] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198924 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [15:59:23] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol1008-dev.eqiad.wmnet with OS trixie [15:59:56] (03CR) 10Federico Ceratto: [C:03+2] zarcillo: remove obsoleted IDP egress ipaddrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198924 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [16:02:26] (03PS1) 10Andrew Bogott: cloudcontrol2010-dev preseed: move to flat partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1199007 [16:05:24] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host lvs1018.eqiad.wmnet [16:08:00] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11312651 (10Dzahn) a:03Dzahn [16:08:16] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1018.eqiad.wmnet [16:08:29] (03PS1) 10Daniel Kinzler: rest-gateway: Create metrics mapping for ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) [16:08:39] (03CR) 10CI reject: [V:04-1] rest-gateway: Create metrics mapping for ratelimit service [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) (owner: 10Daniel Kinzler) [16:08:56] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:09:04] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:09:05] ^ fixing soon, expected [16:09:30] (03CR) 10Daniel Kinzler: [C:04-1] rest-gateway: Create metrics mapping for ratelimit service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) (owner: 10Daniel Kinzler) [16:09:56] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:10:04] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:11:50] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [16:12:40] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.252 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [16:14:46] (03PS1) 10Elukey: profile::thanos: remove unnecessary quotes for xlab SLO rec rules [puppet] - 10https://gerrit.wikimedia.org/r/1199012 (https://phabricator.wikimedia.org/T398869) [16:15:29] (03CR) 10Majavah: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [16:22:48] (03CR) 10Dr0ptp4kt: [C:03+1] profile::thanos: remove unnecessary quotes for xlab 
SLO rec rules [puppet] - 10https://gerrit.wikimedia.org/r/1199012 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [16:26:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.802s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:31:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.027s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:34:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.314s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:35:45] (03PS1) 10Dr0ptp4kt: Use Thanos rules for Pyrra error metrics for xLab [puppet] - 10https://gerrit.wikimedia.org/r/1199023 (https://phabricator.wikimedia.org/T398869) [16:36:39] (03CR) 10Andrew Bogott: [C:03+2] cloudcontrol2010-dev preseed: move to flat partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1199007 (owner: 10Andrew Bogott) [16:38:40] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1019.eqiad.wmnet with reason: reboot [16:39:15] RESOLVED: 
MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.314s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:40:33] (03PS2) 10Dr0ptp4kt: Use Thanos rules for Pyrra error metrics for xLab [puppet] - 10https://gerrit.wikimedia.org/r/1199023 (https://phabricator.wikimedia.org/T398869) [16:48:02] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [16:50:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:59] (03CR) 10Elukey: [C:03+2] profile::thanos: remove unnecessary quotes for xlab SLO rec rules [puppet] - 10https://gerrit.wikimedia.org/r/1199012 (https://phabricator.wikimedia.org/T398869) (owner: 10Elukey) [16:54:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:55:54] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy5001.eqsin.wmnet [16:55:56] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [16:56:12] 
(03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198949 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [16:59:27] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5001.eqsin.wmnet - dzahn@cumin2002" [16:59:33] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5001.eqsin.wmnet - dzahn@cumin2002" [16:59:33] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:59:33] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy5001.eqsin.wmnet on all recursors [16:59:36] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy5001.eqsin.wmnet on all recursors [17:00:00] (03PS1) 10Kosta Harlan: CheckUser: Enable SI on metawiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) [17:00:05] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5001.eqsin.wmnet - dzahn@cumin2002" [17:00:05] swfrench-wmf: MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T1700). Please do the needful. [17:00:05] ryankemper: That opportune time for a Wikidata Query Service weekly deploy deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T1700). 
[17:00:10] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5001.eqsin.wmnet - dzahn@cumin2002" [17:00:16] o/ [17:03:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196466 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:03:29] dzahn@cumin2002 makevm (PID 1290263) is awaiting input [17:03:51] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy5001.eqsin.wmnet with OS trixie [17:03:58] (03Merged) 10jenkins-bot: Enroll 5% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196466 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:04:16] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1196466|Enroll 5% of client sessions in PHP 8.3 (T405955)]] [17:04:23] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:06:29] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1196466|Enroll 5% of client sessions in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:09:04] (03PS1) 10DLynch: Edit check: instrument when pastes happen with known sources [extensions/VisualEditor] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199027 (https://phabricator.wikimedia.org/T407302) [17:09:13] !log swfrench@deploy2002 swfrench: Continuing with sync [17:09:21] (03PS3) 10Kamila Součková: admin: add skaramwmf to analytics-private-data-users [puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) [17:09:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/VisualEditor] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199027 (https://phabricator.wikimedia.org/T407302) (owner: 10DLynch) [17:10:06] (03CR) 10CI reject: [V:04-1] admin: add skaramwmf to analytics-private-data-users [puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) (owner: 10Kamila Součková) [17:13:01] (03PS2) 10Aaron Schulz: restgateway: update spec-json-wikimedia to use www prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198406 (https://phabricator.wikimedia.org/T396805) [17:13:56] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudrabbit2003-dev.codfw.wmnet with OS trixie [17:14:20] (03PS1) 10Kamila Součková: admin: fix duplicate user entry for jmoore111 [puppet] - 10https://gerrit.wikimedia.org/r/1199029 [17:15:01] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196466|Enroll 5% of client sessions in PHP 8.3 (T405955)]] (duration: 10m 44s) [17:15:14] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:15:38] (03CR) 10Dzahn: [C:03+1] admin: fix duplicate user entry for jmoore111 [puppet] - 10https://gerrit.wikimedia.org/r/1199029 (owner: 10Kamila Součková) [17:16:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in 
the [Monday, October 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198405 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [17:16:39] (03CR) 10Kamila Součková: [C:03+2] admin: fix duplicate user entry for jmoore111 [puppet] - 10https://gerrit.wikimedia.org/r/1199029 (owner: 10Kamila Součková) [17:17:32] (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196467 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:17:34] (03CR) 10Scott French: [C:03+2] mw-(api-int|jobrunner): Serve ~ 1% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196467 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:19:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:10] (03Merged) 10jenkins-bot: mw-(api-int|jobrunner): Serve ~ 1% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196467 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:20:25] (03PS4) 10Kamila Součková: admin: add skaramwmf to analytics-private-data-users [puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) [17:21:05] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:21:20] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:21:39] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [17:22:38] (03CR) 10Aaron Schulz: [DNM] rest-gateway: map restbase sandbox URLs to Special:RestSandbox/wmf-restbase (031 
comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1190753 (https://phabricator.wikimedia.org/T396807) (owner: 10Aaron Schulz) [17:24:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:26:44] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:27:08] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host lvs1019.eqiad.wmnet [17:28:09] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:28:22] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:28:41] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage [17:28:55] (03PS1) 10Bking: opensearch-operator: Update image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199030 (https://phabricator.wikimedia.org/T404874) [17:28:56] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:30:08] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1019.eqiad.wmnet [17:30:48] (03PS3) 10Aaron Schulz: Route transform/wikitext/to/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1194996 (https://phabricator.wikimedia.org/T385066) [17:30:49] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is CRITICAL: CRITICAL: Service pybal.service is not active. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [17:30:50] (03CR) 10Dzahn: [C:03+1] admin: add skaramwmf to analytics-private-data-users [puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) (owner: 10Kamila Součková) [17:30:57] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:31:03] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:31:14] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage [17:32:57] RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:33:05] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:33:37] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:34:50] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:34:59] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:35:08] (03PS16) 10CDobbins: dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) [17:35:19] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:35:26] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:35:31] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2003-dev.codfw.wmnet with reason: host reimage [17:35:44] !log swfrench@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/mw-jobrunner: apply [17:36:56] (03CR) 10Ebernhardson: [C:04-1] "This should be ready, but it shouldn't be merged prior to verifying the first production generation of the dataset works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/1184585 (https://phabricator.wikimedia.org/T366248) (owner: 10Ebernhardson) [17:39:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:40:12] (03PS1) 10Aaron Schulz: Route /page/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1199032 (https://phabricator.wikimedia.org/T384216) [17:40:21] (03PS1) 10Aaron Schulz: Route /page/lint(.*) to the gateway on group0 [puppet] - 10https://gerrit.wikimedia.org/r/1199033 (https://phabricator.wikimedia.org/T384216) [17:40:25] (03PS1) 10Aaron Schulz: Route /page/lint(.*) to the gateway on group1 [puppet] - 10https://gerrit.wikimedia.org/r/1199034 (https://phabricator.wikimedia.org/T384216) [17:40:32] (03PS1) 10Aaron Schulz: Route /page/lint(.*) to the gateway on all wikis [puppet] - 10https://gerrit.wikimedia.org/r/1199035 (https://phabricator.wikimedia.org/T384216) [17:40:37] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:40:46] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:40:49] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs1019 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. 
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [17:40:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:41:04] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:44:01] (03PS2) 10Matthias Mullie: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198922 [17:44:35] (03CR) 10MusikAnimal: CodeMirrorWikiEditor: fix selector usurping WikiEditor's search btn [extensions/CodeMirror] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198425 (https://phabricator.wikimedia.org/T404543) (owner: 10MusikAnimal) [17:45:27] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:57] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7455/console" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [17:47:18] (03CR) 10Ssingh: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [17:47:40] (03CR) 10CDobbins: [V:03+1 C:03+2] dnsrecursor: use config dir instead of standalone file [puppet] - 10https://gerrit.wikimedia.org/r/1195259 (https://phabricator.wikimedia.org/T389333) (owner: 10CDobbins) [17:49:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:55:21] !log andrew@cumin2002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host cloudrabbit2003-dev.codfw.wmnet with OS trixie [17:55:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy5001.eqsin.wmnet with reason: host reimage [17:55:43] !log ammarpad@deploy2002 mwscript-k8s job started: refreshImageMetadata.php --start=2018-05-11_Joensuu_station_4.jpg --end=2018-05-11_Joensuu_station_4.jpg --wiki=commonswiki --force # T223051 [17:55:47] T223051: EXIF location data of image file not imported to MediaWiki File information (Upload with UploadWizard) - https://phabricator.wikimedia.org/T223051 [17:58:41] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy5001.eqsin.wmnet with reason: host reimage [17:59:22] (03CR) 10Brouberol: [C:03+1] opensearch-operator: Update image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199030 (https://phabricator.wikimedia.org/T404874) (owner: 10Bking) [18:05:28] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:06:23] (03CR) 10BCornwall: [C:03+1] dmarc: add dmarc monitoring records to more domains [dns] - 10https://gerrit.wikimedia.org/r/1198598 (https://phabricator.wikimedia.org/T404884) (owner: 10JHathaway) [18:09:34] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS trixie [18:11:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by musikanimal@deploy2002 using scap backport" [extensions/CodeMirror] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198425 (https://phabricator.wikimedia.org/T404543) (owner: 10MusikAnimal) [18:18:41] (03Merged) 10jenkins-bot: CodeMirrorWikiEditor: fix selector usurping WikiEditor's search btn [extensions/CodeMirror] 
(wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198425 (https://phabricator.wikimedia.org/T404543) (owner: 10MusikAnimal) [18:19:03] !log musikanimal@deploy2002 Started scap sync-world: Backport for [[gerrit:1198425|CodeMirrorWikiEditor: fix selector usurping WikiEditor's search btn (T404543)]] [18:19:08] T404543: Make search and CodeMirror settings buttons more accessible in WikiEditor - https://phabricator.wikimedia.org/T404543 [18:19:11] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy5001.eqsin.wmnet with OS trixie [18:19:11] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy5001.eqsin.wmnet [18:20:28] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:22:09] (03CR) 10BCornwall: "[nit] The param descriptions vary: One specifically mentions browser detection (which may not be the only thing in private in future) and " [puppet] - 10https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [18:22:27] (03CR) 10Bking: [C:03+2] opensearch-operator: Update image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199030 (https://phabricator.wikimedia.org/T404874) (owner: 10Bking) [18:23:02] !log musikanimal@deploy2002 musikanimal: Backport for [[gerrit:1198425|CodeMirrorWikiEditor: fix selector usurping WikiEditor's search btn (T404543)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [18:24:00] !log musikanimal@deploy2002 musikanimal: Continuing with sync [18:26:45] lemme know when that's done, i've got a backport for ReaderExperiments to slip in if possible. it'll include i18n updates so will be a slow sync though. 
[18:27:18] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy5002.eqsin.wmnet [18:27:20] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [18:28:32] !log musikanimal@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198425|CodeMirrorWikiEditor: fix selector usurping WikiEditor's search btn (T404543)]] (duration: 09m 29s) [18:28:37] T404543: Make search and CodeMirror settings buttons more accessible in WikiEditor - https://phabricator.wikimedia.org/T404543 [18:28:53] (03PS10) 10Ssingh: varnish: only use private files when the private repo is available [puppet] - 10https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [18:29:20] (03CR) 10Ssingh: "Yeah, makes sense, I have updated it." [puppet] - 10https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [18:29:44] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7458/co" [puppet] - 10https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [18:30:41] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5002.eqsin.wmnet - dzahn@cumin2002" [18:31:25] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy5002.eqsin.wmnet - dzahn@cumin2002" [18:31:25] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:31:25] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy5002.eqsin.wmnet on all recursors [18:31:29] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy5002.eqsin.wmnet on all recursors [18:31:58] 
!log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5002.eqsin.wmnet - dzahn@cumin2002" [18:32:03] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy5002.eqsin.wmnet - dzahn@cumin2002" [18:33:23] (03PS2) 10Jdlrobson: WIP: Deploy dark mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184564 (https://phabricator.wikimedia.org/T395628) [18:34:04] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy5002.eqsin.wmnet with OS trixie [18:37:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198922 (owner: 10Matthias Mullie) [18:38:50] this is gonna be a slow one cause some i18n updates [18:38:54] :D [18:40:10] (03Merged) 10jenkins-bot: Squashed diff to master [extensions/ReaderExperiments] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1198922 (owner: 10Matthias Mullie) [18:40:19] .. uh why is that being backported? [18:40:27] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1198922|Squashed diff to master]] [18:42:34] taavi: oops that should say "from master" [18:42:56] sorry for letting the confusing message stand :D [18:43:11] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:48:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[18:55:06] (03PS3) 10Jasmine: taskgen: Update calico IPPool check [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert) [18:55:06] (03PS6) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) [19:01:58] (03CR) 10Jasmine: wikikube: Add wikikube-worker2[248-330] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [19:03:01] FIRING: [2x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:04:18] (03CR) 10BCornwall: [C:03+1] varnish: only use private files when the private repo is available [puppet] - 10https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [19:05:45] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudrabbit2001-dev.codfw.wmnet with OS trixie [19:06:13] !log bvibber@deploy2002 bvibber, mlitn: Backport for [[gerrit:1198922|Squashed diff to master]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:06:43] woohoo testing [19:07:20] !log bvibber@deploy2002 bvibber, mlitn: Continuing with sync [19:11:44] (03PS11) 10Ssingh: varnish: only use private files when the private repo is available [puppet] - 10https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [19:11:54] (03CR) 10Ssingh: "No code change, fixing host in commit message." 
[puppet] - 10https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [19:13:17] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7459/co" [puppet] - 10https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [19:14:06] !log sudo cumin "A:cp" "disable-puppet 'merging CR 1198424'" [19:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:07] (03PS12) 10Scott French: P:cache::haproxy: move x_requestctl setup into listen section [puppet] - 10https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) [19:15:27] (03PS1) 10Scott French: Enroll 10% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199048 (https://phabricator.wikimedia.org/T405955) [19:15:28] (03PS1) 10Scott French: mw-(api-int|jobrunner): Serve 5% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199047 (https://phabricator.wikimedia.org/T405955) [19:15:50] (03CR) 10Ssingh: [V:03+1 C:03+2] varnish: only use private files when the private repo is available [puppet] - 10https://gerrit.wikimedia.org/r/1198424 (https://phabricator.wikimedia.org/T404826) (owner: 10Giuseppe Lavagetto) [19:19:23] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudrabbit2002-dev.codfw.wmnet with OS trixie [19:21:39] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198922|Squashed diff to master]] (duration: 41m 12s) [19:22:23] done! 
[19:24:21] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage [19:24:47] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [19:25:41] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy5002.eqsin.wmnet with reason: host reimage [19:27:37] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1193276 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [19:27:45] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2001-dev.codfw.wmnet with reason: host reimage [19:29:22] (03PS11) 10Scott French: P:cache::haproxy: introduce known-client DSL fragment [puppet] - 10https://gerrit.wikimedia.org/r/1196543 (https://phabricator.wikimedia.org/T403220) [19:30:27] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:31:33] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy5002.eqsin.wmnet with reason: host reimage [19:33:01] RESOLVED: [2x] ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_internal_main_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:41] (03PS2) 10BCornwall: varnish: Promote new m-dot redirect from 302/307 to 301/308 [puppet] - 10https://gerrit.wikimedia.org/r/1198429 
(https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [19:33:41] (03PS3) 10BCornwall: varnish: Remove temporary enable_m_redir flag [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [19:33:58] (03CR) 10Scott French: "Thanks in advance for the review, Fabrizio!" [puppet] - 10https://gerrit.wikimedia.org/r/1196543 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [19:34:25] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy6001.drmrs.wmnet [19:34:27] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [19:35:18] (03CR) 10CI reject: [V:04-1] varnish: Remove temporary enable_m_redir flag [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [19:36:57] (03PS1) 10Ebernhardson: cirrus: Start near match A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) [19:37:31] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage [19:37:43] (03CR) 10CI reject: [V:04-1] cirrus: Start near match A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [19:38:41] (03PS2) 10Ebernhardson: cirrus: Start near match A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) [19:39:19] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy6001.drmrs.wmnet - dzahn@cumin2002" [19:41:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy6001.drmrs.wmnet - dzahn@cumin2002" [19:41:03] !log dzahn@cumin2002 END 
(PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:41:03] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy6001.drmrs.wmnet on all recursors [19:41:07] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy6001.drmrs.wmnet on all recursors [19:41:31] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy6001.drmrs.wmnet - dzahn@cumin2002" [19:41:36] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy6001.drmrs.wmnet - dzahn@cumin2002" [19:41:53] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313552 (10Dzahn) [19:42:18] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313554 (10Dzahn) [19:43:16] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy6001.drmrs.wmnet with OS trixie [19:43:29] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313558 (10Dzahn) [19:43:30] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313559 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy600... 
[19:44:11] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudrabbit2002-dev.codfw.wmnet with reason: host reimage [19:46:03] 10ops-eqiad, 06SRE, 06DC-Ops: asw2-a4-eqiad:PEM 1 is not powered - https://phabricator.wikimedia.org/T401886#11313575 (10VRiley-WMF) Confirmed with Juniper support, it should be arriving there October 31st. Once they receive it, we should be getting a replacement back. [19:47:45] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2001-dev.codfw.wmnet with OS trixie [19:49:51] (03PS3) 10Ssingh: varnishtest: allow including stubs for files from volatile [puppet] - 10https://gerrit.wikimedia.org/r/1198432 (owner: 10Giuseppe Lavagetto) [19:50:20] (03CR) 10CI reject: [V:04-1] varnishtest: allow including stubs for files from volatile [puppet] - 10https://gerrit.wikimedia.org/r/1198432 (owner: 10Giuseppe Lavagetto) [19:50:49] cccccbukvgbcrdclgvdtkdehhdirlndilkbhudbvtvdf [19:52:26] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy5002.eqsin.wmnet with OS trixie [19:52:26] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy5002.eqsin.wmnet [19:52:42] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313591 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy5002.eq...
[19:53:24] (03CR) 10Ssingh: varnish: Remove temporary enable_m_redir flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [19:54:37] (03PS1) 10Cathal Mooney: Nokia: always set system cpm packet filter on devices [homer/public] - 10https://gerrit.wikimedia.org/r/1199056 (https://phabricator.wikimedia.org/T402577) [19:55:56] (03PS4) 10BCornwall: varnish: Remove temporary enable_m_redir flag [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [19:56:03] (03CR) 10BCornwall: varnish: Remove temporary enable_m_redir flag (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [19:57:43] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11313597 (10VRiley-WMF) Checked to see if there was light coming out of the fiber and confirmed that there is. I have swapped out the QSFP to see if that may be the issue... [19:58:29] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy4002.ulsfo.wmnet [19:58:32] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T2000). [20:00:04] cjming, kostajh, kemayo, and AaronSchulz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] o/ [20:00:23] I can deploy my own if needed. 
[20:01:03] Hi [20:01:08] o/ [20:01:09] (03PS4) 10Ssingh: varnishtest: allow including stubs for files from volatile [puppet] - 10https://gerrit.wikimedia.org/r/1198432 (owner: 10Giuseppe Lavagetto) [20:01:24] i can deploy for anyone who isn't able to self-deploy [20:01:27] Mine is a no-op and could go out with another config patch [20:01:38] (03CR) 10CI reject: [V:04-1] varnishtest: allow including stubs for files from volatile [puppet] - 10https://gerrit.wikimedia.org/r/1198432 (owner: 10Giuseppe Lavagetto) [20:02:08] kostajh: i can do yours and mine together - mine is noop too [20:02:28] sounds good! [20:02:31] then i'll pass over to Kemayo [20:02:43] Works for me. [20:02:47] starting 1st 2 patches in queue now [20:03:16] (03PS2) 10Clare Ming: Add config for xLab MW Module experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198591 (https://phabricator.wikimedia.org/T401705) [20:03:26] (03PS2) 10Kosta Harlan: hCaptcha: Define HCaptchaSiteKey in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198949 (https://phabricator.wikimedia.org/T405586) [20:03:27] hi! who has been working with "sre-test1006"? 
[20:03:35] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudrabbit2002-dev.codfw.wmnet with OS trixie [20:04:04] FIRING: CoreBGPDown: Core BGP session down between ssw1-f1-eqiad and ssw1-d8-eqiad (10.64.147.20) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=ssw1-f1-eqiad:9804&var-bgp_group=core&var-bgp_neighbor=ssw1-d8-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:04:10] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [20:04:12] (03PS5) 10Ssingh: varnishtest: allow including stubs for files from volatile [puppet] - 10https://gerrit.wikimedia.org/r/1198432 (owner: 10Giuseppe Lavagetto) [20:04:18] dzahn@cumin2002 makevm (PID 1339403) is awaiting input [20:04:25] topranks: I have a DNS change in my diff [20:04:31] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198591 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [20:04:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198949 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [20:04:38] mutante: yep please go ahead [20:04:39] changes the IP of sre-test1006 [20:04:41] thanks yeah [20:04:44] ok, applying [20:05:06] didn't know if you'd see it, I've the cookbook running I'll cancel mine (it is waiting on lock) [20:05:19] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy4002.ulsfo.wmnet - dzahn@cumin2002" [20:05:30] gotcha! 
deployed to DNS servers [20:05:32] (03Merged) 10jenkins-bot: Add config for xLab MW Module experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198591 (https://phabricator.wikimedia.org/T401705) (owner: 10Clare Ming) [20:05:34] you should see it now [20:05:35] (03Merged) 10jenkins-bot: hCaptcha: Define HCaptchaSiteKey in CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198949 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [20:05:38] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy4002.ulsfo.wmnet - dzahn@cumin2002" [20:05:39] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:05:39] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy4002.ulsfo.wmnet on all recursors [20:05:42] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy4002.ulsfo.wmnet on all recursors [20:05:56] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1198591|Add config for xLab MW Module experiment (T401705)]], [[gerrit:1198949|hCaptcha: Define HCaptchaSiteKey in CommonSettings.php (T405586)]] [20:06:02] T401705: Implement debugging for events in the Javascript SDK - https://phabricator.wikimedia.org/T401705 [20:06:02] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [20:06:06] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy4002.ulsfo.wmnet - dzahn@cumin2002" [20:06:11] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy4002.ulsfo.wmnet - dzahn@cumin2002" [20:06:46] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for 
host sretest1006.eqiad.wmnet with OS trixie [20:06:53] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:06:59] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy4002.ulsfo.wmnet with OS trixie [20:06:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11313615 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host sretest1006.eqiad.wm... [20:07:12] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313616 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy400... [20:08:18] cccccbukvgbcrvicklklhbvhhhhvvgkjhbklindcdglc [20:08:31] (03CR) 10BCornwall: [V:03+2 C:03+2] "I do not like this new paradigm, nor do I like the further entrenchment of an already-overcomplicated test environment, but it does fix te" [puppet] - 10https://gerrit.wikimedia.org/r/1198432 (owner: 10Giuseppe Lavagetto) [20:10:01] !log cjming@deploy2002 kharlan, cjming: Backport for [[gerrit:1198591|Add config for xLab MW Module experiment (T401705)]], [[gerrit:1198949|hCaptcha: Define HCaptchaSiteKey in CommonSettings.php (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:10:44] (03PS3) 10BCornwall: varnish: Promote new m-dot redirect from 302/307 to 301/308 [puppet] - 10https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [20:10:44] (03PS5) 10BCornwall: varnish: Remove temporary enable_m_redir flag [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [20:11:19] !log cjming@deploy2002 kharlan, cjming: Continuing with sync [20:14:34] (03PS1) 10Cwhite: site: add logging-sd hosts insetup [puppet] - 10https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) [20:14:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [20:15:04] I forgot I had another patch for this window, and have added it now [20:15:35] cjming: can you let me know when it's on mwdebug, please? [20:15:49] ah, it's already gone forward [20:16:01] kostajh: is that ok? [20:16:04] sorry [20:16:19] yeah it's not a problem [20:16:24] do you want me to do your follow up patch after these 1st 2 are done? [20:16:28] it *should* be a no-op :) [20:16:49] Well, I don't want to jump the queue. If it's ok with the others, though, then sure [20:17:14] should be quick - do you need to test before syncing? 
[20:17:40] cjming: let me check something before I answer [20:19:07] I need to figure out how to create the database tables that are needed by the patch [20:19:21] (03PS1) 10Cathal Mooney: Fix typo on peer IP on ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1199064 [20:21:27] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198591|Add config for xLab MW Module experiment (T401705)]], [[gerrit:1198949|hCaptcha: Define HCaptchaSiteKey in CommonSettings.php (T405586)]] (duration: 15m 31s) [20:21:33] T401705: Implement debugging for events in the Javascript SDK - https://phabricator.wikimedia.org/T401705 [20:21:33] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [20:21:36] (03CR) 10Cathal Mooney: [C:03+2] Fix typo on peer IP on ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1199064 (owner: 10Cathal Mooney) [20:22:03] kostajh: your 1st patch should be live - do you want me to proceed with your 2nd patch? [20:22:23] cjming: no, I need to create database tables before the 2nd patch could work [20:22:28] so, I'll reschedule it for tomorrow [20:22:34] sounds good [20:22:38] Kemayo: all yours! [20:22:43] cjming: Thanks! 
[20:22:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199027 (https://phabricator.wikimedia.org/T407302) (owner: 10DLynch) [20:23:52] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy6001.drmrs.wmnet with reason: host reimage [20:25:27] RESOLVED: CoreBGPDown: Core BGP session down between ssw1-f1-eqiad and ssw1-d8-eqiad (10.64.147.20) - group core - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=ssw1-f1-eqiad:9804&var-bgp_group=core&var-bgp_neighbor=ssw1-d8-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:28:30] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy4002.ulsfo.wmnet with reason: host reimage [20:29:16] !log cmooney@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1006.eqiad.wmnet with reason: host reimage [20:29:23] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy6001.drmrs.wmnet with reason: host reimage [20:29:32] (03PS2) 10Cwhite: site: initial setup for new logging-sd hosts [puppet] - 10https://gerrit.wikimedia.org/r/1199062 (https://phabricator.wikimedia.org/T406796) [20:32:06] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#11313691 (10aaron) >>! In T328872#10873913, @Ladsgroup wrote: > I'm not sure... 
[20:32:18] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1006.eqiad.wmnet with reason: host reimage [20:34:14] (03Merged) 10jenkins-bot: Edit check: instrument when pastes happen with known sources [extensions/VisualEditor] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199027 (https://phabricator.wikimedia.org/T407302) (owner: 10DLynch) [20:34:34] !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1199027|Edit check: instrument when pastes happen with known sources (T407302)]] [20:34:44] T407302: Add logging to detect pastes that would not cause Paste Check to activate as currently configured - https://phabricator.wikimedia.org/T407302 [20:35:13] (03PS2) 10Kosta Harlan: CheckUser: Enable SI on metawiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) [20:35:29] does anyone here know how to create the extension1 tables for a MW extension? [20:35:44] https://wikitech.wikimedia.org/wiki/Creating_new_tables#Deployment doesn't mention extension1 [20:36:10] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy4002.ulsfo.wmnet with reason: host reimage [20:36:55] !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1199027|Edit check: instrument when pastes happen with known sources (T407302)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:38:35] !log kemayo@deploy2002 kemayo: Continuing with sync [20:38:47] kostajh: the sql scripts have a --cluster param. Did the tables get DBA approval? [20:39:19] AaronSchulz: yes. They exist for several wikis already. 
[20:42:01] (03PS1) 10Andrew Bogott: Rabbitmq: add new config for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) [20:42:28] (03CR) 10CI reject: [V:04-1] Rabbitmq: add new config for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [20:42:48] AaronSchulz: so I should use `sql.php` and not `createExtensionTables.php` ? [20:43:39] if so, then I think the command I need to run is `php maintenance/mysql.php --cluster extension1 --wiki loginwiki ./extensions/CheckUser/schema/mysql/tables-virtual-checkuser-generated.sql` [20:44:50] !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199027|Edit check: instrument when pastes happen with known sources (T407302)]] (duration: 10m 16s) [20:44:56] T407302: Add logging to detect pastes that would not cause Paste Check to activate as currently configured - https://phabricator.wikimedia.org/T407302 [20:45:25] Okay, it’s free for any remaining patches now. [20:45:53] I can do my config one then. [20:46:26] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy6001.drmrs.wmnet with OS trixie [20:46:26] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy6001.drmrs.wmnet [20:46:41] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy6001.dr... 
[20:47:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by aaron@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198405 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [20:48:27] (03Merged) 10jenkins-bot: Move rest_v1-wikimedia.json under the wwwportal directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198405 (https://phabricator.wikimedia.org/T396805) (owner: 10Aaron Schulz) [20:48:45] !log aaron@deploy2002 Started scap sync-world: Backport for [[gerrit:1198405|Move rest_v1-wikimedia.json under the wwwportal directory (T396805)]] [20:48:50] T396805: Define static OpenAPI specs per API family for RESTbase endpoints - https://phabricator.wikimedia.org/T396805 [20:49:48] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy6002.drmrs.wmnet [20:49:50] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:50:49] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313725 (10Dzahn) [20:51:02] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1003" [20:52:32] !log aaron@deploy2002 aaron: Backport for [[gerrit:1198405|Move rest_v1-wikimedia.json under the wwwportal directory (T396805)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:53:03] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11313736 (10Dzahn) Hi @Neslihan_Turan_WMDE I took over access requests for this week. We can do the SSH key confirmation to move this forward. Unless you have a... 
[20:53:20] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1003" [20:53:21] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1006.eqiad.wmnet with OS trixie [20:53:24] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11313739 (10Dzahn) 05In progress→03Stalled [20:53:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Eqiad C/D refresh: 2 x test hosts for config validation - https://phabricator.wikimedia.org/T405560#11313740 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host sretest1006.eqiad.wmnet... [20:54:01] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy4002.ulsfo.wmnet with OS trixie [20:54:01] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy4002.ulsfo.wmnet [20:54:13] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313743 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy4002.ul... 
[20:54:25] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11313747 (10Dzahn) a:05KFrancis→03Dzahn [20:54:52] !log aaron@deploy2002 aaron: Continuing with sync [20:55:08] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy6002.drmrs.wmnet - dzahn@cumin2002" [20:55:13] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy6002.drmrs.wmnet - dzahn@cumin2002" [20:55:13] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:55:14] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy6002.drmrs.wmnet on all recursors [20:55:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy6002.drmrs.wmnet on all recursors [20:55:27] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:55:42] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy6002.drmrs.wmnet - dzahn@cumin2002" [20:55:47] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy6002.drmrs.wmnet - dzahn@cumin2002" [20:56:44] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy6002.drmrs.wmnet with OS trixie [20:56:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - 
https://phabricator.wikimedia.org/T408064#11313752 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy600... [20:57:37] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy1002.eqiad.wmnet [20:57:39] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [20:58:14] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11313754 (10Dzahn) [20:59:43] !log cmooney@cumin1003 START - Cookbook sre.hosts.dhcp for host sretest1006.eqiad.wmnet [20:59:53] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest1006.eqiad.wmnet [21:00:05] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T2100). [21:00:23] !log cmooney@cumin1003 START - Cookbook sre.hosts.dhcp for host sretest1006.eqiad.wmnet [21:00:55] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11313764 (10Dzahn) [21:01:08] !log aaron@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198405|Move rest_v1-wikimedia.json under the wwwportal directory (T396805)]] (duration: 12m 22s) [21:01:12] T396805: Define static OpenAPI specs per API family for RESTbase endpoints - https://phabricator.wikimedia.org/T396805 [21:01:57] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [21:01:59] * AaronSchulz is done [21:02:25] (03PS1) 10BCornwall: varnishtest: Remove logfile support [puppet] - 10https://gerrit.wikimedia.org/r/1199068 (https://phabricator.wikimedia.org/T408202) [21:02:52] (03CR) 10CI reject: [V:04-1] varnishtest: Remove logfile support [puppet] - 
10https://gerrit.wikimedia.org/r/1199068 (https://phabricator.wikimedia.org/T408202) (owner: 10BCornwall) [21:03:19] dzahn@cumin2002 makevm (PID 1354352) is awaiting input [21:03:27] cmooney@cumin1003 dhcp (PID 285314) is awaiting input [21:03:45] (03PS2) 10BCornwall: varnishtest: Remove logfile support [puppet] - 10https://gerrit.wikimedia.org/r/1199068 (https://phabricator.wikimedia.org/T408202) [21:05:56] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11313777 (10Dzahn) Thanks Katie! Taking over. I sent an email to Sean to verify the key out of band as the next step. [21:06:26] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy1002.eqiad.wmnet - dzahn@cumin2002" [21:06:31] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy1002.eqiad.wmnet - dzahn@cumin2002" [21:06:31] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:06:32] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy1002.eqiad.wmnet on all recursors [21:06:35] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy1002.eqiad.wmnet on all recursors [21:07:10] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy1002.eqiad.wmnet - dzahn@cumin2002" [21:07:14] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy1002.eqiad.wmnet - dzahn@cumin2002" [21:07:19] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11313783 
(10Dzahn) 05Open→03Stalled Hi @Volker_E I am processing access requests this week. Feel free to reach out if you have any questions. Right now this one is waiting for you. Cheers,... [21:08:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy1002.eqiad.wmnet with OS trixie [21:08:34] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11313787 (10Dzahn) 05Open→03Resolved This seems confirmed as resolved... [21:08:47] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11313789 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy100... [21:08:49] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [21:08:51] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Requesting access to Superset, Turnilo, Spark, Presto, Hive, Hadoop, Jupyter for Jmoore111 - https://phabricator.wikimedia.org/T408164#11313790 (10Dzahn) a:05BTullis→03None [21:09:24] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11313792 (10Dzahn) a:03Neslihan_Turan_WMDE [21:10:27] (03CR) 10Dzahn: [C:03+2] "all requirements are checked off. 
change looks good, had previous +1, taking over access requests this week" [puppet] - 10https://gerrit.wikimedia.org/r/1198325 (https://phabricator.wikimedia.org/T407094) (owner: 10Kamila Součková) [21:11:57] jouncebot: nowandnext [21:11:58] For the next 1 hour(s) and 48 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T2100) [21:11:58] In 1 hour(s) and 48 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T2300) [21:12:24] I will be doing a security deploy today [21:12:38] kostajh: is the table just for one wiki/db? [21:12:54] (03PS4) 10Krinkle: varnish: Promote new m-dot redirect from 302/307 to 301/308 [puppet] - 10https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) [21:12:55] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198429 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [21:13:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11313804 (10VRiley-WMF) 2303045001218 - port 32 2303045001219 - port 32 [21:13:09] (03PS6) 10BCornwall: varnish: Remove temporary enable_m_redir flag [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [21:13:13] (03PS7) 10Krinkle: varnish: Remove temporary enable_m_redir flag [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) [21:13:14] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [21:13:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11313805 (10VRiley-WMF) [21:14:20] (03PS1) 10Ryan Kemper: wdqs: fix url for https://agrovoc.fao.org/sparql 
[puppet] - 10https://gerrit.wikimedia.org/r/1199070 (https://phabricator.wikimedia.org/T407412) [21:15:15] (03CR) 10Bking: [C:03+1] wdqs: fix url for https://agrovoc.fao.org/sparql [puppet] - 10https://gerrit.wikimedia.org/r/1199070 (https://phabricator.wikimedia.org/T407412) (owner: 10Ryan Kemper) [21:15:20] (03CR) 10Ryan Kemper: [C:03+2] wdqs: fix url for https://agrovoc.fao.org/sparql [puppet] - 10https://gerrit.wikimedia.org/r/1199070 (https://phabricator.wikimedia.org/T407412) (owner: 10Ryan Kemper) [21:17:08] ACKNOWLEDGEMENT - Dell PowerEdge or Supermicro Broadcom RAID Controller on an-worker1203 is CRITICAL: communication: 0 OK : controller: 1 Needs Attention : physical_disk: 1 Failed : virtual_disk: 1 OfLn : bbu: 0 OK : enclosure: 0 OK : CLI Version = 007.1910.0000.0000 Oct 08, 2021 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T408446 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [21:17:17] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408446 (10ops-monitoring-bot) 03NEW [21:19:38] (03PS1) 10Bking: wdqs: Add new endpoints to allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1199071 (https://phabricator.wikimedia.org/T402891) [21:21:19] (03PS2) 10Ryan Kemper: wdqs: Add new endpoints to allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1199071 (https://phabricator.wikimedia.org/T407373) (owner: 10Bking) [21:21:59] preparing to run scap [21:22:43] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy1002.eqiad.wmnet with reason: host reimage [21:23:56] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy6002.drmrs.wmnet with reason: host reimage [21:24:00] (03CR) 10Ryan Kemper: [C:03+1] wdqs: Add new endpoints to allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1199071 (https://phabricator.wikimedia.org/T407373) (owner: 10Bking) [21:24:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service 
on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:16] (03CR) 10Bking: [C:03+2] wdqs: Add new endpoints to allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1199071 (https://phabricator.wikimedia.org/T407373) (owner: 10Bking) [21:25:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:26:04] apparently scap is being updated [21:26:17] anyone know anything about this? [21:26:33] maryum: that's me. one second... [21:26:45] cool [21:26:59] scap is running now, all good [21:27:37] maryum: all yours. I typed the command to start the deploy and then checked here before telling it to proceed. Safety check accomplished on all sides I guess. :) [21:27:51] awesome [21:28:07] (03PS2) 10Andrew Bogott: Rabbitmq: add new config for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) [21:28:34] (03CR) 10CI reject: [V:04-1] Rabbitmq: add new config for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [21:29:18] maryum: I have a relatively trivial update to spiderpig to push out once you are done with the security window. If you remember I'd appreciate a ping. 
[21:29:32] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy1002.eqiad.wmnet with reason: host reimage [21:29:47] bd808 I will definitely let you know when I'm done [21:30:47] 07Puppet, 06SRE, 06Infrastructure-Foundations: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#11313878 (10Krinkle) [21:32:52] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy6002.drmrs.wmnet with reason: host reimage [21:32:58] (03PS3) 10Andrew Bogott: Rabbitmq: add new config for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) [21:33:27] 07Puppet, 06SRE: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564#11313893 (10Krinkle) [21:33:34] (03PS1) 10Bartosz Dziewoński: Make wgVectorMaxWidthOptions specify Special:Userlogin correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199074 (https://phabricator.wikimedia.org/T408447) [21:33:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [21:33:48] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 07Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490#11313897 (10Krinkle) [21:34:46] (03CR) 10Ladsgroup: "Yeah, let me know when we are ready to stop writing to the column and I'll merge and deploy the patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1191090 (https://phabricator.wikimedia.org/T389026) (owner: 10Zabe) [21:35:14] (03PS2) 10Bartosz Dziewoński: Make wgVectorMaxWidthOptions specify Special:Userlogin correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199074 (https://phabricator.wikimedia.org/T408447) [21:36:25] !log Deployed security fix for T385403 [21:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199074 (https://phabricator.wikimedia.org/T408447) (owner: 10Bartosz Dziewoński) [21:36:33] bd808: I'm all done [21:36:44] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11313928 (10Dzahn) 05Open→03Resolved a:03Dzahn @SKaram-WMF @Nahid The access should work now. Feel free to try it out and let us know if there are any issues. [21:37:39] awesome. 
Thanks maryum [21:37:55] !log bd808@deploy2002 Installing scap version "4.217.0" for 2 host(s) [21:38:15] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11313932 (10Dzahn) a:05Raine→03Dzahn [21:38:45] (03CR) 10Dzahn: [C:03+2] admin: add urbanecm to phabricator-admins [puppet] - 10https://gerrit.wikimedia.org/r/1198340 (https://phabricator.wikimedia.org/T408008) (owner: 10Kamila Součková) [21:39:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:42] !log bd808@deploy2002 Installation of scap version "4.217.0" completed for 2 hosts [21:39:48] (03PS4) 10Andrew Bogott: Rabbitmq: add new config for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) [21:41:36] (03PS1) 10Bking: admin_ng (dse-k8s): watch more OpenSearch-related namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199075 (https://phabricator.wikimedia.org/T357753) [21:41:40] (03PS5) 10Andrew Bogott: Rabbitmq: add new config for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) [21:42:22] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [21:42:55] 07Puppet, 06SRE, 06Infrastructure-Foundations: Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254#11314006 (10Krinkle) [21:43:13] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [21:43:21] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11314011 (10Dzahn) @Urbanecm You have 
access to `phab1004.eqiad.wmnet` (currently active) and `phab2002.codfw.wmnet` (currently failover) now. This also made pupp... [21:43:35] (03CR) 10RLazarus: [C:03+2] deployment_server: Add --priority to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196989 (https://phabricator.wikimedia.org/T406212) (owner: 10RLazarus) [21:43:37] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy1002.eqiad.wmnet with OS trixie [21:43:37] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy1002.eqiad.wmnet [21:43:42] (03CR) 10RLazarus: [C:03+2] deployment_server: Add --dangerously_fast to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196990 (https://phabricator.wikimedia.org/T406212) (owner: 10RLazarus) [21:43:49] (03PS5) 10RLazarus: deployment_server: Add --dangerously_fast to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196990 (https://phabricator.wikimedia.org/T406212) [21:43:56] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11314014 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy1002.eq... 
[21:43:59] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to phabricator-admin for urbanecm - https://phabricator.wikimedia.org/T408008#11314015 (10Dzahn) 05Open→03Resolved [21:45:24] (03CR) 10Andrew Bogott: [C:03+2] Rabbitmq: add new config for version 4 [puppet] - 10https://gerrit.wikimedia.org/r/1199066 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [21:45:46] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11314019 (10Dzahn) [21:46:24] (03CR) 10RLazarus: [C:03+2] deployment_server: Add --dangerously_fast to charlie [puppet] - 10https://gerrit.wikimedia.org/r/1196990 (https://phabricator.wikimedia.org/T406212) (owner: 10RLazarus) [21:46:32] !log vriley@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1004 - vriley@cumin1003" [21:46:49] andrewbogott: please do :) [21:47:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt franio1004 - vriley@cumin1003" [21:47:02] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:47:03] oh, there you are :) ok! 
[21:47:17] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host franio1004 [21:47:36] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host franio1004 [21:48:38] (03CR) 10Ryan Kemper: [C:03+1] admin_ng (dse-k8s): watch more OpenSearch-related namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199075 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [21:49:01] (03CR) 10Bking: [C:03+2] admin_ng (dse-k8s): watch more OpenSearch-related namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199075 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [21:49:22] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy3001.esams.wmnet [21:49:24] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [21:50:12] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [21:50:21] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy6002.drmrs.wmnet with OS trixie [21:50:22] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy6002.drmrs.wmnet [21:50:36] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11314034 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy6002.dr... [21:51:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11314036 (10VRiley-WMF) [21:51:41] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[21:51:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11314038 (10VRiley-WMF) [21:55:11] (03PS1) 10Bking: Revert "admin_ng (dse-k8s): watch more OpenSearch-related namespaces" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199076 [21:55:19] (03CR) 10Bking: [V:03+2 C:03+2] Revert "admin_ng (dse-k8s): watch more OpenSearch-related namespaces" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199076 (owner: 10Bking) [21:55:25] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy3001.esams.wmnet - dzahn@cumin2002" [21:55:30] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy3001.esams.wmnet - dzahn@cumin2002" [21:55:30] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:55:31] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy3001.esams.wmnet on all recursors [21:55:34] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy3001.esams.wmnet on all recursors [21:56:07] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy3001.esams.wmnet - dzahn@cumin2002" [21:56:13] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy3001.esams.wmnet - dzahn@cumin2002" [21:56:31] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [21:56:51] borrowing mw-debug in codfw for a few minutes to test an envoy upgrade [21:57:00] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[21:57:43] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:58:06] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:58:42] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy3001.esams.wmnet with OS trixie [22:00:29] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11314126 (10Dwisehaupt) @Jhancock.wm Could you verify the ILO password on this host. I have tried our password and the prod password and neither seemed to work. It should be se... [22:02:08] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy3002.esams.wmnet [22:02:10] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [22:05:27] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:06:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11314158 (10VRiley-WMF) Hey @Jgreen I know we have scripts now to assist with this better. I'm unsure if there are still settings we need to enable for the rest of this install... 
[22:06:54] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy3002.esams.wmnet - dzahn@cumin2002" [22:07:00] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy3002.esams.wmnet - dzahn@cumin2002" [22:07:00] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:07:00] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy3002.esams.wmnet on all recursors [22:07:04] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy3002.esams.wmnet on all recursors [22:07:37] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy3002.esams.wmnet - dzahn@cumin2002" [22:10:42] dzahn@cumin2002 makevm (PID 1373389) is awaiting input [22:11:23] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy3002.esams.wmnet - dzahn@cumin2002" [22:11:52] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy3002.esams.wmnet with OS trixie [22:12:05] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11314263 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy300... 
[22:15:56] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808#11314269 (10RLazarus) Testing this in mw-debug, there are two envoy warnings in the logs on startup: ` [2025-10-27 21:58:03.859][1][warning][main] [source/server/server.cc:852] Usage of the deprecated runtime... [22:16:07] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [22:16:25] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [22:16:54] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on 60 hosts with reason: downtime new nokia devices in case they alert during tests [22:17:00] 06SRE, 06Infrastructure-Foundations, 10netops: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11314270 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1d57495a-c8c9-4142-bb4a-68c98114d4d1) set by cmooney@cumin1003 for 3 d... 
[22:19:05] done with mw-debug [22:20:28] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:20:48] (03PS1) 10RLazarus: mathoid: Upgrade to envoy-future:1.32.12 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199079 (https://phabricator.wikimedia.org/T405808) [22:23:27] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy3001.esams.wmnet with reason: host reimage [22:29:10] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy3001.esams.wmnet with reason: host reimage [22:37:16] (03CR) 10Scott French: [C:03+1] mathoid: Upgrade to envoy-future:1.32.12 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199079 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [22:37:57] (03CR) 10RLazarus: [C:03+2] mathoid: Upgrade to envoy-future:1.32.12 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199079 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [22:39:52] (03Merged) 10jenkins-bot: mathoid: Upgrade to envoy-future:1.32.12 for validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199079 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [22:39:56] (03PS7) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) [22:42:11] (03CR) 10Jasmine: wikikube: Add wikikube-worker2[248-330] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [22:43:54] !log rzl@deploy1003 helmfile [staging] START helmfile.d/services/mathoid: apply [22:44:17] !log rzl@deploy1003 helmfile [staging] DONE helmfile.d/services/mathoid: apply [22:45:28] 
FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:46:30] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mathoid: apply [22:46:31] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy3001.esams.wmnet with OS trixie [22:46:31] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy3001.esams.wmnet [22:46:50] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11314353 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3001.es... [22:46:57] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [22:48:24] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/mathoid: apply [22:48:47] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [22:49:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:54:03] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:55:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251027T2300) [23:00:21] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11314433 (10Dzahn) [23:03:18] !log rzl@apt1002:~$ sudo -i reprepro -C main includedeb bullseye-wikimedia /srv/wikimedia/pool/component/envoy-future/e/envoyproxy/envoyproxy_1.32.12-1_amd64.deb # T405808 [23:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:22] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [23:03:34] !log rzl@apt1002:~$ sudo -i reprepro copy bookworm-wikimedia bullseye-wikimedia envoyproxy # T405808 [23:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:42] !log rzl@apt1002:~$ sudo -i reprepro copy trixie-wikimedia bullseye-wikimedia envoyproxy # T405808 [23:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:15] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy3002.esams.wmnet with OS trixie [23:05:16] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host tcp-proxy3002.esams.wmnet [23:05:29] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11314465 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3002.es... [23:07:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? 
- https://phabricator.wikimedia.org/T406656#11314480 (10bking) Thanks again for your time and patience. You have greatly increased my understanding of the purpose of the netbox... [23:09:47] 06SRE, 10Wikimedia-Mailing-lists: Set up a new mailing list for Wales - https://phabricator.wikimedia.org/T406101#11314502 (10Dzahn) It sounds to me like this is like a user group. I suggest we can follow "Globally established groups/teams can have top level list names, in a style that best matches their need... [23:12:45] (03PS1) 10RLazarus: envoy: Update to v1.32.12 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1199082 (https://phabricator.wikimedia.org/T405808) [23:12:54] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy3002.esams.wmnet with OS trixie [23:13:06] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11314523 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host tcp-proxy300... [23:14:19] (03CR) 10RLazarus: [V:03+2] "`" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1199082 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [23:17:58] (03CR) 10Scott French: [C:03+1] envoy: Update to v1.32.12 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1199082 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [23:18:24] !log ganeti3005 - sudo ssh-keygen -f "/var/lib/ganeti/known_hosts" -R "ganeti03.svc.esams.wmnet" - revoking offending RSA key [23:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:14] (03CR) 10RLazarus: [V:03+2 C:03+2] "Thanks!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1199082 (https://phabricator.wikimedia.org/T405808) (owner: 10RLazarus) [23:27:50] (03PS1) 10RLazarus: {api,rest}-gateway: Update to Envoy 1.32.12 in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199085 (https://phabricator.wikimedia.org/T405808) [23:30:28] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:36:26] 10ops-eqiad, 06DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408478 (10phaultfinder) 03NEW [23:56:43] (03CR) 10RLazarus: [C:03+1] mw-(api-int|jobrunner): Serve 5% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199047 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [23:56:47] (03CR) 10RLazarus: [C:03+1] Enroll 10% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199048 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French)