[00:08:09] !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4049.ulsfo.wmnet [00:08:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1197366 [00:08:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1197366 (owner: 10TrainBranchBot) [00:33:30] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-ulsfo and not P{cp4037*} and A:cp [00:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:00:44] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11292131 (10Krinkle) [01:07:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.24 [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197370 (https://phabricator.wikimedia.org/T405680) [01:08:01] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.24 [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197370 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [01:09:10] (03CR) 10CI reject: [V:04-1] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1197366 (owner: 10TrainBranchBot) [01:24:30] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.24 [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197370 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [01:29:12] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:34:58] (03PS3) 10Krinkle: varnish: Simplify m-dot rewrite and fix m.wikipedia.org bug [puppet] - 10https://gerrit.wikimedia.org/r/1197343 (https://phabricator.wikimedia.org/T405931) [01:34:58] (03PS2) 10Krinkle: varnish: Implement enable_m_redir and enable on test wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) [01:34:58] (03PS1) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197372 (https://phabricator.wikimedia.org/T405931) [01:35:27] (03CR) 10RLazarus: [C:03+1] shellbox: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196771 (owner: 10Scott French) [01:35:56] (03CR) 10CI reject: [V:04-1] varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197372 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [01:36:11] (03PS3) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197341 (https://phabricator.wikimedia.org/T405931) [01:36:11] (03PS4) 10Krinkle: varnish: Simplify m-dot rewrite and fix m.wikipedia.org bug [puppet] - 10https://gerrit.wikimedia.org/r/1197343 (https://phabricator.wikimedia.org/T405931) [01:36:11] (03PS3) 10Krinkle: varnish: Implement enable_m_redir and enable on test wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) [01:36:26] (03PS2) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197372 [01:36:39] (03Abandoned) 10Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite and POST [puppet] - 10https://gerrit.wikimedia.org/r/1197372 (owner: 10Krinkle) [01:39:15] FIRING: SwitchCoreInterfaceDown: Switch core 
interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0200) [02:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:34:05] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [02:38:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [02:39:27] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 1.772 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0300) [03:02:08] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197374 (https://phabricator.wikimedia.org/T405680) [03:02:11] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197374 
(https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [03:02:59] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197374 (https://phabricator.wikimedia.org/T405680) (owner: 10TrainBranchBot) [03:03:44] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.45.0-wmf.24 refs T405680 [03:03:48] T405680: 1.45.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T405680 [03:09:15] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:13:17] (03PS4) 10Krinkle: varnish: Implement enable_m_redir and enable on test wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) [03:43:35] (03PS5) 10Krinkle: varnish: Implement enable_m_redir and enable on test wikis [puppet] - 10https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931) [03:48:05] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:48:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [03:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - 
https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [03:50:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [03:54:12] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:58:05] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:59:12] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0400) [04:01:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:10:46] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.45.0-wmf.24 refs T405680 (duration: 67m 03s) [04:10:51] T405680: 1.45.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T405680 [04:17:20] 06SRE, 06Traffic, 06MediaWiki-Platform-Team (Radar): Have CDN edge set the `X-Request-Id` header for incoming 
external requests - https://phabricator.wikimedia.org/T221976#11292290 (10tstarling) {T407826} may be related. [04:20:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [04:23:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:41:41] fceratto@cumin1003 clone_es (PID 1381498) is awaiting input [04:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:59:12] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:01:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:29:15] FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:39:21] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [05:55:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P84133 and previous config saved to /var/cache/conftool/dbconfig/20251021-055543-root.json [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0600). 
[06:04:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:09:33] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 8.613 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:10:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P84134 and previous config saved to /var/cache/conftool/dbconfig/20251021-061049-root.json [06:13:35] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [06:14:29] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.520 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [06:16:01] (03PS1) 10Marostegui: db1232: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197386 (https://phabricator.wikimedia.org/T407463) [06:16:43] (03CR) 10Marostegui: [C:03+2] db1232: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197386 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [06:17:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1232.eqiad.wmnet with reason: Maintenance [06:17:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1232 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84135 and previous config saved to /var/cache/conftool/dbconfig/20251021-061748-marostegui.json [06:19:14] (03PS1) 10Marostegui: instances.yaml: Add sretest2003 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197387 
(https://phabricator.wikimedia.org/T407352) [06:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:20:06] (03PS2) 10Marostegui: instances.yaml: Add sretest2003 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197387 (https://phabricator.wikimedia.org/T407352) [06:21:06] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add sretest2003 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197387 (https://phabricator.wikimedia.org/T407352) (owner: 10Marostegui) [06:29:15] (03PS1) 10Marostegui: es1028: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197389 (https://phabricator.wikimedia.org/T407720) [06:29:56] (03CR) 10Marostegui: [C:03+2] es1028: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197389 (https://phabricator.wikimedia.org/T407720) (owner: 10Marostegui) [06:31:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1028 from dbctl T407720', diff saved to https://phabricator.wikimedia.org/P84136 and previous config saved to /var/cache/conftool/dbconfig/20251021-063134-marostegui.json [06:31:39] T407720: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720 [06:31:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1232 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84137 and previous config saved to /var/cache/conftool/dbconfig/20251021-063142-root.json [06:31:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 7%: Repooling', diff saved to https://phabricator.wikimedia.org/P84138 and previous config saved to /var/cache/conftool/dbconfig/20251021-063143-root.json [06:32:38] !log Add sretest2003 to dbctl depooled T407352 [06:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:41] T407352: Test config H 1P in 
external store - https://phabricator.wikimedia.org/T407352 [06:33:50] (03PS1) 10Marostegui: mariadb: Remove es1028 [puppet] - 10https://gerrit.wikimedia.org/r/1197390 (https://phabricator.wikimedia.org/T407720) [06:34:08] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1028.eqiad.wmnet [06:34:47] (03CR) 10Marostegui: [C:03+2] mariadb: Remove es1028 [puppet] - 10https://gerrit.wikimedia.org/r/1197390 (https://phabricator.wikimedia.org/T407720) (owner: 10Marostegui) [06:39:50] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [06:44:03] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [06:44:04] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts es1028.eqiad.wmnet [06:44:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1028.eqiad.wmnet [06:44:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [06:46:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 21 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196064 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [06:46:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1232 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84139 and previous config saved to /var/cache/conftool/dbconfig/20251021-064648-root.json [06:46:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P84140 and previous config saved to /var/cache/conftool/dbconfig/20251021-064649-root.json 
[06:48:58] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [06:50:22] 06SRE: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization - https://phabricator.wikimedia.org/T398161#11292475 (10Joe) 05Open→03Resolved [06:52:34] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1028.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:53:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1028.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:53:51] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:53:52] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts es1028.eqiad.wmnet [06:54:19] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720#11292482 (10Marostegui) a:05Marostegui→03None [06:55:39] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720#11292489 (10Marostegui) This is ready for DC-Ops. The first failure was due to some connection glitches I had so I wasn't able to reply to the question three times and hence t... [07:00:05] Amir1, Urbanecm, and awight: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T0700) [07:00:05] edsanders and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:12] o/ [07:00:33] o/ [07:01:00] I can deploy [07:01:00] I can self deploy [07:01:23] edsanders: sure, please go ahead :) [07:01:50] thanks [07:01:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1232 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84141 and previous config saved to /var/cache/conftool/dbconfig/20251021-070154-root.json [07:01:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P84142 and previous config saved to /var/cache/conftool/dbconfig/20251021-070155-root.json [07:02:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by esanders@deploy2002 using scap backport" [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [07:07:37] (03PS1) 10Marostegui: db2246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197470 (https://phabricator.wikimedia.org/T406551) [07:08:09] (03CR) 10Marostegui: [C:03+2] db2246: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197470 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [07:10:05] (03Merged) 10jenkins-bot: Follow-up Iedb6361: Set insert-ignore on all insertSelect queries [extensions/Flow] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357) (owner: 10Esanders) [07:10:20] (03PS1) 10Marostegui: instances.yaml: Add db2246 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197530 (https://phabricator.wikimedia.org/T406551) [07:10:54] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1197284|Follow-up Iedb6361: Set insert-ignore on all insertSelect queries (T407357)]] [07:10:59] T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357 [07:11:02] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2246 to dbctl [puppet] 
- 10https://gerrit.wikimedia.org/r/1197530 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [07:13:47] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:14:12] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:15:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2246 depooled T406551', diff saved to https://phabricator.wikimedia.org/P84143 and previous config saved to /var/cache/conftool/dbconfig/20251021-071503-marostegui.json [07:15:09] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [07:15:50] !log esanders@deploy2002 esanders: Backport for [[gerrit:1197284|Follow-up Iedb6361: Set insert-ignore on all insertSelect queries (T407357)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:16:20] !log esanders@deploy2002 esanders: Continuing with sync [07:16:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 1%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84144 and previous config saved to /var/cache/conftool/dbconfig/20251021-071632-root.json [07:17:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1232 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84145 and previous config saved to /var/cache/conftool/dbconfig/20251021-071700-root.json [07:17:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P84146 and previous config saved to /var/cache/conftool/dbconfig/20251021-071701-root.json [07:22:39] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197284|Follow-up Iedb6361: Set insert-ignore on all insertSelect queries (T407357)]] (duration: 11m 45s) [07:22:43] T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357 [07:23:59] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-operator: watch the growthbook namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197272 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:24:01] (03CR) 10Brouberol: [C:03+2] Deploy a postgresql-growthbook cluster in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197273 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:24:51] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2033 gradually with 4 steps - Pool es2033.codfw.wmnet in after cloning [07:25:47] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:25:56] dcausse: all done [07:26:07] edsanders: thanks, shipping mine [07:27:52] 07sre-alert-triage, 06Infrastructure-Foundations, 
10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407833 (10LSobanski) 03NEW [07:30:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196064 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:30:24] (03PS2) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [07:30:40] (03CR) 10Mszwarc: [C:03+1] Define CheckUser Suggested Investigations event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197278 (https://phabricator.wikimedia.org/T404177) (owner: 10Dreamy Jazz) [07:32:03] (03Merged) 10jenkins-bot: cirrus: prepare completion search with defaultsort A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196064 (https://phabricator.wikimedia.org/T404858) (owner: 10DCausse) [07:32:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P84148 and previous config saved to /var/cache/conftool/dbconfig/20251021-073207-root.json [07:32:38] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1196064|cirrus: prepare completion search with defaultsort A/B test (T404858)]] [07:32:42] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [07:34:00] (03PS1) 10Joely Rooke WMDE: Revert^2 "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197552 [07:37:13] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1196064|cirrus: prepare completion search with defaultsort A/B test (T404858)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:38:08] (03Merged) 10jenkins-bot: cloudnative-pg-operator: watch the growthbook namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197272 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:38:30] !log dcausse@deploy2002 dcausse: Continuing with sync [07:38:49] (03Merged) 10jenkins-bot: Deploy a postgresql-growthbook cluster in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197273 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:39:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1052 to es1 master and depool es1029 T407832', diff saved to https://phabricator.wikimedia.org/P84149 and previous config saved to /var/cache/conftool/dbconfig/20251021-073904-marostegui.json [07:39:09] T407832: decommission es1029.eqiad.wmnet - https://phabricator.wikimedia.org/T407832 [07:39:57] (03PS1) 10Marostegui: es1029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197553 (https://phabricator.wikimedia.org/T407832) [07:41:39] (03CR) 10Marostegui: [C:03+2] es1029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197553 (https://phabricator.wikimedia.org/T407832) (owner: 10Marostegui) [07:41:55] (03CR) 10Brouberol: [C:03+2] deployment_server: create kubeconfigs to deploy postgresql-growthbook [puppet] - 10https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [07:42:36] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196064|cirrus: prepare completion search with defaultsort A/B test (T404858)]] (duration: 09m 58s) [07:42:41] T404858: A/B test using defaultsort with the completion suggester - https://phabricator.wikimedia.org/T404858 [07:43:51] !log marostegui@cumin1003 START - Cookbook sre.dns.netbox [07:46:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 5%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84151 and previous config 
saved to /var/cache/conftool/dbconfig/20251021-074604-root.json [07:47:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P84152 and previous config saved to /var/cache/conftool/dbconfig/20251021-074713-root.json [07:47:57] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 from sretest2003 - marostegui@cumin1003" [07:48:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 from sretest2003 - marostegui@cumin1003" [07:48:01] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:49:15] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [07:49:58] (03CR) 10Brouberol: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1196734 (https://phabricator.wikimedia.org/T309738) (owner: 10Scott French) [07:53:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [07:53:39] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [07:54:12] FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:56:15] (03CR) 10Cathal Mooney: [C:03+2] homer-diff-checker: move execution from cumin1002 to cumin1003 [puppet] - 10https://gerrit.wikimedia.org/r/1197321 (https://phabricator.wikimedia.org/T389380) (owner: 10Cathal Mooney) [07:56:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11292629 (10elukey) >>! In T406656#11289937, @bking wrote: > {F66767261} Please note that the examples that you posted above are not r... [07:57:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Pool sretest2003 with minimal weight T407352', diff saved to https://phabricator.wikimedia.org/P84154 and previous config saved to /var/cache/conftool/dbconfig/20251021-075741-marostegui.json [07:57:47] T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352 [07:58:35] (03PS3) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [07:59:41] (03PS2) 10Slyngshede: CAS version 7.2.6. 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1149665 [08:01:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 7%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84155 and previous config saved to /var/cache/conftool/dbconfig/20251021-080110-root.json [08:02:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P84156 and previous config saved to /var/cache/conftool/dbconfig/20251021-080219-root.json [08:03:01] (03PS3) 10Slyngshede: CAS version 7.2.7 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1149665 (https://phabricator.wikimedia.org/T406455) [08:04:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Increase sretest2003 weight in es1 T407352', diff saved to https://phabricator.wikimedia.org/P84157 and previous config saved to /var/cache/conftool/dbconfig/20251021-080412-marostegui.json [08:04:18] T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352 [08:04:58] (03PS4) 10Slyngshede: CAS version 7.2.7 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1149665 (https://phabricator.wikimedia.org/T406455) [08:06:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:06:55] 06SRE, 06Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11292650 (10cmooney) >>! In T407726#11290651, @jhathaway wrote: > Since we seem to be able to handle the load okay, I think we should bump the max conntrack setting. Ok.... 
[08:07:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Increase sretest2003 weight in es1 T407352', diff saved to https://phabricator.wikimedia.org/P84158 and previous config saved to /var/cache/conftool/dbconfig/20251021-080733-marostegui.json [08:07:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:07:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 07Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11292651 (10elukey) >>! In T406656#11290513, @Dzahn wrote: > I just wanted to add that I still just see a logical conflict between 2 st... [08:09:12] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:09:32] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-growthbook: apply [08:09:37] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-growthbook: apply [08:10:21] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2033 gradually with 4 steps - Pool es2033.codfw.wmnet in after cloning [08:10:22] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.clone_es (exit_code=0) of es2033.codfw.wmnet onto es2056.codfw.wmnet [08:13:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:16:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 10%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84160 and previous config saved to /var/cache/conftool/dbconfig/20251021-081616-root.json 
[08:16:35] 06SRE, 06Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11292678 (10cmooney) Hmm so the plot thickens, seems someone already tried this: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/m... [08:16:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Increase sretest2003 weight in es1 T407352', diff saved to https://phabricator.wikimedia.org/P84161 and previous config saved to /var/cache/conftool/dbconfig/20251021-081644-marostegui.json [08:16:49] T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352 [08:17:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P84162 and previous config saved to /var/cache/conftool/dbconfig/20251021-081725-root.json [08:19:20] (03PS1) 10David Caro: p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) [08:21:22] (03CR) 10CI reject: [V:04-1] p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) (owner: 10David Caro) [08:21:33] (03PS1) 10Brouberol: growthbook: remove all traces of mongoDB from the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197589 (https://phabricator.wikimedia.org/T406579) [08:23:27] (03CR) 10Majavah: thanos-rule: add support for multiple instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: 10Herron) [08:24:16] (03PS2) 10David Caro: p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) [08:27:33] (03PS1) 10Majavah: thanos::rule: Cleanup firewall handling [puppet] - 10https://gerrit.wikimedia.org/r/1197590 
(https://phabricator.wikimedia.org/T407837) [08:27:35] (03PS1) 10Majavah: P:wmcs::metricsinfra: Fix thanos::rule usage [puppet] - 10https://gerrit.wikimedia.org/r/1197591 (https://phabricator.wikimedia.org/T407837) [08:28:36] (03CR) 10Majavah: [C:03+1] p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) (owner: 10David Caro) [08:29:12] (03CR) 10Cathal Mooney: [C:03+2] Add new Nokia switches to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1196926 (https://phabricator.wikimedia.org/T405558) (owner: 10Cathal Mooney) [08:30:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7334/co" [puppet] - 10https://gerrit.wikimedia.org/r/1197590 (https://phabricator.wikimedia.org/T407837) (owner: 10Majavah) [08:30:47] (03CR) 10Elukey: [C:03+2] multirootca: add the client auth usage to the dse_k8s discovery issuer profile [puppet] - 10https://gerrit.wikimedia.org/r/1196920 (https://phabricator.wikimedia.org/T406876) (owner: 10Brouberol) [08:31:02] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7335/console" [puppet] - 10https://gerrit.wikimedia.org/r/1197591 (https://phabricator.wikimedia.org/T407837) (owner: 10Majavah) [08:31:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 20%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84163 and previous config saved to /var/cache/conftool/dbconfig/20251021-083122-root.json [08:32:01] (03CR) 10David Caro: [C:03+2] p:toolforge::prometheus: add logs api [puppet] - 10https://gerrit.wikimedia.org/r/1197587 (https://phabricator.wikimedia.org/T127367) (owner: 10David Caro) [08:32:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P84164 and previous config saved to /var/cache/conftool/dbconfig/20251021-083231-root.json [08:39:25] 06SRE, 10SRE-Access-Requests: Requesting access to fr-tech-devs for lsandergreen - https://phabricator.wikimedia.org/T406927#11292758 (10jijiki) confirmed oob [08:39:36] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SKaram-WMF - https://phabricator.wikimedia.org/T407094#11292759 (10jijiki) confirmed oob [08:39:36] !log restart cfssl-multirootca on pki nodes to pick up new discovery settings (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196920) [08:39:37] (03PS1) 10Federico Ceratto: instances.yaml, es2056.yaml: prepare es2056 [puppet] - 10https://gerrit.wikimedia.org/r/1197594 (https://phabricator.wikimedia.org/T402859) [08:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:09] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [08:44:47] (03CR) 10Cathal Mooney: [C:03+2] sudoers: allow members of datacenter-ops group run homer [puppet] - 10https://gerrit.wikimedia.org/r/1196090 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [08:44:51] (03PS4) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [08:44:53] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:45:46] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "run sync to add new nokia switches - cmooney@cumin1003 - T405558" [08:45:46] !log urbanecm@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [08:45:50] T405558: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558 [08:46:03] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "run sync to add new nokia switches - 
cmooney@cumin1003 - T405558" [08:46:21] !log urbanecm@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [08:46:23] (03CR) 10CI reject: [V:04-1] Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [08:46:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 25%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84166 and previous config saved to /var/cache/conftool/dbconfig/20251021-084628-root.json [08:46:54] (03CR) 10Marostegui: [C:03+1] instances.yaml, es2056.yaml: prepare es2056 [puppet] - 10https://gerrit.wikimedia.org/r/1197594 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [08:48:07] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 60 hosts with reason: downtime new nokia devices in case they alert during tests [08:48:20] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11292788 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d70417af-8325-49e7-a880-7a0cd37bd2d2) set by cmo... 
[08:51:50] (03CR) 10Federico Ceratto: [C:03+2] clone_es.py: clone readonly es* hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1183646 (owner: 10Federico Ceratto) [08:54:15] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:54:36] (03PS1) 10Marostegui: db2245: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197597 (https://phabricator.wikimedia.org/T406551) [08:54:45] FIRING: Emergency syslog message: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [08:55:30] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml, es2056.yaml: prepare es2056 [puppet] - 10https://gerrit.wikimedia.org/r/1197594 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [08:57:54] (03PS5) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [08:59:29] (03CR) 10Marostegui: [C:03+2] db2245: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1197597 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [08:59:45] RESOLVED: Emergency syslog message: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:01:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 30%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84167 and previous config saved to /var/cache/conftool/dbconfig/20251021-090134-root.json [09:02:46] (03CR) 10CI reject: [V:04-1] Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 (owner: 10Alexandros Kosiaris) [09:02:49] (03PS1) 
10Marostegui: db-test*: Change section [puppet] - 10https://gerrit.wikimedia.org/r/1197599 (https://phabricator.wikimedia.org/T400056) [09:02:56] !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2056.codfw.wmnet [09:02:57] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2056.codfw.wmnet [09:03:32] !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2056 slowly with 10 steps - Pooling in new host [09:03:47] 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#11292847 (10Marostegui) What is pending here? Is this now a duplicate of https://phabricator.wikimedia.org/T400056? [09:04:19] (03CR) 10Federico Ceratto: [C:03+1] db-test*: Change section [puppet] - 10https://gerrit.wikimedia.org/r/1197599 (https://phabricator.wikimedia.org/T400056) (owner: 10Marostegui) [09:04:45] (03CR) 10Marostegui: [C:03+2] db-test*: Change section [puppet] - 10https://gerrit.wikimedia.org/r/1197599 (https://phabricator.wikimedia.org/T400056) (owner: 10Marostegui) [09:04:59] 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#11292849 (10Ladsgroup) >>! In T389089#11292847, @Marostegui wrote: > What is pending here? Is this now a duplicate of https://phabricator.wikimedia.org/T400056? This is not a duplicate. This VM is for... 
[09:05:32] 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#11292851 (10Marostegui) Ah cool, thanks [09:07:54] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db-test1003.eqiad.wmnet with OS trixie [09:11:00] (03PS1) 10Marostegui: instances.yaml: Add db2245 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197600 (https://phabricator.wikimedia.org/T406551) [09:12:20] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2245 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1197600 (https://phabricator.wikimedia.org/T406551) (owner: 10Marostegui) [09:14:13] (03PS6) 10Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197304 [09:14:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2245 depooled T406551', diff saved to https://phabricator.wikimedia.org/P84168 and previous config saved to /var/cache/conftool/dbconfig/20251021-091418-marostegui.json [09:14:24] T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551 [09:14:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:14:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:15:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:15:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:16:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 50%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84169 and previous config saved to /var/cache/conftool/dbconfig/20251021-091640-root.json [09:16:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:16:52] !log 
brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:17:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:17:43] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db-test1003.eqiad.wmnet with reason: host reimage [09:18:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:18:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84170 and previous config saved to /var/cache/conftool/dbconfig/20251021-091817-root.json [09:18:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:19:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:19:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:20:08] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:20:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:22:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:23:02] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:23:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:23:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:23:48] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db-test1003.eqiad.wmnet with reason: host 
reimage [09:23:48] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:24:00] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:24:09] (03PS1) 10Elukey: profile::amd_gpu: upgrade trixie hosts to ROCm 7.0.2 repos [puppet] - 10https://gerrit.wikimedia.org/r/1197602 (https://phabricator.wikimedia.org/T403697) [09:24:56] (03PS1) 10Ladsgroup: api: Fix incorrect templatelinks query in ApiQueryInfo [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197603 (https://phabricator.wikimedia.org/T407842) [09:25:25] (03PS2) 10Elukey: profile::amd_gpu: upgrade trixie hosts to ROCm 7.0.2 repos [puppet] - 10https://gerrit.wikimedia.org/r/1197602 (https://phabricator.wikimedia.org/T403697) [09:27:13] (03PS1) 10Marostegui: db1234: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197604 (https://phabricator.wikimedia.org/T407463) [09:28:00] (03CR) 10Marostegui: [C:03+2] db1234: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1197604 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [09:28:38] (03PS1) 10Elukey: role::maps::master_bookworm: fix EG stream name in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) [09:29:07] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:29:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1234.eqiad.wmnet with reason: Maintenance [09:29:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1234 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84171 and previous config saved to /var/cache/conftool/dbconfig/20251021-092911-marostegui.json [09:29:20] FIRING: [2x] SystemdUnitFailed: 
amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:29:24] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:29:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:29:49] FIRING: HelmReleaseBadStatus: Helm release growthbook/ferretdb-growthbook on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=growthbook - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:31:00] (03PS1) 10Slyngshede: data.yaml: record LDAP access for dpogorzelski [puppet] - 10https://gerrit.wikimedia.org/r/1197606 [09:31:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 60%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84172 and previous config saved to /var/cache/conftool/dbconfig/20251021-093146-root.json [09:32:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:32:35] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:32:47] (03PS2) 10Elukey: role::maps::master_bookworm: fix EG stream name in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) [09:32:55] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:33:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 5%: Pooling for the first time', diff saved to 
https://phabricator.wikimedia.org/P84173 and previous config saved to /var/cache/conftool/dbconfig/20251021-093323-root.json [09:34:14] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:34:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:35:32] (03CR) 10Elukey: [C:03+2] role::maps::master_bookworm: fix EG stream name in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1197605 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:36:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1234 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84175 and previous config saved to /var/cache/conftool/dbconfig/20251021-093652-root.json [09:40:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:40:57] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:44:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:44:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:44:49] RESOLVED: HelmReleaseBadStatus: Helm release growthbook/ferretdb-growthbook on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=growthbook - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:44:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/ferretdb-growthbook: apply [09:45:01] (03CR) 10Tim Starling: [C:03+1] "Approved for deployment" [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197603 (https://phabricator.wikimedia.org/T407842) 
(owner: 10Ladsgroup) [09:46:33] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [09:46:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [09:46:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 75%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84176 and previous config saved to /var/cache/conftool/dbconfig/20251021-094652-root.json [09:48:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 7%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84177 and previous config saved to /var/cache/conftool/dbconfig/20251021-094829-root.json [09:48:49] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [09:48:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [09:49:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [09:49:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [09:50:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [09:50:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [09:51:44] (03PS1) 10Tiziano Fogli: sre/zookeeper: trigger a page for lost quorum on main cluster [alerts] - 10https://gerrit.wikimedia.org/r/1197607 (https://phabricator.wikimedia.org/T309012) [09:51:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1234 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84179 and previous config saved to /var/cache/conftool/dbconfig/20251021-095157-root.json [09:52:22] (03CR) 10Lucas Werkmeister (WMDE): "I don’t think we should revert anything on wmf.23 at 
this point." [extensions/Wikibase] (wmf/1.45.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1197552 (owner: 10Joely Rooke WMDE) [09:52:27] jouncebot: nowandnext [09:52:27] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [09:52:27] In 0 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1000) [09:52:51] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197608 (https://phabricator.wikimedia.org/T401290) [09:53:02] SREs: may I deploy ^ ASAP? [09:53:32] I deployed this revert to wmf.23 yesterday, but then forgot to +2 it on the master branch, so now wmf.24 has broken code again and I really ought to fix that before the train rolls out [09:55:14] (03PS4) 10Elukey: Revert workarounds to exclude elasticsearch_cluster.py on Bookworm+ [software/spicerack] - 10https://gerrit.wikimedia.org/r/1196923 (https://phabricator.wikimedia.org/T390860) [09:56:08] (03PS1) 10Federico Ceratto: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) [09:57:17] (03CR) 10Tiziano Fogli: "I noticed on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1192855 that the "Zookeeper server" checks on nodes conf.* were triggeri" [alerts] - 10https://gerrit.wikimedia.org/r/1197607 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:58:25] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1197607 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:58:34] (03CR) 10Tiziano Fogli: [C:03+2] sre/zookeeper: trigger a page for lost quorum on main cluster [alerts] - 10https://gerrit.wikimedia.org/r/1197607 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:58:42] (03CR) 
10Tiziano Fogli: [C:03+2] zookeeper: remove check_prometheus, disable nrpe [puppet] - 10https://gerrit.wikimedia.org/r/1192855 (https://phabricator.wikimedia.org/T309012) (owner: 10Tiziano Fogli) [09:59:45] (03CR) 10Marostegui: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251021T1000) [10:00:33] reiterating my question from above, to whoever is responsible for this window [10:00:41] can I deploy a MediaWiki revert? [10:01:50] if it's an emergency I'd say go ahead [10:01:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2246 (re)pooling @ 100%: Pooling new host in s4', diff saved to https://phabricator.wikimedia.org/P84180 and previous config saved to /var/cache/conftool/dbconfig/20251021-100158-root.json [10:02:11] mention in _security also [10:02:53] Lucas_WMDE: go [10:03:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84181 and previous config saved to /var/cache/conftool/dbconfig/20251021-100335-root.json [10:03:42] thanks [10:04:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197608 (https://phabricator.wikimedia.org/T401290) (owner: 10Lucas Werkmeister (WMDE)) [10:04:56] (03CR) 10Elukey: [C:03+2] services: move tegola and kartotherian's eqiad configs to the new stack [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196803 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:05:02] +2ed, I’ll let gate-and-submit run its course (not urgent enough for a force-merge imho) [10:05:10] and see if I can reproduce it on 
testwiki in the meantime [10:06:09] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [10:07:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1234 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84183 and previous config saved to /var/cache/conftool/dbconfig/20251021-100703-root.json [10:07:07] (03PS2) 10Tiziano Fogli: dbbackups: enable nrpe2nodexp wrapper on mariadb_${type}_... checks [puppet] - 10https://gerrit.wikimedia.org/r/1196939 (https://phabricator.wikimedia.org/T315866) [10:09:03] (03PS13) 10Brouberol: opensearch-cluster: enable external ingress with TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196700 (https://phabricator.wikimedia.org/T406876) [10:09:12] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:09:33] once you're done, please ping me. I have a train blocker to deploy :D [10:12:36] Amir1: how urgent is it? we could swap :P [10:12:50] I think my revert isn’t super urgent, it should just happen before the train proceeds to group0 (beyond test wikis) [10:13:05] it's not that urgent but also gonna take a while to merge [10:13:17] ok [10:13:21] (03CR) 10Ladsgroup: [C:03+2] api: Fix incorrect templatelinks query in ApiQueryInfo [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197603 (https://phabricator.wikimedia.org/T407842) (owner: 10Ladsgroup) [10:13:34] (03CR) 10Hnowlan: [C:03+2] Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [10:14:05] I've +2'ed it, it's quite straightforward, if it gets merged together or sooner than your patch, would you mind bundling it? 
[10:14:31] I mean, it makes the scap output more confusing [10:14:40] I could cancel the current spiderpig and start another one with both patches [10:14:47] shouldn’t affect the ongoing gate-and-submit builds [10:14:54] does that sound okay? [10:16:15] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [10:16:26] (03PS1) 10Elukey: kubernetes: add the maps bookworm eqiad external service config [puppet] - 10https://gerrit.wikimedia.org/r/1197611 (https://phabricator.wikimedia.org/T381565) [10:16:36] (03Merged) 10jenkins-bot: Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197608 (https://phabricator.wikimedia.org/T401290) (owner: 10Lucas Werkmeister (WMDE)) [10:16:46] too late [10:16:55] ok, so this scap is just for the Wikibase revert then [10:16:59] (03PS2) 10Elukey: kubernetes: add the maps bookworm eqiad external service config [puppet] - 10https://gerrit.wikimedia.org/r/1197611 (https://phabricator.wikimedia.org/T381565) [10:17:13] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1197608|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] [10:17:21] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [10:17:21] T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684 [10:17:21] T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744 [10:17:28] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1197611 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:18:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'db2245 (re)pooling @ 
20%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P84184 and previous config saved to /var/cache/conftool/dbconfig/20251021-101841-root.json [10:19:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:21:28] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1197608|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:21:49] testing [10:22:06] yay, works [10:22:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1234 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84186 and previous config saved to /var/cache/conftool/dbconfig/20251021-102209-root.json [10:22:12] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [10:23:56] (03PS3) 10Superpes15: Throttle exemption for Editathon by Wikimedistas en Cruce - 6/7 November 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1196930 (https://phabricator.wikimedia.org/T407630) [10:25:57] (03Merged) 10jenkins-bot: api: Fix incorrect templatelinks query in ApiQueryInfo [core] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1197603 (https://phabricator.wikimedia.org/T407842) (owner: 10Ladsgroup) [10:26:06] (03CR) 10Federico Ceratto: preseed.yaml, site.pp, es2057.yaml: Prepare es2057 for es3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197609 (https://phabricator.wikimedia.org/T402859) (owner: 10Federico Ceratto) [10:26:28] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197608|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 
T407744)]] (duration: 09m 15s) [10:26:36] T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290 [10:26:36] T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684 [10:26:36] T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744 [10:26:49] Amir1: over to you [10:26:53] thanks! [10:27:36] !log oblivian@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply [10:27:43] !log oblivian@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: apply [10:28:10] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1197603|api: Fix incorrect templatelinks query in ApiQueryInfo (T407842)]] [10:28:14] T407842: PHP Warning: Undefined property: stdClass::$tl_namespace - https://phabricator.wikimedia.org/T407842 [10:30:30] (03PS1) 10Giuseppe Lavagetto: jaeger: fix CIDR for idp1005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197612 [10:30:59] (03CR) 10Elukey: [C:03+1] jaeger: fix CIDR for idp1005 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197612 (owner: 10Giuseppe Lavagetto) [10:31:35] (03CR) 10Jcrespo: [C:03+1] dbbackups: enable nrpe2nodexp wrapper on mariadb_${type}_... checks [puppet] - 10https://gerrit.wikimedia.org/r/1196939 (https://phabricator.wikimedia.org/T315866) (owner: 10Tiziano Fogli) [10:32:07] (03PS1) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) [10:32:21] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1197603|api: Fix incorrect templatelinks query in ApiQueryInfo (T407842)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.