[00:00:05] Deploy window Abstract Wikipedia emergency deploy window (one-off) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T0000) [00:00:32] (03PS1) 10Santiago Faci: Metrics Platform PHP client library: set performer_registration_dt as null when the user is anon [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199525 (https://phabricator.wikimedia.org/T408547) [00:06:09] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11321757 (10Papaul) [00:07:20] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11321759 (10Dzahn) Noticed other inconsistencies like: I can ssh to tcp-proxy2002 and it exists and is fine:... [00:07:29] (03CR) 10Cory Massaro: [C:03+2] Wikifunctions: Upgrade orchestrator from 2025-10-22-011302 to 2025-10-28-205854. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199504 (https://phabricator.wikimedia.org/T406540) (owner: 10Cory Massaro) [00:09:08] (03Merged) 10jenkins-bot: Wikifunctions: Upgrade orchestrator from 2025-10-22-011302 to 2025-10-28-205854. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199504 (https://phabricator.wikimedia.org/T406540) (owner: 10Cory Massaro) [00:12:47] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [00:12:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199524 (https://phabricator.wikimedia.org/T408547) (owner: 10Santiago Faci) [00:13:13] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [00:13:39] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [00:14:03] (03PS2) 10Santiago Faci: Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199524 (https://phabricator.wikimedia.org/T408547) [00:14:14] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [00:14:17] (03PS2) 10Santiago Faci: Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199525 (https://phabricator.wikimedia.org/T408547) [00:14:25] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [00:14:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199525 (https://phabricator.wikimedia.org/T408547) (owner: 10Santiago Faci) [00:16:23] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [00:16:53] (03CR) 10Cory Massaro: 
[C:03+2] Wikifunctions: Update function-evaluators from 2025-10-21-143846 to 2025-10-28-150053. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199505 (https://phabricator.wikimedia.org/T407718) (owner: 10Cory Massaro) [00:18:30] (03Merged) 10jenkins-bot: Wikifunctions: Update function-evaluators from 2025-10-21-143846 to 2025-10-28-150053. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199505 (https://phabricator.wikimedia.org/T407718) (owner: 10Cory Massaro) [00:19:13] (03CR) 10Clare Ming: [C:03+1] Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199525 (https://phabricator.wikimedia.org/T408547) (owner: 10Santiago Faci) [00:19:21] (03CR) 10Clare Ming: [C:03+1] Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199524 (https://phabricator.wikimedia.org/T408547) (owner: 10Santiago Faci) [00:20:05] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [00:20:43] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [00:21:48] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [00:22:34] !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [00:22:44] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [00:23:46] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [00:34:24] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:17] (03PS1) 
10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1199528 [00:38:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1199528 (owner: 10TrainBranchBot) [00:50:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1199528 (owner: 10TrainBranchBot) [00:59:24] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:00:46] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:08:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1199530 [01:08:22] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1199530 (owner: 10TrainBranchBot) [01:14:01] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 15s) [01:31:43] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1199530 (owner: 10TrainBranchBot) [01:33:19] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:40:46] (03PS1) 10Andrew Bogott: dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) [01:41:24] (03CR) 10CI reject: [V:04-1] dnsrecursor: fix handling of 
auth_zones for wikimedia-common.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [01:42:57] (03PS2) 10Andrew Bogott: dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) [01:42:59] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [01:43:04] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [01:43:19] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:43:33] (03CR) 10CI reject: [V:04-1] dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [01:45:59] (03PS3) 10Andrew Bogott: dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) [01:46:04] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [01:46:37] (03CR) 10CI reject: [V:04-1] dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [01:49:45] (03PS4) 10Andrew Bogott: dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - 
10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) [01:50:22] (03CR) 10CI reject: [V:04-1] dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [01:52:50] (03PS5) 10Andrew Bogott: dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) [01:55:21] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: 10Andrew Bogott) [02:24:24] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:32:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [03:39:24] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:49:24] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:53:19] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:53:19] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:24] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:58:19] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:59:24] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:03:19] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:03:43] 06SRE, 10Hiddenparma, 06Traffic, 13Patch-For-Review: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11322007 (10Joe) 05Open→03Resolved [05:08:19] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
- https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:09:24] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:13:19] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:20:11] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning [05:30:26] (03CR) 10Marostegui: [C:03+1] sanitize-wiki: log into phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512) (owner: 10Federico Ceratto) [05:33:19] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:38:12] (03PS1) 10Marostegui: instances.yaml: Remove es1032 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199540 (https://phabricator.wikimedia.org/T408662) [05:38:56] (03CR) 10Marostegui: [C:03+2] instances.yaml: Remove es1032 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199540 (https://phabricator.wikimedia.org/T408662) (owner: 10Marostegui) [05:40:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1032 from dbctl T408662', diff saved to https://phabricator.wikimedia.org/P84321 and previous config saved to /var/cache/conftool/dbconfig/20251029-054019-marostegui.json [05:40:24] T408662: decommission es1032.eqiad.wmnet - https://phabricator.wikimedia.org/T408662 
[05:44:24] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:50:26] (03PS1) 10Marostegui: backup1013.cnf.erb: Change es1032 with es1055 [puppet] - 10https://gerrit.wikimedia.org/r/1199541 (https://phabricator.wikimedia.org/T408662) [05:50:42] (03PS1) 10Marostegui: es1032: Decommission [puppet] - 10https://gerrit.wikimedia.org/r/1199542 (https://phabricator.wikimedia.org/T408662) [05:50:53] (03CR) 10Marostegui: "This is a NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1199541 (https://phabricator.wikimedia.org/T408662) (owner: 10Marostegui) [05:51:23] (03PS2) 10Marostegui: es1032: Decommission [puppet] - 10https://gerrit.wikimedia.org/r/1199542 (https://phabricator.wikimedia.org/T408662) [05:52:22] !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1032.eqiad.wmnet [05:52:28] (03CR) 10Marostegui: [C:03+2] es1032: Decommission [puppet] - 10https://gerrit.wikimedia.org/r/1199542 (https://phabricator.wikimedia.org/T408662) (owner: 10Marostegui) [05:56:05] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es2027 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199476 (https://phabricator.wikimedia.org/T408406) (owner: 10Federico Ceratto) [05:57:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T407975 [05:58:04] T407975: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T407975 [05:58:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Set db1184 with weight 0 T407975', diff saved to https://phabricator.wikimedia.org/P84323 and previous config saved to /var/cache/conftool/dbconfig/20251029-055813-marostegui.json [05:58:15] !log marostegui@cumin1003 START - Cookbook 
sre.dns.netbox [05:59:29] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1198037 (https://phabricator.wikimedia.org/T407975) (owner: 10Gerrit maintenance bot) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T0600) [06:01:37] !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1032.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:01:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1032.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003" [06:01:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:01:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1032.eqiad.wmnet [06:02:12] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1032.eqiad.wmnet - https://phabricator.wikimedia.org/T408662#11322068 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1003 for hosts: `es1032.eqiad.wmnet` - es1032.eqiad.wmnet (**PASS**) - Downtimed... 
[06:02:14] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1032.eqiad.wmnet - https://phabricator.wikimedia.org/T408662#11322069 (10Marostegui) [06:02:30] 10ops-eqiad, 06DBA, 06DC-Ops, 10decommission-hardware: decommission es1032.eqiad.wmnet - https://phabricator.wikimedia.org/T408662#11322074 (10Marostegui) This is ready for #dc-ops [06:02:49] !log Starting s1 eqiad failover from db1163 to db1184 - T407975 [06:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote db1184 to s1 primary T407975', diff saved to https://phabricator.wikimedia.org/P84324 and previous config saved to /var/cache/conftool/dbconfig/20251029-060314-marostegui.json [06:03:20] T407975: Switchover s1 master (db1163 -> db1184) - https://phabricator.wikimedia.org/T407975 [06:03:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1163 T407975', diff saved to https://phabricator.wikimedia.org/P84325 and previous config saved to /var/cache/conftool/dbconfig/20251029-060356-marostegui.json [06:05:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning [06:05:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2040.codfw.wmnet onto sretest2003.codfw.wmnet [06:05:53] !log marostegui@cumin1003 START - Cookbook sre.mysql.upgrade for db1163.eqiad.wmnet [06:06:03] !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db1163 - Upgrading db1163.eqiad.wmnet [06:06:11] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1163 - Upgrading db1163.eqiad.wmnet [06:06:34] (03PS1) 10Marostegui: db1163: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1199546 (https://phabricator.wikimedia.org/T407463) [06:14:16] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.mysql.upgrade (exit_code=97) for 
db1163.eqiad.wmnet [06:14:31] (03CR) 10Marostegui: [C:03+2] db1163: Migration to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1199546 (https://phabricator.wikimedia.org/T407463) (owner: 10Marostegui) [06:15:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1163.eqiad.wmnet with reason: Upgrade [06:16:04] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:18:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 1%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84327 and previous config saved to /var/cache/conftool/dbconfig/20251029-061823-root.json [06:23:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84328 and previous config saved to /var/cache/conftool/dbconfig/20251029-062317-root.json [06:24:24] (03CR) 10Marostegui: [C:03+1] major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [06:24:24] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:25:54] 06SRE, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11322122 (10Krd) We need an analysis what exactly happened, and perhaps a strategy not to accept such fake bounces at all. And we please need some monitoring that detects unusual e-m... 
[06:33:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 5%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84329 and previous config saved to /var/cache/conftool/dbconfig/20251029-063329-root.json [06:38:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84330 and previous config saved to /var/cache/conftool/dbconfig/20251029-063823-root.json [06:48:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 7%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84331 and previous config saved to /var/cache/conftool/dbconfig/20251029-064835-root.json [06:50:16] (03PS1) 10Marostegui: installserver: Remove es1053 [puppet] - 10https://gerrit.wikimedia.org/r/1199552 [06:52:32] (03CR) 10Marostegui: [C:03+2] installserver: Remove es1053 [puppet] - 10https://gerrit.wikimedia.org/r/1199552 (owner: 10Marostegui) [06:53:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84332 and previous config saved to /var/cache/conftool/dbconfig/20251029-065330-root.json [06:53:53] 06SRE, 10envoy, 06serviceops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Upgrade Envoy to v1.29.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11322129 (10Gehel) p:05Triage→03High [06:55:01] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf group for jpchev - https://phabricator.wikimedia.org/T408636#11322130 (10Jpchev) >>! In T408636#11321623, @Dzahn wrote: > @Jpchev Hi there, are you a Wikimedia Foundation employee or contractor? Or are you asking for access as a volunteer? Any specific syste... 
[06:59:59] (03CR) 10Krinkle: [C:03+1] ExtensionDistributor: Mark 1.45 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: 10Arlolra) [07:00:04] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T0700). [07:00:05] sfaci: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:22] o/ [07:01:05] (03CR) 10Krinkle: [C:03+1] varnishtest: Remove logfile support [puppet] - 10https://gerrit.wikimedia.org/r/1199068 (https://phabricator.wikimedia.org/T408202) (owner: 10BCornwall) [07:03:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 10%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84333 and previous config saved to /var/cache/conftool/dbconfig/20251029-070342-root.json [07:08:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84334 and previous config saved to /var/cache/conftool/dbconfig/20251029-070838-root.json [07:10:30] is the morning backport window going to happen? 
[07:18:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 20%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84335 and previous config saved to /var/cache/conftool/dbconfig/20251029-071848-root.json [07:32:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:32:56] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11322144 (10RKemper) Working on the new metrics [[ https://grafana-rw.wikimedia.org/d/8b066769-b821-4069-9f3e-0... [07:33:00] urbanecm is the backport window going to happen? [07:33:26] !log krinkle@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [07:33:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 25%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84336 and previous config saved to /var/cache/conftool/dbconfig/20251029-073354-root.json [07:36:53] (03CR) 10Slyngshede: [C:03+1] Update account meta data for khantstop [puppet] - 10https://gerrit.wikimedia.org/r/1199302 (owner: 10Muehlenhoff) [07:37:34] !log krinkle@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [07:38:47] !log krinkle@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [07:42:52] (03CR) 10Muehlenhoff: [C:03+2] Update account meta data for khantstop [puppet] - 10https://gerrit.wikimedia.org/r/1199302 (owner: 10Muehlenhoff) [07:43:00] (03CR) 10Nikerabbit: [C:03+1] alertmanager: route Language and Product Localization team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1199248 
(https://phabricator.wikimedia.org/T376535) (owner: 10Huei Tan) [07:49:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 30%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84337 and previous config saved to /var/cache/conftool/dbconfig/20251029-074902-root.json [07:49:16] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11322154 (10fgiunchedi) 05Stalled→03Open [07:50:12] !log upgrading Java on puppet servers [07:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:53] (03CR) 10DCausse: [C:03+1] cirrus: Start near match A/B test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [08:04:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 50%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84338 and previous config saved to /var/cache/conftool/dbconfig/20251029-080408-root.json [08:09:58] (03CR) 10Elukey: [C:03+2] prometheus-amd-rocm: fix exporter for ROCm 7.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1199465 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [08:11:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:15:08] ^ caused by Puppet server restarts, will recover soonish [08:18:19] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:19:17] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 60%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84339 and previous config saved to /var/cache/conftool/dbconfig/20251029-081914-root.json [08:21:52] (03CR) 10Elukey: Add the sre.hosts.powercycle cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [08:27:13] (03PS4) 10Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) [08:29:18] jouncebot: nowandnext [08:29:18] No deployments scheduled for the next 1 hour(s) and 30 minute(s) [08:29:18] In 1 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1000) [08:29:47] I will backport a patch for T408547 [08:29:48] T408547: Suggested investigations: '.performer.registration_dt' should be string - https://phabricator.wikimedia.org/T408547 [08:31:43] (03PS1) 10Kosta Harlan: SI: Use minimalist keys to reduce action_context size [extensions/CheckUser] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199713 (https://phabricator.wikimedia.org/T408546) [08:32:12] (03PS1) 10Kosta Harlan: SI: Use minimalist keys to reduce action_context size [extensions/CheckUser] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199714 (https://phabricator.wikimedia.org/T408546) [08:32:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199524 (https://phabricator.wikimedia.org/T408547) (owner: 10Santiago Faci) [08:32:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199525 (https://phabricator.wikimedia.org/T408547) (owner: 10Santiago Faci) 
[08:32:57] and then another set of patches for T408546 [08:32:57] T408546: Suggested investigations: case_status_change event is not logged - https://phabricator.wikimedia.org/T408546 [08:34:13] (03Merged) 10jenkins-bot: Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199524 (https://phabricator.wikimedia.org/T408547) (owner: 10Santiago Faci) [08:34:15] (03Merged) 10jenkins-bot: Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199525 (https://phabricator.wikimedia.org/T408547) (owner: 10Santiago Faci) [08:34:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 75%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84340 and previous config saved to /var/cache/conftool/dbconfig/20251029-083423-root.json [08:35:34] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199524|Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon (T408547)]], [[gerrit:1199525|Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon (T408547)]] [08:35:38] T408547: Suggested investigations: '.performer.registration_dt' should be string - https://phabricator.wikimedia.org/T408547 [08:38:05] !log kharlan@deploy2002 sfaci, kharlan: Backport for [[gerrit:1199524|Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon (T408547)]], [[gerrit:1199525|Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon (T408547)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can [08:38:05] now be verified there. [08:38:19] RESOLVED: SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:40:08] !log kharlan@deploy2002 sfaci, kharlan: Continuing with sync [08:40:34] (03PS2) 10Kosta Harlan: SI: Use minimalist keys to reduce action_context size [extensions/CheckUser] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199713 (https://phabricator.wikimedia.org/T408546) [08:40:40] (03PS2) 10Kosta Harlan: SI: Use minimalist keys to reduce action_context size [extensions/CheckUser] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199714 (https://phabricator.wikimedia.org/T408546) [08:41:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:44:21] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199524|Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon (T408547)]], [[gerrit:1199525|Metrics Platform PHP client library: performer_registration_dt won't be added to the user when the user is anon (T408547)]] (duration: 08m 47s) [08:44:25] T408547: Suggested investigations: '.performer.registration_dt' should be string - https://phabricator.wikimedia.org/T408547 [08:44:53] !log installing Jetty security updates [08:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:47] on to the next ones [08:46:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199714 
(https://phabricator.wikimedia.org/T408546) (owner: 10Kosta Harlan) [08:46:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199713 (https://phabricator.wikimedia.org/T408546) (owner: 10Kosta Harlan) [08:49:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'sretest2003 (re)pooling @ 100%: Pooling for the first time in es7', diff saved to https://phabricator.wikimedia.org/P84341 and previous config saved to /var/cache/conftool/dbconfig/20251029-084929-root.json [08:52:59] (03PS13) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) [08:54:10] (03PS1) 10Giuseppe Lavagetto: conftool: upgrade to 6.x and above [software/spicerack] - 10https://gerrit.wikimedia.org/r/1199723 [08:54:46] (03PS14) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) [08:54:56] (03PS1) 10Ozge: feat: updates addalink docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199724 [08:57:47] (03PS1) 10Superpes15: [huwiki] Set $wgUploadNavigationUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199725 (https://phabricator.wikimedia.org/T408298) [08:58:33] (03CR) 10CI reject: [V:04-1] [huwiki] Set $wgUploadNavigationUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199725 (https://phabricator.wikimedia.org/T408298) (owner: 10Superpes15) [08:58:42] (03Merged) 10jenkins-bot: SI: Use minimalist keys to reduce action_context size [extensions/CheckUser] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199714 (https://phabricator.wikimedia.org/T408546) (owner: 10Kosta Harlan) [08:58:43] (03Merged) 10jenkins-bot: SI: Use minimalist keys to reduce action_context size [extensions/CheckUser] 
(wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199713 (https://phabricator.wikimedia.org/T408546) (owner: 10Kosta Harlan) [08:59:16] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199714|SI: Use minimalist keys to reduce action_context size (T408546)]], [[gerrit:1199713|SI: Use minimalist keys to reduce action_context size (T408546)]] [08:59:22] T408546: Suggested investigations: case_status_change event is not logged - https://phabricator.wikimedia.org/T408546 [08:59:24] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:01:20] (03CR) 10Urbanecm: feat: updates addalink docker image version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199724 (owner: 10Ozge) [09:01:44] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1199714|SI: Use minimalist keys to reduce action_context size (T408546)]], [[gerrit:1199713|SI: Use minimalist keys to reduce action_context size (T408546)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:02:30] (03PS2) 10Superpes15: [huwiki] Set $wgUploadNavigationUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199725 (https://phabricator.wikimedia.org/T408298) [09:03:17] (03CR) 10CI reject: [V:04-1] conftool: upgrade to 6.x and above [software/spicerack] - 10https://gerrit.wikimedia.org/r/1199723 (owner: 10Giuseppe Lavagetto) [09:03:25] !log kharlan@deploy2002 kharlan: Continuing with sync [09:04:17] (03PS2) 10Ozge: linkrecommendation: updates linkrecommendation docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199724 [09:04:35] (03CR) 10Ozge: linkrecommendation: updates linkrecommendation docker image version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199724 (owner: 10Ozge) [09:05:45] FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:08:04] (03CR) 10Mszwarc: hCaptcha: Enable hCaptcha for form edits on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:09:41] (03CR) 10Kosta Harlan: hCaptcha: Enable hCaptcha for form edits on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:10:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:12:35] (03CR) 10Mszwarc: [C:03+1] hCaptcha: Enable hCaptcha for form edits on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:14:17] 
(03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2027 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199476 (https://phabricator.wikimedia.org/T408406) (owner: 10Federico Ceratto) [09:14:48] (03CR) 10Federico Ceratto: "Yes, in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198962" [puppet] - 10https://gerrit.wikimedia.org/r/1199311 (https://phabricator.wikimedia.org/T408385) (owner: 10Federico Ceratto) [09:15:07] (03PS1) 10Superpes15: [ruwiki] Enable WikiLove extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199727 (https://phabricator.wikimedia.org/T408514) [09:17:28] !log cgoubert@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [09:19:20] !log cgoubert@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [09:20:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:21:40] (03Merged) 10jenkins-bot: hCaptcha: Enable hCaptcha for form edits on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198100 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:22:10] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1198100|hCaptcha: Enable hCaptcha for form edits on testwiki (T405586)]] [09:22:15] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [09:23:18] (03CR) 10Federico Ceratto: [C:03+2] major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [09:24:21] 06SRE, 06Data-Platform-SRE: Make the shell group analytics-privatedata-users less confusing - https://phabricator.wikimedia.org/T405517#11322324 (10elukey) Thanks a lot for the feedback @Dzahn! 
I checked T408164 and it seems to me that there were two things that caused delay: * The user asking to be added to... [09:24:44] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1198100|hCaptcha: Enable hCaptcha for form edits on testwiki (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [09:25:38] 06SRE, 06collaboration-services, 06Infrastructure-Foundations: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts for new VMs - https://phabricator.wikimedia.org/T408646#11322327 (10elukey) Hey Daniel! Yeah I think they are related, since the new VMs are in netbox but the fact... [09:28:55] (03PS1) 10Kosta Harlan: hCaptcha: Fix usage of wmgEmergencyCaptcha in closure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199729 (https://phabricator.wikimedia.org/T405586) [09:29:13] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11322337 (10elukey) ganeti2033 is a new routet-ganeti cluster: ` elukey@ganeti2033:~$ sudo gnt-instance list... 
[09:29:37] !log kharlan@deploy2002 kharlan: Continuing with sync [09:29:58] (03Merged) 10jenkins-bot: major-upgrade.py: MariaDB major version upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [09:30:33] (03CR) 10Mszwarc: [C:03+1] hCaptcha: Fix usage of wmgEmergencyCaptcha in closure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199729 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:33:52] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1198100|hCaptcha: Enable hCaptcha for form edits on testwiki (T405586)]] (duration: 11m 41s) [09:33:57] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [09:34:24] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:35:17] (03CR) 10Urbanecm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199724 (owner: 10Ozge) [09:35:25] (03CR) 10Federico Ceratto: major-upgrade.py: MariaDB major version upgrade cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [09:35:26] (03CR) 10Ozge: [C:03+2] linkrecommendation: updates linkrecommendation docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199724 (owner: 10Ozge) [09:35:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:36:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199729 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:36:31] Hi! I want to run a maintenance script to add wikidata support for a new language wiki. Let me know if this is a bad time, otherwise I will proceed [09:36:47] (03Merged) 10jenkins-bot: hCaptcha: Fix usage of wmgEmergencyCaptcha in closure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199729 (https://phabricator.wikimedia.org/T405586) (owner: 10Kosta Harlan) [09:37:04] joelyrookewmde: i know kostajh is deploying MediaWiki ATM, and adding wikidata support is relatively error-prone. maybe not do that at the same time? [09:37:13] (03Merged) 10jenkins-bot: linkrecommendation: updates linkrecommendation docker image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199724 (owner: 10Ozge) [09:37:18] (but if both of you are okay with this, no objections, just thinking out loud) [09:37:19] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199729|hCaptcha: Fix usage of wmgEmergencyCaptcha in closure (T405586)]] [09:37:53] oooh thanks for letting me know, i didn't spot that in the deployments schedule. I'll wait :) [09:40:28] joelyrookewmde: I should be done soon, will message you when the sync is done [09:40:37] sounds great. Thanks! [09:42:16] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1199729|hCaptcha: Fix usage of wmgEmergencyCaptcha in closure (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:42:20] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [09:44:24] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:02] !log kharlan@deploy2002 kharlan: Continuing with sync [09:46:03] !log ozge@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [09:47:35] !log ozge@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [09:48:59] !log ozge@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [09:49:10] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199729|hCaptcha: Fix usage of wmgEmergencyCaptcha in closure (T405586)]] (duration: 11m 51s) [09:49:17] T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586 [09:51:03] !log ozge@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [09:51:52] joelyrookewmde: I'm done [09:52:15] cheers, I'll crack on then [09:52:19] have a good day! [09:52:38] !log ozge@deploy2002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [09:52:39] !log joelyrookewmde@deploy2002 mwscript-k8s job started: foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https # Add wikidata support ticket for minwikisource T408347 and pcmwikiquote T408355 [09:52:44] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11322404 (10elukey) tcp-proxy2001 is in a weird state in netbox, I don't see any IPs associated with it: htt... 
[09:52:50] T408347: Add Wikidata support for minwikisource - https://phabricator.wikimedia.org/T408347 [09:52:50] T408355: Add Wikidata support for pcmwikiquote - https://phabricator.wikimedia.org/T408355 [09:53:02] 06SRE, 10envoy, 06serviceops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Upgrade Envoy to v1.29.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11322413 (10Gehel) a:03Gehel A quick read of the puppet repo indicates that we use envoy only through `include prof... [09:53:29] (03CR) 10Federico Ceratto: [C:03+2] major-upgrade.py: MariaDB major version upgrade cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1193835 (https://phabricator.wikimedia.org/T406469) (owner: 10Federico Ceratto) [09:54:15] (03CR) 10Federico Ceratto: sanitize-wiki: log into phabricator (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512) (owner: 10Federico Ceratto) [09:54:34] !log ozge@deploy2002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [09:54:45] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11322415 (10elukey) And 3002 seems in a bad state too: ` elukey@ganeti3005:~$ sudo gnt-instance console tcp-... 
[09:55:26] (03CR) 10Federico Ceratto: [C:03+2] sanitize-wiki: log into phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512) (owner: 10Federico Ceratto) [09:55:47] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] sanitize-wiki: log into phabricator (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1199301 (https://phabricator.wikimedia.org/T408512) (owner: 10Federico Ceratto) [09:55:49] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host tcp-proxy7001.magru.wmnet with OS trixie [09:56:49] 06SRE, 10envoy, 06serviceops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Upgrade Envoy to v1.29.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11322431 (10MoritzMuehlenhoff) >>! In T404867#11322413, @Gehel wrote: > A quick read of the puppet repo indicates tha... [09:57:29] 06SRE, 10envoy, 06serviceops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Upgrade Envoy to v1.32.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11322433 (10MoritzMuehlenhoff) [09:59:09] Hello, [09:59:09] We are deploying linkrecommendation. We noticed one of the pods is in status ContainerStatusUnknown. [09:59:09] Do you know what it means and if it's a reason to worry? 
[09:59:09] ``` [09:59:10] ozge@deploy2002:/srv/deployment-charts/helmfile.d/services/linkrecommendation$ kube_env linkrecommendation codfw [09:59:10] ozge@deploy2002:/srv/deployment-charts/helmfile.d/services/linkrecommendation$ kubectl get pods [09:59:11] NAME READY STATUS RESTARTS AGE [09:59:11] linkrecommendation-external-7bbdfc89cb-z9mln 3/3 Running 0 4m20s [09:59:12] linkrecommendation-internal-575b84f7c5-cq72r 0/3 ContainerStatusUnknown 5 (2d19h ago) 7d11h [09:59:12] linkrecommendation-internal-59888cbfd4-2rm7p 3/3 Running 0 4m20s [09:59:13] linkrecommendation-internal-59888cbfd4-7f58v 3/3 Running 0 4m20s [09:59:13] linkrecommendation-internal-59888cbfd4-9wczq 3/3 Running 0 3m23s [09:59:14] linkrecommendation-internal-59888cbfd4-hdszv 3/3 Running 0 3m23s [09:59:14] linkrecommendation-internal-59888cbfd4-kzwhs 3/3 Running 0 4m20s [09:59:15] linkrecommendation-internal-59888cbfd4-l2nl6 3/3 Running 0 3m17s [09:59:15] linkrecommendation-internal-59888cbfd4-plx24 3/3 Running 0 4m20s [09:59:17] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host tcp-proxy7001.magru.wmnet with OS trixie [09:59:30] (03CR) 10Marostegui: [C:03+1] site.pp, es2026.yaml: Decommission es2026 [puppet] - 10https://gerrit.wikimedia.org/r/1199311 (https://phabricator.wikimedia.org/T408385) (owner: 10Federico Ceratto) [09:59:39] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11322443 (10elukey) Tried to reimage tcp-proxy7001 while being attached to the gnt-console but I don't see an... 
[09:59:44] ozge_: let me check smth [09:59:56] (03CR) 10Federico Ceratto: [C:03+2] site.pp, es2026.yaml: Decommission es2026 [puppet] - 10https://gerrit.wikimedia.org/r/1199311 (https://phabricator.wikimedia.org/T408385) (owner: 10Federico Ceratto) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1000) [10:00:20] ozge_: container got lost [10:00:25] I'll delete it, it'll pop back up [10:00:57] ozge_: all good, proceed [10:03:29] ozge_: It was a remnant of an old deployment, as you can tell from the pod id not being 59888cbfd4 and the age of the pod. It was not interfering with anything as you can see there were 8 total running pods, which is the configured deployment size [10:04:01] Actually, wrong about that, there were only 7 [10:05:47] claime: (i was deploying together with ozge_) helm successfully finished, so i hope all is good now. it was my first time seeing this status, so i wasn't sure if it's a problem or not. thanks for checking! [10:06:31] (03PS1) 10Santiago Faci: xLab: Deploying v1.1.0 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199733 (https://phabricator.wikimedia.org/T406729) [10:07:36] urbanecm: yeah, current deployment looks ok. tbh I would have hoped helm would have deleted that pod, but somehow that didn't happen. What you could have done yourself that may have fixed it is to do a helmfile sync instead of apply [10:08:26] !log elukey@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [10:10:29] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11322471 (10elukey) I just moved all the traffic to eqiad depooling codfw. This is the last test to make sure the new stack can handle all traffic in case it is needed. [10:12:22] Awesome thank you @claime . 
We have completed the deployment [10:12:27] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:13:15] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:13:29] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:18:43] phabricator giving db issues for anyone else? [10:19:54] what specific DB issues? Phabricator itself is working fine for me [10:20:59] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-(api-int|jobrunner): serve 10% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199514 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [10:21:34] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-(api-ext|web): scale next releases to 20% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199513 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [10:22:00] (03CR) 10Alexandros Kosiaris: [C:03+1] Enroll 25% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199515 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [10:23:30] I got several 'mysql server has gone away' errors for phabricator_search. 
works again now [10:23:54] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: 10Clément Goubert) [10:24:24] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:25:47] (03PS3) 10Hnowlan: Route transform/wikitext/to/lint(.*) to the gateway on group0 [puppet] - 10https://gerrit.wikimedia.org/r/1194994 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [10:25:50] (03Merged) 10jenkins-bot: api-gateway: Release patch for ratelimit test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199331 (https://phabricator.wikimedia.org/T408128) (owner: 10Clément Goubert) [10:27:41] (03PS4) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:28:52] (03CR) 10Fabfur: [C:04-1] "still working on this" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [10:29:34] (03CR) 10Hnowlan: [C:03+1] Route transform/wikitext/to/lint(.*) to the gateway on group0 [puppet] - 10https://gerrit.wikimedia.org/r/1194994 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [10:30:32] jouncebot: nowandnext [10:30:32] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1000) [10:30:32] In 0 hour(s) and 29 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1100) [10:32:02] hnowlan: working on rest-gateway staging fyi [10:32:34] (03CR) 10Vgutierrez: P:cache:haproxy: introduce ua classes (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [10:32:40] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [10:33:09] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [10:33:23] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [10:33:53] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [10:34:10] claime: ack - cool if I merge the wikitext/to/lint patch for group0 (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1194994) while you're doing that? [10:34:34] hnowlan: puppet change? fine [10:38:17] !log pmiazga@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:38:49] !log pmiazga@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:40:16] (03CR) 10Hnowlan: [C:03+2] Route transform/wikitext/to/lint(.*) to the gateway on group0 [puppet] - 10https://gerrit.wikimedia.org/r/1194994 (https://phabricator.wikimedia.org/T385066) (owner: 10Aaron Schulz) [10:48:40] (03PS5) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:49:12] (03CR) 10Fabfur: P:cache:haproxy: introduce ua classes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [10:49:47] (03PS1) 10Federico Ceratto: instances.yaml: remove es2028 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199739 (https://phabricator.wikimedia.org/T408407) [10:49:49] (03PS1) 10Federico Ceratto: instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) [10:49:51] (03PS1) 10Federico Ceratto: instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 
(https://phabricator.wikimedia.org/T408409) [10:49:54] (03PS1) 10Federico Ceratto: instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) [10:49:56] (03PS1) 10Federico Ceratto: instances.yaml: remove es2032 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199743 (https://phabricator.wikimedia.org/T408411) [10:49:58] (03PS1) 10Federico Ceratto: instances.yaml: remove es2033 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199744 (https://phabricator.wikimedia.org/T408412) [10:50:00] (03PS1) 10Federico Ceratto: instances.yaml: remove es2034 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199745 (https://phabricator.wikimedia.org/T408414) [10:52:51] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices2004-dev.codfw.wmnet with OS trixie [10:53:36] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:55:21] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:55:46] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:00:04] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1100). 
nyaa~ [11:00:11] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:03:48] jouncebot: nowandnext [11:03:48] For the next 0 hour(s) and 56 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1100) [11:03:48] In 1 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1300) [11:04:22] (03CR) 10Mvolz: [C:03+2] Update Zotero to node22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199263 (https://phabricator.wikimedia.org/T393434) (owner: 10Mvolz) [11:04:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy7002.magru.wmnet with OS trixie [11:04:34] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199403 (owner: 10PipelineBot) [11:05:02] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11322690 (10MoritzMuehlenhoff) >>! In T408064#11322443, @elukey wrote: > Tried to reimage tcp-proxy7001 while... 
[11:06:05] (03Merged) 10jenkins-bot: Update Zotero to node22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199263 (https://phabricator.wikimedia.org/T393434) (owner: 10Mvolz) [11:07:25] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [11:07:50] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:08:26] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply [11:08:55] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:10:03] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:10:57] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:13:19] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:33] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/citoid: apply [11:15:52] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:16:49] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/citoid: apply [11:17:20] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:17:44] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:17:58] (03PS1) 10Clément Goubert: rest-gateway: Fix double policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199747 [11:18:11] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:23:34] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix double policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199747 (owner: 10Clément Goubert) [11:25:19] (03Merged) 10jenkins-bot: rest-gateway: Fix double policy [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1199747 (owner: 10Clément Goubert) [11:25:50] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:25:55] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:27:11] (03PS1) 10Majavah: perl540: Install libnet-idn-encode-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1199750 (https://phabricator.wikimedia.org/T407707) [11:28:29] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:29:10] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:29:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2108:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2108 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:30:06] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [11:32:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:33:45] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy7002.magru.wmnet with reason: host reimage [11:37:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy7002.magru.wmnet with reason: host reimage [11:38:10] (03PS6) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [11:44:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2108:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2108 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:45:17] (03CR) 10Muehlenhoff: [C:03+2] osm_sync_lag.sh: Fix default to current directory [puppet] - 10https://gerrit.wikimedia.org/r/1199265 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:47:18] (03PS1) 10Clément Goubert: api-gateway: Fix fallback_policy in lua [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199752 [11:47:26] (03CR) 10CI reject: [V:04-1] api-gateway: Fix fallback_policy in lua [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199752 (owner: 10Clément Goubert) [11:47:40] (03PS2) 10Clément Goubert: api-gateway: Fix fallback_policy in lua [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199752 [11:47:51] (03PS1) 10Vgutierrez: haproxy: Don't set X-JA4H for http traffic [puppet] - 10https://gerrit.wikimedia.org/r/1199753 (https://phabricator.wikimedia.org/T406990) [11:48:08] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199753 (https://phabricator.wikimedia.org/T406990) (owner: 10Vgutierrez) [11:49:38] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Fix fallback_policy in lua [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199752 (owner: 10Clément Goubert) [11:51:32] (03Merged) 10jenkins-bot: api-gateway: Fix fallback_policy in lua [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199752 (owner: 10Clément Goubert) [11:51:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy7002.magru.wmnet with OS trixie [11:52:20] (03CR) 10Muehlenhoff: [C:03+2] maps: Stop installing osm2pgsql and osmborder [puppet] - 10https://gerrit.wikimedia.org/r/1199271 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:53:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host 
tcp-proxy7001.magru.wmnet with OS trixie [11:53:27] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:53:37] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:53:45] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:53:59] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:57:24] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [11:57:25] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:57:48] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:00:46] (03PS2) 10Abijeet Patro: Remove wmgULSPosition for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199751 (https://phabricator.wikimedia.org/T400067) [12:01:21] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:01:33] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:01:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2108:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2108 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:04:59] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:05:17] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:07:20] Hi xSavitar, any chance you could do yesterday's deploy for me? 
[12:07:28] (03CR) 10Stevemunene: [C:03+2] druid: Increase the size of the Druid broker cache size to 4GB [puppet] - 10https://gerrit.wikimedia.org/r/1199280 (https://phabricator.wikimedia.org/T408189) (owner: 10Stevemunene) [12:08:00] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:08:19] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:20] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:08:48] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:08:55] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:10:25] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es2034 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199745 (https://phabricator.wikimedia.org/T408414) (owner: 10Federico Ceratto) [12:10:35] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es2033 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199744 (https://phabricator.wikimedia.org/T408412) (owner: 10Federico Ceratto) [12:11:03] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es2032 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199743 (https://phabricator.wikimedia.org/T408411) (owner: 10Federico Ceratto) [12:11:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2108:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2108 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:11:58] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 
(https://phabricator.wikimedia.org/T408410) (owner: 10Federico Ceratto) [12:12:18] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) (owner: 10Federico Ceratto) [12:12:25] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) (owner: 10Federico Ceratto) [12:12:33] (03CR) 10Marostegui: [C:03+1] instances.yaml: remove es2028 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199739 (https://phabricator.wikimedia.org/T408407) (owner: 10Federico Ceratto) [12:14:05] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:14:17] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:17:21] !log stevemunene@cumin1003 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[12:19:53] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:19:55] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:20:21] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:23:47] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy7001.magru.wmnet with reason: host reimage [12:24:43] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11322895 (10MoritzMuehlenhoff) [12:25:58] (03CR) 10Nikerabbit: [C:03+1] Remove wmgULSPosition for special wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199751 (https://phabricator.wikimedia.org/T400067) (owner: 10Abijeet Patro) [12:26:08] (03PS1) 10Clément Goubert: api-gateway: fix per-route action override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199757 [12:26:27] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:28:28] (03CR) 10Clément Goubert: [C:03+2] api-gateway: fix per-route action override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199757 (owner: 10Clément Goubert) [12:29:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy7001.magru.wmnet with reason: host reimage [12:30:10] (03Merged) 10jenkins-bot: api-gateway: fix per-route action override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199757 (owner: 10Clément Goubert) [12:30:50] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:31:12] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:32:18] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:32:36] !log cgoubert@deploy2002 helmfile [staging] DONE 
helmfile.d/services/rest-gateway: apply [12:33:11] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:33:59] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:34:12] (03PS1) 10Majavah: kubeadm::helm: Reduce helm-diff default context [puppet] - 10https://gerrit.wikimedia.org/r/1199761 [12:35:03] (03CR) 10David Caro: [C:03+1] kubeadm::helm: Reduce helm-diff default context [puppet] - 10https://gerrit.wikimedia.org/r/1199761 (owner: 10Majavah) [12:35:28] (03PS1) 10Kosta Harlan: product_metrics/suggested_investigations_interaction: add performer_groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199762 (https://phabricator.wikimedia.org/T404177) [12:36:07] (03PS1) 10Stevemunene: LVS: set druid-coordinator to state lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1199763 (https://phabricator.wikimedia.org/T406222) [12:36:10] (03PS1) 10Stevemunene: LVS: set druid-coordinator to state production [puppet] - 10https://gerrit.wikimedia.org/r/1199764 (https://phabricator.wikimedia.org/T406222) [12:36:16] (03CR) 10Majavah: [C:03+2] kubeadm::helm: Reduce helm-diff default context [puppet] - 10https://gerrit.wikimedia.org/r/1199761 (owner: 10Majavah) [12:38:49] (03CR) 10Mszwarc: [C:03+1] product_metrics/suggested_investigations_interaction: add performer_groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199762 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan) [12:39:06] (03CR) 10CDanis: [C:03+1] haproxy: Don't set X-JA4H for http traffic [puppet] - 10https://gerrit.wikimedia.org/r/1199753 (https://phabricator.wikimedia.org/T406990) (owner: 10Vgutierrez) [12:41:33] (03PS1) 10Clément Goubert: api-gateway: Fix apply_rate_limiting override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199765 [12:43:12] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. 
[12:43:48] (03CR) 10Clément Goubert: [C:03+2] api-gateway: Fix apply_rate_limiting override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199765 (owner: 10Clément Goubert) [12:43:48] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:44:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:45:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:45:49] (03Merged) 10jenkins-bot: api-gateway: Fix apply_rate_limiting override [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199765 (owner: 10Clément Goubert) [12:45:54] (03PS3) 10Anzx: minwikisource: add portal namespace, set sitename, timezone and project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199281 [12:46:06] (03PS4) 10Anzx: pcmwikiquote: set timezone, sitename and projectnamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199298 (https://phabricator.wikimedia.org/T408351) [12:46:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199281 (owner: 10Anzx) [12:46:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199298 (https://phabricator.wikimedia.org/T408351) (owner: 10Anzx) [12:46:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy7001.magru.wmnet with OS trixie [12:46:52] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [12:46:57] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:50:09] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: 
apply [12:50:17] I have a puppet patch for which PCC fails in CI (prod and change) because `ipresolve` fails for a given host. Is there a way to avoid failing the whole puppet compilation and just ignore the resolution error? [12:50:18] https://puppet-compiler.wmflabs.org/output/1199297/7729/deploy1003.eqiad.wmnet/prod.deploy1003.eqiad.wmnet.err [12:50:21] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [12:51:01] the underlying ruby function seems to call `fail(msg)`, which I don't know whether we can catch in puppet [12:52:21] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vrts, 10Znuny: VRTS is spammed with bounce e-mails and is going to break - https://phabricator.wikimedia.org/T408632#11322952 (10LSobanski) [12:52:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199762 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan) [12:52:45] jmm@cumin2002 reimage (PID 1847919) is awaiting input [12:53:32] (03CR) 10Brouberol: [C:03+1] Change the component from where we install elasticsearch-curator [puppet] - 10https://gerrit.wikimedia.org/r/1196942 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [12:54:14] (03CR) 10Brouberol: [C:03+1] Pin the logstash and logstash-plugins everywhere they are installed [puppet] - 10https://gerrit.wikimedia.org/r/1196057 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [12:55:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy3002.esams.wmnet with OS trixie [12:55:51] (03CR) 10Brouberol: [C:03+1] Pin the version of opensearch-dashboards wherever it is used [puppet] - 10https://gerrit.wikimedia.org/r/1196023 (https://phabricator.wikimedia.org/T407199) (owner: 10Btullis) [12:56:07] (03PS1) 10Clément Goubert: rest-gateway: Revert bad 
override fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199769 [12:58:57] (03PS2) 10Anzx: pcmwikiquote: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199768 (https://phabricator.wikimedia.org/T408351) [12:59:14] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Revert bad override fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199769 (owner: 10Clément Goubert) [12:59:24] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:59:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199768 (https://phabricator.wikimedia.org/T408351) (owner: 10Anzx) [13:00:05] Urbanecm and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1300). [13:00:05] JavierMonton, anzx, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] o/ [13:01:19] (03Merged) 10jenkins-bot: rest-gateway: Revert bad override fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199769 (owner: 10Clément Goubert) [13:02:46] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11322967 (10WMDECyn) approved again from wmde side [13:03:08] I'm here [13:03:20] but in a meeting, is anyone else available to deploy? 
[13:03:38] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:03:43] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [13:03:52] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:04:28] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [13:04:48] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:05:33] !log stevemunene@cumin1003 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [13:06:12] (03PS1) 10Anzx: minwikisource: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199774 (https://phabricator.wikimedia.org/T408343) [13:06:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199774 (https://phabricator.wikimedia.org/T408343) (owner: 10Anzx) [13:07:11] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:07:28] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:08:09] (03PS1) 10Clément Goubert: rest-gateway: Fix opt_in value name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199775 [13:08:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198626 (https://phabricator.wikimedia.org/T408284) (owner: 10Bunnypranav) [13:09:22] late scheduling, but here I am. 
(carried from yesterday's deploy, same window) [13:10:24] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11322976 (10MoritzMuehlenhoff) >>! In T408064#11322415, @elukey wrote: > And 3002 seems in a bad state too:... [13:11:50] (03CR) 10Clément Goubert: [C:03+2] rest-gateway: Fix opt_in value name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199775 (owner: 10Clément Goubert) [13:12:04] (03CR) 10Pmiazga: [C:03+1] rest-gateway: Fix opt_in value name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199775 (owner: 10Clément Goubert) [13:13:49] (03Merged) 10jenkins-bot: rest-gateway: Fix opt_in value name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199775 (owner: 10Clément Goubert) [13:14:01] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:14:22] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:15:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199762 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan) [13:16:20] if I have time after my meeting, I'll see if I can process some of the other config patches. 
For now, I'm just shipping my own [13:16:29] (03Merged) 10jenkins-bot: product_metrics/suggested_investigations_interaction: add performer_groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199762 (https://phabricator.wikimedia.org/T404177) (owner: 10Kosta Harlan) [13:17:01] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199762|product_metrics/suggested_investigations_interaction: add performer_groups (T404177)]] [13:17:08] T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177 [13:18:21] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy3002.esams.wmnet with reason: host reimage [13:19:25] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1199762|product_metrics/suggested_investigations_interaction: add performer_groups (T404177)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:20:34] (03PS2) 10Brouberol: global_config: add an urldownloader external service [puppet] - 10https://gerrit.wikimedia.org/r/1199297 (https://phabricator.wikimedia.org/T408012) [13:22:26] (03PS6) 10Seanleong-wmde: Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) [13:22:53] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:23:06] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:23:41] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:23:44] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1199297 (https://phabricator.wikimedia.org/T408012) (owner: 10Brouberol) [13:23:49] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf group for jpchev - https://phabricator.wikimedia.org/T408636#11323030 (10Novem_Linguae) If you're not a WMF employee or contractor, then you will want to request access to the LDAP group `nda`, not `wmf`. To request access to the `nda` group, you need to find... [13:24:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy3002.esams.wmnet with reason: host reimage [13:24:22] (03CR) 10JMeybohm: [C:04-1] "I'm pretty sure the statement `This requires TLS to be enabled as well` is still true and valid. 
Maybe just remove the reference to _tls_h" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) (owner: 10Bking) [13:26:05] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:27:29] I'm here too, sorry for the late reply [13:27:30] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11323050 (10MoritzMuehlenhoff) [13:27:34] !log kharlan@deploy2002 kharlan: Continuing with sync [13:28:40] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [13:29:02] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [13:29:06] 06SRE, 06Data-Engineering: stat1011: cannot create directory ‘/srv/published/datasets/one-off’: Permission denied - https://phabricator.wikimedia.org/T408641#11323056 (10Ottomata) ` 13:26:24 [@stat1011:/home/otto] $ ls -la /srv/published/ total 28 drwxrwxr-x 6 root wikidev 4096 Oct 31 2024 . dr... 
[13:29:42] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [13:30:05] (03PS1) 10Majavah: kubeadm::helm: Fix env variable type [puppet] - 10https://gerrit.wikimedia.org/r/1199776 [13:30:21] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [13:30:34] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [13:30:49] (03CR) 10Brouberol: [C:03+2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199297 (https://phabricator.wikimedia.org/T408012) (owner: 10Brouberol) [13:31:05] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [13:31:11] !log upgrade Envoy on debmonitor* T405808 [13:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:16] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [13:31:19] (03CR) 10Majavah: [C:03+2] kubeadm::helm: Fix env variable type [puppet] - 10https://gerrit.wikimedia.org/r/1199776 (owner: 10Majavah) [13:31:49] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199762|product_metrics/suggested_investigations_interaction: add performer_groups (T404177)]] (duration: 14m 48s) [13:31:54] T404177: Instrumentation for Suggested Investigations - https://phabricator.wikimedia.org/T404177 [13:33:14] (03CR) 10Seanleong-wmde: Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [13:34:24] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:34:58] (03CR) 10Clare Ming: [C:03+2] xLab: Deploying v1.1.0 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199733 (https://phabricator.wikimedia.org/T406729) (owner: 10Santiago Faci) [13:37:01] (03Merged) 10jenkins-bot: xLab: Deploying v1.1.0 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199733 (https://phabricator.wikimedia.org/T406729) (owner: 10Santiago Faci) [13:37:25] 06SRE, 10envoy, 06serviceops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Upgrade Envoy to v1.32.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11323077 (10Gehel) Tested on wdqs2025 (our usual test node): ` gehel@wdqs2025:~$ sudo apt install -y envoyproxy=1.32... 
[13:40:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: 10Seanleong-wmde) [13:40:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy3002.esams.wmnet with OS trixie [13:41:39] (03CR) 10Vgutierrez: [C:03+2] haproxy: Don't set X-JA4H for http traffic [puppet] - 10https://gerrit.wikimedia.org/r/1199753 (https://phabricator.wikimedia.org/T406990) (owner: 10Vgutierrez) [13:43:46] !log deploying envoy 1.32.12-1 + restart on W[CD]QS nodes - T404867 [13:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:55] T404867: Upgrade Envoy to v1.32.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867 [13:46:39] xSavitar: Any chance you could do the deploy? [13:47:07] 06SRE, 10envoy, 06serviceops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Upgrade Envoy to v1.32.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11323092 (10Gehel) 05Open→03Resolved [13:50:07] (03PS1) 10Vgutierrez: benthos::webrequest: Provide X-Is-Browser data [puppet] - 10https://gerrit.wikimedia.org/r/1199781 [13:50:56] 06SRE, 10envoy, 06serviceops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Upgrade Envoy to v1.32.12 on wcqs and wdqs hosts - https://phabricator.wikimedia.org/T404867#11323106 (10Gehel) Manual test works. [[ https://grafana.wikimedia.org/goto/dU16ulgDg?orgId=1 | Graphs ]] still l... [13:55:19] (03PS1) 10Xcollazo: dumps: Link to new MW Content File Export. Deprecate legacy XML dumps. 
[puppet] - 10https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) [13:55:37] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for slyngshede - https://phabricator.wikimedia.org/T408689 (10SLyngshede-WMF) 03NEW [13:56:36] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. [13:58:00] (03PS1) 10Slyngshede: data.yaml: Grant slyngshede access to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1199784 (https://phabricator.wikimedia.org/T408689) [13:59:06] bunnypranav, hey, sorry I'm just coming to IRC now. [13:59:19] Did another deployer help with deploying your patch? [13:59:30] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) (owner: 10Huei Tan) [13:59:43] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for slyngshede - https://phabricator.wikimedia.org/T408689#11323158 (10SLyngshede-WMF) @ssingh - for manager sign off @Ottomata - Group access approval [13:59:48] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1400) [14:00:19] (03CR) 10Huei Tan: "yes please, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) (owner: 10Huei Tan) [14:00:38] no one was online actually [14:00:46] no worries btw [14:01:22] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for slyngshede - https://phabricator.wikimedia.org/T408689#11323162 (10ssingh) >>! In T408689#11323157, @SLyngshede-WMF wrote: > @ssingh - for manager sign off > > @Ottomata - Group access approval Approved! 
[14:01:58] (03CR) 10Urbanecm: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) (owner: 10Michael Große) [14:02:07] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for slyngshede - https://phabricator.wikimedia.org/T408689#11323169 (10SLyngshede-WMF) [14:02:31] bunnypranav, thanks! Late window would be late for you as I remember but I'll be available. [14:02:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [14:03:26] (03CR) 10Muehlenhoff: [C:03+2] Enable the Prometheus exporter for the Ganeti CA on Ganeti masters [puppet] - 10https://gerrit.wikimedia.org/r/1196634 (https://phabricator.wikimedia.org/T382902) (owner: 10Muehlenhoff) [14:04:25] xSavitar, would an off-window deploy be possible? I mean deploy it now, if you are available. [14:04:45] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for slyngshede - https://phabricator.wikimedia.org/T408689#11323185 (10MoritzMuehlenhoff) >>! In T408689#11323157, @SLyngshede-WMF wrote: > @Ottomata - Group access approval This isn't needed anymore for WMF s... [14:06:24] bunnypranav, the window now is owned by Wikifunctions team: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1400 [14:06:34] ah, i see. [14:06:42] !log fceratto@cumin1003 dbctl commit (dc=all): 'Depool es2027 T408406', diff saved to https://phabricator.wikimedia.org/P84346 and previous config saved to /var/cache/conftool/dbconfig/20251029-140641-fceratto.json [14:06:47] T408406: decommission es2027 - https://phabricator.wikimedia.org/T408406 [14:06:50] Fair enough; will have to see tomorrow then. 
[14:06:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T407997)', diff saved to https://phabricator.wikimedia.org/P84347 and previous config saved to /var/cache/conftool/dbconfig/20251029-140652-marostegui.json [14:06:57] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:07:26] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:08:14] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:08:18] bunnypranav, I'll add something to my calendar to deploy your patch tomorrow if someone else doesn't deploy before me. That way I can be reminded I have something to deploy. [14:08:19] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:08:28] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1003 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [14:08:28] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 31.35 ms [14:09:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:09:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T407997)', diff saved to https://phabricator.wikimedia.org/P84348 and previous config saved to /var/cache/conftool/dbconfig/20251029-140902-marostegui.json [14:09:15] !log bking@deploy2002 helmfile 
[dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:09:52] xSavitar: Sure, thank you so much ! [14:10:31] Ack [14:11:50] (PS1) Muehlenhoff: prometheus/ganeti: Fix typo in systemd timer job name [puppet] - https://gerrit.wikimedia.org/r/1199789 (https://phabricator.wikimedia.org/T382902) [14:12:32] (PS2) Brouberol: global_config: urldownloader hostnames end with wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1199778 (https://phabricator.wikimedia.org/T408012) [14:12:42] (CR) Brouberol: [C:+2] global_config: urldownloader hostnames end with wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1199778 (https://phabricator.wikimedia.org/T408012) (owner: Brouberol) [14:13:19] RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:13:51] (CR) Muehlenhoff: [C:+2] prometheus/ganeti: Fix typo in systemd timer job name [puppet] - https://gerrit.wikimedia.org/r/1199789 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff) [14:17:48] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics_privatedata_users for slyngshede - https://phabricator.wikimedia.org/T408689#11323263 (SLyngshede-WMF) [14:18:00] (CR) Slyngshede: [C:+2] data.yaml: Add an additional FIDO ssh key for slyngshede [puppet] - https://gerrit.wikimedia.org/r/1196796 (owner: Slyngshede) [14:21:07] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11323275 (Jhancock.wm) @elukey i couldn't find anything in particular that is causing this. I did upgrade the bios and idrac. that should help mitigate it. I do know this ser...
[14:23:29] (CR) Muehlenhoff: [C:+1] "Good to merge" [puppet] - https://gerrit.wikimedia.org/r/1199784 (https://phabricator.wikimedia.org/T408689) (owner: Slyngshede) [14:24:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P84349 and previous config saved to /var/cache/conftool/dbconfig/20251029-142410-marostegui.json [14:24:24] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:24:31] (PS1) Vgutierrez: haproxy,varnish: Report X-Is-Brower back from varnish [puppet] - https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) [14:24:51] (PS2) Vgutierrez: haproxy,varnish: Report X-Is-Browser back from varnish [puppet] - https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) [14:25:29] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics_privatedata_users for slyngshede - https://phabricator.wikimedia.org/T408689#11323282 (SLyngshede-WMF) p:Triage→Low [14:27:46] (PS2) Slyngshede: data.yaml: Grant slyngshede access to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/1199784 (https://phabricator.wikimedia.org/T408689) [14:28:18] (CR) Vgutierrez: [C:-1] "test-cookbook fails with the following error:" [cookbooks] - https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: CDobbins) [14:29:39] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11323302 (elukey) >>! In T406656#11323088, @bking wrote: >> I think that you are trying to impose your view on how things should be w...
[14:30:00] (CR) Slyngshede: [C:+2] data.yaml: Grant slyngshede access to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/1199784 (https://phabricator.wikimedia.org/T408689) (owner: Slyngshede) [14:30:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1400) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1430) [14:31:12] SRE, SRE-Access-Requests: Requesting access to analytics_privatedata_users for slyngshede - https://phabricator.wikimedia.org/T408689#11323307 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF [14:31:14] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) (owner: Vgutierrez) [14:31:18] (CR) CDobbins: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7496/console" [puppet] - https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: Andrew Bogott) [14:32:24] (CR) Elukey: Add the sre.hosts.powercycle cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1198928 (owner: Elukey) [14:36:05] (CR) JHathaway: Add the sre.hosts.powercycle cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1198928 (owner: Elukey) [14:36:08] (CR) Andrew Bogott: [C:+2] dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: Andrew Bogott) [14:37:08] (PS6) Andrew Bogott: dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) [14:37:08] (PS6) Andrew Bogott: cloudservices2004-dev.yaml:
use new, yaml-style pdns-recursor config [puppet] - https://gerrit.wikimedia.org/r/1199512 [14:37:46] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve2001 [14:38:01] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1199512 (owner: Andrew Bogott) [14:38:01] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11323359 (ops-monitoring-bot) Host ml-serve2001 powercycled by elukey@cumin2002 with reason: Testing powercycle cookbook [14:39:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P84350 and previous config saved to /var/cache/conftool/dbconfig/20251029-143918-marostegui.json [14:39:34] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:38] ops-eqiad, SRE, DBA, DC-Ops, decommission-hardware: decommission es1032.eqiad.wmnet - https://phabricator.wikimedia.org/T408662#11323367 (Jclark-ctr) a:Jclark-ctr [14:40:27] (CR) Andrew Bogott: [C:+2] dnsrecursor: fix handling of auth_zones for wikimedia-common.yml.erb [puppet] - https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: Andrew Bogott) [14:40:28] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve2001 [14:41:07] (CR) Elukey: Add the sre.hosts.powercycle cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1198928 (owner: Elukey) [14:41:47] (CR) Andrew Bogott: [C:+2] cloudservices2004-dev.yaml: use new, yaml-style pdns-recursor config [puppet] - https://gerrit.wikimedia.org/r/1199512 (owner: Andrew Bogott) [14:42:02] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [14:42:11] (PS11) Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1198928
[14:44:11] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts tcp-proxy2001.codfw.wmnet [14:46:16] (CR) Fabfur: [C:+1] "lgtm, good job!" [puppet] - https://gerrit.wikimedia.org/r/1196543 (https://phabricator.wikimedia.org/T403220) (owner: Scott French) [14:49:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:49:29] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:50:47] (PS1) Elukey: CHANGELOG: add changelogs for release v12.0.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1199801 [14:51:05] (CR) Arlolra: ExtensionDistributor: Mark 1.45 as beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: Arlolra) [14:54:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T407997)', diff saved to https://phabricator.wikimedia.org/P84351 and previous config saved to /var/cache/conftool/dbconfig/20251029-145425-marostegui.json [14:54:34] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [14:54:43] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:54:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T407997)', diff saved to https://phabricator.wikimedia.org/P84352 and previous config saved to /var/cache/conftool/dbconfig/20251029-145450-marostegui.json [14:56:00] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:56:02] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:56:57] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START
helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:57:22] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [14:57:38] (CR) Vgutierrez: P:cache:haproxy: introduce ua classes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: Fabfur) [14:59:00] (CR) CI reject: [V:-1] CHANGELOG: add changelogs for release v12.0.0 [software/spicerack] - https://gerrit.wikimedia.org/r/1199801 (owner: Elukey) [14:59:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T407997)', diff saved to https://phabricator.wikimedia.org/P84353 and previous config saved to /var/cache/conftool/dbconfig/20251029-145901-marostegui.json [15:01:27] (PS7) Fabfur: P:cache:haproxy: introduce ua classes [puppet] - https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [15:01:36] (CR) Fabfur: P:cache:haproxy: introduce ua classes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: Fabfur) [15:01:54] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1199778 (https://phabricator.wikimedia.org/T408012) (owner: Brouberol) [15:03:15] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:03:22] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:03:38] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1199778 (https://phabricator.wikimedia.org/T408012) (owner: Brouberol) [15:04:04] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:04:47] !log reboot lvs2014 (T407110) [15:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox
(exit_code=0) [15:06:05] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: Arlolra) [15:06:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts tcp-proxy2001.codfw.wmnet [15:06:14] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs2014.codfw.wmnet [15:06:24] SRE, collaboration-services, Infrastructure-Foundations, vm-requests, Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11323515 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `tcp-pr... [15:06:40] (PS1) Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) [15:06:47] (CR) CI reject: [V:-1] Add an alert for Ganeti CA expiry [alerts] - https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff) [15:06:58] (PS1) Dpogorzelski: topic: add dpogorzelski to ops [puppet] - https://gerrit.wikimedia.org/r/1199810 (https://phabricator.wikimedia.org/T408702) [15:07:29] SRE: offline rackspace wikitech-static, online aws wikitech-static - https://phabricator.wikimedia.org/T408704 (RobH) NEW p:Triage→High [15:08:19] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:46] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2028 - Depool es2028 T408407 [15:08:50] T408407: decommission es2028 -
https://phabricator.wikimedia.org/T408407 [15:09:15] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2028 - Depool es2028 T408407 [15:09:20] SRE: offline rackspace wikitech-static, online aws wikitech-static - https://phabricator.wikimedia.org/T408704#11323568 (RobH) Pinged in IRC and it is uncertain if this can be killed this month. Options: * Deactivate Rackspace ** The coupa CID 5302 expires this month. * Renew Rackspace coupa Contract for m... [15:10:17] SRE, Data-Engineering: stat1011: cannot create directory ‘/srv/published/datasets/one-off’: Permission denied - https://phabricator.wikimedia.org/T408641#11323570 (Addshore) Open→Resolved a:Addshore Success! ty [15:10:24] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2029 - Depool es2029 T408408 [15:10:30] T408408: decommission es2029 - https://phabricator.wikimedia.org/T408408 [15:10:31] !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2029 - Depool es2029 T408408 [15:11:10] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2028 - Depool es2028 T408407 [15:11:17] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2028 - Depool es2028 T408407 [15:12:13] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2032 - Depool es2032 T408411 [15:12:19] T408411: decommission es2032 - https://phabricator.wikimedia.org/T408411 [15:12:32] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2032 - Depool es2032 T408411 [15:12:46] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2033 - Depool es2033 T408412 [15:12:50] T408412: decommission es2033 - https://phabricator.wikimedia.org/T408412 [15:13:03] (CR) Elukey: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/1199801 (owner: Elukey) [15:13:08] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:14] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2033 - Depool es2033 T408412 [15:13:24] !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2034 - Depool es2034 T408414 [15:13:31] T408414: decommission es2034 - https://phabricator.wikimedia.org/T408414 [15:13:42] !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2034 - Depool es2034 T408414 [15:13:57] (CR) Federico Ceratto: [C:+2] instances.yaml: remove es2028 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1199739 (https://phabricator.wikimedia.org/T408407) (owner: Federico Ceratto) [15:14:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P84358 and previous config saved to /var/cache/conftool/dbconfig/20251029-151409-marostegui.json [15:16:48] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:18:00] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:19:38] (PS2) Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) [15:21:08] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:21:11] (CR) CI reject: [V:-1] Add an alert for Ganeti CA expiry [alerts] - https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff) [15:21:50] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:22:35] (CR) Ssingh: "Post-merge comment: note that the new YAML-style config has not been rolled out anywhere since that is on pdns-recursor 5 and trixie, and " [puppet] - https://gerrit.wikimedia.org/r/1199534 (https://phabricator.wikimedia.org/T381608) (owner: Andrew Bogott) [15:22:58] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:24:41] !log fceratto@cumin1003 dbctl commit (dc=all): 'Remove es2028 from dbctl T408407', diff saved to https://phabricator.wikimedia.org/P84359 and previous config saved to /var/cache/conftool/dbconfig/20251029-152440-fceratto.json [15:24:47] T408407: decommission es2028 - https://phabricator.wikimedia.org/T408407 [15:26:08] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:26:16] PROBLEM - Host lvs2014 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:52] (CR) DLynch: [C:+1] "Seems fine to me from the EditCheck end.
(Now that we have EditCheck enabled everywhere but enwiki, we may want to revisit this beta confi" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) (owner: Michael Große) [15:28:33] <_joe_> !log restarted mailman3-web on lists1004 [15:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:56] <_joe_> uh is anyone working on lvs2014? [15:28:58] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:29:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: JavierMonton) [15:29:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P84360 and previous config saved to /var/cache/conftool/dbconfig/20251029-152916-marostegui.json [15:29:34] looks like fabfur just rebooted it [15:30:10] RECOVERY - Host lvs2014 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [15:30:11] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2014.codfw.wmnet [15:30:14] yeah fabfur is working on it.
so "expected" (it's the backup host) [15:30:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:31:40] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54973 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:31:58] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:32:54] yeah, sorry it took longer than expected [15:32:54] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:33:05] !log reboot lvs2012 (T407110) [15:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:19] FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:34:58] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:35:37] (PS5) Brouberol: global_config: stop relying on DNS to translate FQDNs into IP addresses [puppet] - https://gerrit.wikimedia.org/r/1199813 (https://phabricator.wikimedia.org/T408706) [15:36:06] PROBLEM - PyBal backends health check on lvs2012 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:36:26] PROBLEM - pybal on
lvs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:36:40] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:36:56] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs2012.codfw.wmnet [15:36:57] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2012.codfw.wmnet [15:37:02] SRE, collaboration-services, Infrastructure-Foundations: puppetdb import job on netbox fails - Cannot retrieve PuppetDB 'networking' facts for new VMs - https://phabricator.wikimedia.org/T408646#11323720 (Dzahn) Open→Resolved a:Dzahn Hey @elukey thanks a lot for taking a look at thi... [15:37:03] (PS3) Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) [15:37:36] (PS1) Marco Fossati: Localisation updates from https://translatewiki.net. [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199814 [15:38:06] !log fabfur@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2012.codfw.wmnet with reason: T407110 [15:39:00] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - https://gerrit.wikimedia.org/r/1199814 (owner: Marco Fossati) [15:41:00] SRE, collaboration-services, Infrastructure-Foundations, vm-requests, Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11323732 (Dzahn) Thank you @elukey and @MoritzMuehlenhoff for looking at this. I was suspecting it's someho...
[15:41:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host tcp-proxy2001.codfw.wmnet [15:41:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [15:42:54] SRE-Access-Requests: Access for new ML team SRE - https://phabricator.wikimedia.org/T408579#11323757 (elukey) [15:43:00] (CR) Andrea Denisse: [C:+2] alertmanager: route Language and Product Localization team alerts [puppet] - https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) (owner: Huei Tan) [15:43:24] SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11323764 (elukey) [15:44:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T407997)', diff saved to https://phabricator.wikimedia.org/P84361 and previous config saved to /var/cache/conftool/dbconfig/20251029-154424-marostegui.json [15:44:30] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [15:44:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:44:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2169 (T407997)', diff saved to https://phabricator.wikimedia.org/P84362 and previous config saved to /var/cache/conftool/dbconfig/20251029-154448-marostegui.json [15:45:04] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - jmm@cumin2002" [15:45:55] !log upgrade Envoy on people* T405808 [15:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:59] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [15:46:41] SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups -
https://phabricator.wikimedia.org/T408579#11323785 (DPogorzelski-WMF) @calbon if you can please approve :) [15:46:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T407997)', diff saved to https://phabricator.wikimedia.org/P84363 and previous config saved to /var/cache/conftool/dbconfig/20251029-154659-marostegui.json [15:47:03] !log upgrade Envoy on releases* T405808 [15:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:19] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs2012.codfw.wmnet [15:47:20] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs2012.codfw.wmnet [15:47:31] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs2012.codfw.wmnet [15:47:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM tcp-proxy2001.codfw.wmnet - jmm@cumin2002" [15:47:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:47:56] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache tcp-proxy2001.codfw.wmnet on all recursors [15:47:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) tcp-proxy2001.codfw.wmnet on all recursors [15:48:25] !log dancy@deploy2002 Installing scap version "4.219.0" for 165 host(s) [15:48:32] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy2001.codfw.wmnet - jmm@cumin2002" [15:48:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM tcp-proxy2001.codfw.wmnet - jmm@cumin2002" [15:48:43] (CR) Joely Rooke WMDE: [C:+1] Add feature flag for pilot wikis about visual changes coming from Wikibase
having an icon. [mediawiki-config] - https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: Seanleong-wmde) [15:49:09] (PS4) Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) [15:49:22] (PS1) Bking: dse-k8s-eqiad: Raise default resources of for opensearch namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1199818 (https://phabricator.wikimedia.org/T357753) [15:49:24] SRE, SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11323805 (seanleong-WMDE) [15:49:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy2001.codfw.wmnet with OS trixie [15:49:57] SRE, collaboration-services, Infrastructure-Foundations, vm-requests, Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11323809 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host tc... [15:50:07] !log upgrade Envoy on zuul* T405808 [15:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:38] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2012.codfw.wmnet [15:51:06] PROBLEM - Host lvs2012 is DOWN: PING CRITICAL - Packet loss = 100% [15:51:06] RECOVERY - Host lvs2012 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [15:51:18] PROBLEM - Check if Pybal has been restarted after pybal.conf was changed on lvs2012 is CRITICAL: CRITICAL: Service pybal.service is not active.
https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [15:51:26] PROBLEM - pybal on lvs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:52:06] RECOVERY - PyBal backends health check on lvs2012 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:52:19] (03PS1) 10Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) [15:52:23] !log dancy@deploy2002 Installation of scap version "4.219.0" completed for 165 hosts [15:52:26] RECOVERY - pybal on lvs2012 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:52:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [15:52:32] (03PS4) 10Clément Goubert: trafficserver: action api to rest-gateway group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198930 (https://phabricator.wikimedia.org/T408223) [15:52:34] RECOVERY - Check if Pybal has been restarted after pybal.conf was changed on lvs2012 is OK: OK: pybal.service was restarted after /etc/pybal/pybal.conf was changed. https://wikitech.wikimedia.org/wiki/PyBal%23Pybal_service_has_not_been_restarted [15:53:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [15:53:35] (03PS2) 10Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) [15:53:41] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11323832 (10seanleong-WMDE) Verified the key out of band and added more details to the ticket description as part of the conversation with @Dzahn And thanks @WMDECyn and @thcipriani... 
[15:55:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Pool db2169 with full weight', diff saved to https://phabricator.wikimedia.org/P84364 and previous config saved to /var/cache/conftool/dbconfig/20251029-155520-marostegui.json [15:55:32] !log upgrade Envoy on doc* T405808 [15:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:37] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [15:55:58] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198930 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [15:55:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [15:56:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T407997)', diff saved to https://phabricator.wikimedia.org/P84365 and previous config saved to /var/cache/conftool/dbconfig/20251029-155605-marostegui.json [15:56:11] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:01:18] (03PS2) 10Gehel: hadoop: cleanup /tmp from directories as well as files [puppet] - 10https://gerrit.wikimedia.org/r/1199334 (https://phabricator.wikimedia.org/T396582) [16:02:44] !log upgrade Envoy on planet* T405808 [16:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:49] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [16:03:25] (03CR) 10Gehel: [C:03+2] hadoop: cleanup /tmp from directories as well as files [puppet] - 10https://gerrit.wikimedia.org/r/1199334 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [16:04:29] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission es2026 - https://phabricator.wikimedia.org/T408385#11323878 (10FCeratto-WMF) [16:04:31] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T407997)', diff saved to https://phabricator.wikimedia.org/P84366 and previous config saved to /var/cache/conftool/dbconfig/20251029-160430-marostegui.json [16:04:35] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:05:18] (03CR) 10Muehlenhoff: hadoop: cleanup /tmp from directories as well as files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199334 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [16:07:51] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy2001.codfw.wmnet with reason: host reimage [16:08:31] 06SRE, 05Vuln-Infoleak: Tomcat Stacktrace Disclosure – idp-test.wikimedia.org - https://phabricator.wikimedia.org/T408714 (10sbassett) 03NEW [16:08:43] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:09:29] (03CR) 10BCornwall: [C:03+2] varnishtest: Remove logfile support [puppet] - 10https://gerrit.wikimedia.org/r/1199068 (https://phabricator.wikimedia.org/T408202) (owner: 10BCornwall) [16:10:06] !log upgrade Envoy on stewards* T405808 [16:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:13] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [16:11:21] (03CR) 10Urbanecm: [C:03+2] beta: Enable ReviseTone Structured Task on enwiki,frwiki,arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) (owner: 10Michael Große) [16:11:56] !log upgrade Envoy on etherpad* T405808 [16:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:29] 
(03Merged) 10jenkins-bot: beta: Enable ReviseTone Structured Task on enwiki,frwiki,arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198023 (https://phabricator.wikimedia.org/T405176) (owner: 10Michael Große) [16:12:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy2001.codfw.wmnet with reason: host reimage [16:12:57] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715 (10MatthewVernon) 03NEW [16:13:25] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11323972 (10MatthewVernon) [16:14:49] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11323977 (10MatthewVernon) p:05Triage→03High [16:18:08] (03PS2) 10Federico Ceratto: instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) [16:18:08] (03PS2) 10Federico Ceratto: instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) [16:18:08] (03PS2) 10Federico Ceratto: instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) [16:18:09] (03PS2) 10Federico Ceratto: instances.yaml: remove es2032 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199743 (https://phabricator.wikimedia.org/T408411) [16:18:09] (03PS2) 10Federico Ceratto: instances.yaml: remove es2033 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199744 (https://phabricator.wikimedia.org/T408412) [16:18:10] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on 
people1004.eqiad.wmnet with reason: decom [16:18:11] (03PS2) 10Federico Ceratto: instances.yaml: remove es2034 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199745 (https://phabricator.wikimedia.org/T408414) [16:18:36] !log shutting down people1004.eqiad.wmnet, people2003.codfw.wmnet - T408713 T402596 [16:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:50] T408713: decom old people VMs / finish people host upgrade - https://phabricator.wikimedia.org/T408713 [16:18:52] T402596: upgrade people servers to trixie - https://phabricator.wikimedia.org/T402596 [16:19:04] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on people2003.codfw.wmnet with reason: decom [16:19:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P84367 and previous config saved to /var/cache/conftool/dbconfig/20251029-161938-marostegui.json [16:20:02] !log `sgimeno@deploy2002:~$ mwscript-k8s --comment="T407366" --dblist="growthexperiments" --follow -- GrowthExperiments:purgeExpiredMentorStatus.php` (T407366) [16:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:07] T407366: purgeExpiredMentorStatus.php fails with in nawiki with error "Key Mentors is missing" - https://phabricator.wikimedia.org/T407366 [16:22:02] (03PS1) 10Vgutierrez: varnish: Fix requestctl deprecation stub generation [puppet] - 10https://gerrit.wikimedia.org/r/1199823 [16:22:21] (03CR) 10Clément Goubert: [C:03+2] Route "/api/rest_v1/?spec" requests to the rest gateway [puppet] - 10https://gerrit.wikimedia.org/r/1177515 (https://phabricator.wikimedia.org/T397203) (owner: 10Aaron Schulz) [16:22:33] (03PS3) 10Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) [16:22:51] (03PS3) 
10Federico Ceratto: instances.yaml: remove es2032 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199743 (https://phabricator.wikimedia.org/T408411) [16:22:51] (03PS3) 10Federico Ceratto: instances.yaml: remove es2033 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199744 (https://phabricator.wikimedia.org/T408412) [16:22:51] (03PS3) 10Federico Ceratto: instances.yaml: remove es2034 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199745 (https://phabricator.wikimedia.org/T408414) [16:22:51] (03PS3) 10Federico Ceratto: instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) [16:22:52] (03PS3) 10Federico Ceratto: instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) [16:22:54] (03PS3) 10Federico Ceratto: instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) [16:23:39] (03PS4) 10Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) [16:24:45] (03Abandoned) 10Elukey: CHANGELOG: add changelogs for release v12.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1199801 (owner: 10Elukey) [16:25:14] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11324052 (10Dzahn) a:05thcipriani→03Dzahn [16:25:22] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2032 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199743 (https://phabricator.wikimedia.org/T408411) (owner: 10Federico Ceratto) [16:27:12] !log fceratto@cumin1003 dbctl commit (dc=all): 'Remove es2032 from dbctl T408411', diff saved to https://phabricator.wikimedia.org/P84368 and previous config saved to 
/var/cache/conftool/dbconfig/20251029-162711-fceratto.json [16:27:23] T408411: decommission es2032 - https://phabricator.wikimedia.org/T408411 [16:28:09] (03PS4) 10Federico Ceratto: instances.yaml: remove es2033 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199744 (https://phabricator.wikimedia.org/T408412) [16:28:09] (03PS4) 10Federico Ceratto: instances.yaml: remove es2034 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199745 (https://phabricator.wikimedia.org/T408414) [16:28:09] (03PS4) 10Federico Ceratto: instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) [16:28:09] (03PS4) 10Federico Ceratto: instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) [16:28:10] (03PS4) 10Federico Ceratto: instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) [16:28:29] (03CR) 10Elukey: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1199723 (owner: 10Giuseppe Lavagetto) [16:29:05] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2033 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199744 (https://phabricator.wikimedia.org/T408412) (owner: 10Federico Ceratto) [16:29:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy2001.codfw.wmnet with OS trixie [16:29:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host tcp-proxy2001.codfw.wmnet [16:29:32] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11324070 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host tcp-pr... 
[16:30:19] (03PS5) 10Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) [16:30:22] !log fceratto@cumin1003 dbctl commit (dc=all): 'Remove es2033 from dbctl T408412', diff saved to https://phabricator.wikimedia.org/P84369 and previous config saved to /var/cache/conftool/dbconfig/20251029-163021-fceratto.json [16:30:30] T408412: decommission es2033 - https://phabricator.wikimedia.org/T408412 [16:30:35] (03CR) 10BCornwall: [V:03+2 C:03+1] "Tests are happy!" [puppet] - 10https://gerrit.wikimedia.org/r/1199823 (owner: 10Vgutierrez) [16:31:39] (03CR) 10Vgutierrez: [C:03+2] varnish: Fix requestctl deprecation stub generation [puppet] - 10https://gerrit.wikimedia.org/r/1199823 (owner: 10Vgutierrez) [16:31:43] (03PS1) 10Dzahn: admin: upgrade seanleong-wmde to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1199824 (https://phabricator.wikimedia.org/T406592) [16:32:50] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11324097 (10MoritzMuehlenhoff) >>! In T408064#11304916, @Dzahn wrote: > tcp-proxy2001 had problems I will get... [16:33:33] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1199824 (https://phabricator.wikimedia.org/T406592) (owner: 10Dzahn) [16:33:55] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1199332 (https://phabricator.wikimedia.org/T408378) (owner: 10Cathal Mooney) [16:34:11] (03CR) 10Dzahn: [C:03+2] admin: upgrade seanleong-wmde to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1199824 (https://phabricator.wikimedia.org/T406592) (owner: 10Dzahn) [16:34:15] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11324108 (10Dzahn) Gotcha. Thank you, Moritz [16:34:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20251029-163446-marostegui.json [16:34:49] (03CR) 10Gehel: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1198206 (https://phabricator.wikimedia.org/T408063) (owner: 10Ryan Kemper) [16:35:24] !log welcome new deployer Sean Leong - https://meta.wikimedia.org/wiki/User:Sean_Leong_(WMDE) T406592 [16:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:30] T406592: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592 [16:36:20] (03CR) 10Federico Ceratto: [C:03+2] instances.yaml: remove es2034 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199745 (https://phabricator.wikimedia.org/T408414) (owner: 10Federico Ceratto) [16:39:00] !log fceratto@cumin1003 dbctl commit (dc=all): 'Remove es2034 from dbctl T408414', diff saved to https://phabricator.wikimedia.org/P84371 and previous config saved to /var/cache/conftool/dbconfig/20251029-163859-fceratto.json [16:39:04] T408414: decommission es2034 - https://phabricator.wikimedia.org/T408414 [16:41:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:43:43] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:09] (03PS4) 10Federico Ceratto: site.pp, es2027.yaml: Decommission es2027 [puppet] - 10https://gerrit.wikimedia.org/r/1199821 (https://phabricator.wikimedia.org/T408406) [16:44:09] (03PS5) 10Federico Ceratto: instances.yaml: remove es2029 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199740 (https://phabricator.wikimedia.org/T408408) [16:44:09] (03PS5) 10Federico Ceratto: instances.yaml: remove es2030 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199741 (https://phabricator.wikimedia.org/T408409) [16:44:09] (03PS5) 10Federico Ceratto: instances.yaml: remove es2031 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1199742 (https://phabricator.wikimedia.org/T408410) [16:44:10] (03PS1) 10Federico Ceratto: site.pp, es2028.yaml: Decommission es2028 [puppet] - 10https://gerrit.wikimedia.org/r/1199825 (https://phabricator.wikimedia.org/T408407) [16:46:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11324199 (10Dzahn) 05In progress→03Resolved @seanleong-WMDE Your user has now been created on the deployment server. You should be good to go. Let us kn... [16:46:41] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11324202 (10Papaul) We still have an ongoing email section going on with Juniper on this to understanding why in Eqiad the power is balance o... 
[16:47:46] (03CR) 10Cathal Mooney: [C:03+2] team-netops: ospf alert: add pint disable promql/series [alerts] - 10https://gerrit.wikimedia.org/r/1199332 (https://phabricator.wikimedia.org/T408378) (owner: 10Cathal Mooney) [16:48:12] 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11324214 (10elukey) [16:48:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11324215 (10Dzahn) 05Resolved→03Open [16:49:04] (03Merged) 10jenkins-bot: team-netops: ospf alert: add pint disable promql/series [alerts] - 10https://gerrit.wikimedia.org/r/1199332 (https://phabricator.wikimedia.org/T408378) (owner: 10Cathal Mooney) [16:49:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T407997)', diff saved to https://phabricator.wikimedia.org/P84372 and previous config saved to /var/cache/conftool/dbconfig/20251029-164954-marostegui.json [16:50:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:50:11] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [16:50:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T407997)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20251029-165010-marostegui.json [16:50:52] (03PS1) 10Dzahn: admin: update SSH key for Sean Leong [puppet] - 10https://gerrit.wikimedia.org/r/1199827 (https://phabricator.wikimedia.org/T406592) [16:51:16] (03CR) 10CI reject: [V:04-1] admin: update SSH key for Sean Leong [puppet] - 10https://gerrit.wikimedia.org/r/1199827 (https://phabricator.wikimedia.org/T406592) (owner: 10Dzahn) [16:51:32] (03PS2) 10Dzahn: admin: update SSH key for Sean Leong 
[puppet] - 10https://gerrit.wikimedia.org/r/1199827 (https://phabricator.wikimedia.org/T406592) [16:52:15] (03CR) 10Andrea Denisse: "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1199827 (https://phabricator.wikimedia.org/T406592) (owner: 10Dzahn) [16:53:05] (03CR) 10Dzahn: [C:03+2] admin: update SSH key for Sean Leong [puppet] - 10https://gerrit.wikimedia.org/r/1199827 (https://phabricator.wikimedia.org/T406592) (owner: 10Dzahn) [16:53:45] (03CR) 10Elukey: topic: add dpogorzelski to ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199810 (https://phabricator.wikimedia.org/T408702) (owner: 10Dpogorzelski) [16:54:34] 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11324247 (10elukey) a:05DPogorzelski-WMF→03None [16:55:34] (03CR) 10Marostegui: "jcrespo, you ok with this change?" [puppet] - 10https://gerrit.wikimedia.org/r/1199541 (https://phabricator.wikimedia.org/T408662) (owner: 10Marostegui) [17:00:05] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1700). [17:00:12] o/ [17:00:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T407997)', diff saved to https://phabricator.wikimedia.org/P84374 and previous config saved to /var/cache/conftool/dbconfig/20251029-170039-marostegui.json [17:00:47] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:00:50] (03CR) 10Scott French: "Thank you both for the reviews!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199513 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:00:54] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): scale next releases to 20% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199513 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:02:48] (03Merged) 10jenkins-bot: mw-(api-ext|web): scale next releases to 20% of main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199513 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:03:43] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:04:21] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:04:37] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:05:04] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:05:20] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:07:23] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:07:25] (03PS1) 10Clément Goubert: Revert "Route "/api/rest_v1/?spec" requests to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1199830 [17:07:38] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:07:48] (03CR) 10Dzahn: [C:03+2] "double checked out of caution - user had never logged in yet" [puppet] - 10https://gerrit.wikimedia.org/r/1199827 (https://phabricator.wikimedia.org/T406592) (owner: 10Dzahn) [17:07:49] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: Raise default resources of for opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199818 
(https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [17:08:09] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:08:24] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:09:09] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11324324 (10Dzahn) 05Open→03Resolved [17:10:31] (03CR) 10Clément Goubert: [C:03+2] Revert "Route "/api/rest_v1/?spec" requests to the rest gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1199830 (owner: 10Clément Goubert) [17:11:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199515 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:12:49] (03Merged) 10jenkins-bot: Enroll 25% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199515 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:13:20] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1199515|Enroll 25% of client sessions in PHP 8.3 (T405955)]] [17:13:25] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:15:01] (03Merged) 10jenkins-bot: dse-k8s-eqiad: Raise default resources of for opensearch namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199818 (https://phabricator.wikimedia.org/T357753) (owner: 10Bking) [17:15:44] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1199515|Enroll 25% of client sessions in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[17:15:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P84375 and previous config saved to /var/cache/conftool/dbconfig/20251029-171547-marostegui.json [17:19:07] !log swfrench@deploy2002 swfrench: Continuing with sync [17:23:28] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199515|Enroll 25% of client sessions in PHP 8.3 (T405955)]] (duration: 10m 08s) [17:23:33] T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955 [17:24:48] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:25:37] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:27:06] (03PS9) 10Krinkle: varnish: Remove temporary enable_m_redir flag [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) [17:29:39] 10SRE-swift-storage, 06Data-Persistence, 10MediaViewer, 10Thumbor, 06Traffic: Compile a list of "canonical" thumbnail sizes - https://phabricator.wikimedia.org/T408715#11324423 (10MatthewVernon) [17:29:54] !log upgrade envoy on phab2002 [17:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:37] (03CR) 10Scott French: [C:03+2] mw-(api-int|jobrunner): serve 10% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199514 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:30:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P84376 and previous config saved to /var/cache/conftool/dbconfig/20251029-173055-marostegui.json [17:31:16] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:31:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:32:13] 06SRE, 10SRE-Access-Requests: Requesting access to 'deployment' for seanleong-wmde - https://phabricator.wikimedia.org/T406592#11324442 (10seanleong-WMDE) Thanks @Dzahn! [17:32:22] (03Merged) 10jenkins-bot: mw-(api-int|jobrunner): serve 10% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199514 (https://phabricator.wikimedia.org/T405955) (owner: 10Scott French) [17:33:06] (03CR) 10BryanDavis: [C:03+1] perl540: Install libnet-idn-encode-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1199750 (https://phabricator.wikimedia.org/T407707) (owner: 10Majavah) [17:36:48] !log upgrade envoy on phab2002, vrts2002, contint2002 T405808 [17:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:53] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [17:37:37] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:37:52] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:38:11] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:38:27] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:38:44] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:38:44] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:39:02] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:39:49] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:39:59] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:41:01] !log swfrench@deploy2002 
helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:41:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:41:15] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:41:26] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:41:34] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:42:03] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:42:14] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:42:27] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:42:34] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:42:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [17:42:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [17:46:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T407997)', diff saved to https://phabricator.wikimedia.org/P84377 and previous config saved to /var/cache/conftool/dbconfig/20251029-174602-marostegui.json [17:46:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [17:46:09] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:46:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2169 (T407997)', diff saved to https://phabricator.wikimedia.org/P84378 and previous config saved to 
/var/cache/conftool/dbconfig/20251029-174616-marostegui.json [17:51:14] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11324508 (10VRiley-WMF) Removed all RAM from unit (except 1) to see if it would boot. Found that it did boot normally. I'm slowly adding more RAM to find... [17:52:16] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [17:52:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T407997)', diff saved to https://phabricator.wikimedia.org/P84379 and previous config saved to /var/cache/conftool/dbconfig/20251029-175238-marostegui.json [17:52:46] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [17:52:48] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:58:17] (03CR) 10Brouberol: [C:04-1] Deploy airflow images from airflow-dags repository build (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [18:00:04] dduvall and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T1800). 
[18:07:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P84380 and previous config saved to /var/cache/conftool/dbconfig/20251029-180746-marostegui.json [18:07:58] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199833 (https://phabricator.wikimedia.org/T405681) [18:08:03] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199833 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [18:08:43] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:08:53] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199833 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [18:11:47] !log deploying refinery source as part of deployment train. 
[18:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:15:51] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [18:17:39] !log krinkle@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [18:17:44] !log krinkle@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [18:20:36] 06SRE, 10SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11324627 (10Dzahn) Hi @elukey in the spirit of the discussion over at T408579, do we know what type of `analytics-privatedata-users` is needed here? [18:21:54] (03PS1) 10Scott French: Enroll 50% of client sessions in PHP 8.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199836 (https://phabricator.wikimedia.org/T405955) [18:21:55] (03PS1) 10Scott French: mw-(api-int|jobrunner): serve 25% of traffic on PHP 8.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199837 (https://phabricator.wikimedia.org/T405955) [18:22:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P84381 and previous config saved to /var/cache/conftool/dbconfig/20251029-182253-marostegui.json [18:23:03] 06SRE, 10SRE-Access-Requests, 06Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11324633 (10Dzahn) @elukey Does this now need approval from Mark? It seems to me it does because he is the group owner of ops. 
[18:23:12] !log rolling back 1.45.0-wmf.25 from group1 due to spike in `PHP Deprecated: Deprecated cross-wiki access to MediaWiki\Revision\RevisionRecord` errors (T408525) (cc T408525) [18:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:17] T408525: PHP Deprecated: Deprecated cross-wiki access to MediaWiki\Revision\RevisionRecord. Expected: 'testwikidatawiki', Actual: the local wiki. Pass expected $wikiId. [Called from MediaWiki\Revision\RevisionRecord::getId] - https://phabricator.wikimedia.org/T408525 [18:23:28] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199838 (https://phabricator.wikimedia.org/T405681) [18:23:34] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199838 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [18:24:24] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199838 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [18:25:12] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [18:26:52] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [18:28:43] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:31:02] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.25 refs T405681 [18:31:08] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [18:35:16] !log gitlab1003 systemctl start backup-restore T408705 [18:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:26] T408705: SystemdUnitFailed - 
backup-restore.service on gitlab1003:9100 - https://phabricator.wikimedia.org/T408705 [18:37:31] (03CR) 10Majavah: [C:03+2] perl540: Install libnet-idn-encode-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1199750 (https://phabricator.wikimedia.org/T407707) (owner: 10Majavah) [18:38:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T407997)', diff saved to https://phabricator.wikimedia.org/P84383 and previous config saved to /var/cache/conftool/dbconfig/20251029-183802-marostegui.json [18:38:05] (03Merged) 10jenkins-bot: perl540: Install libnet-idn-encode-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1199750 (https://phabricator.wikimedia.org/T407707) (owner: 10Majavah) [18:38:10] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [18:38:20] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [18:38:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2180 (T407997)', diff saved to https://phabricator.wikimedia.org/P84384 and previous config saved to /var/cache/conftool/dbconfig/20251029-183827-marostegui.json [18:40:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T407997)', diff saved to https://phabricator.wikimedia.org/P84385 and previous config saved to /var/cache/conftool/dbconfig/20251029-184039-marostegui.json [18:41:44] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:41:50] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:45:07] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1196543 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [18:46:07] !log disable-puppet on A:cp hosts for haproxy config change [18:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:46] (03CR) 10Scott French: [C:03+2] P:cache::haproxy: introduce known-client DSL fragment [puppet] - 10https://gerrit.wikimedia.org/r/1196543 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [18:55:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P84386 and previous config saved to /var/cache/conftool/dbconfig/20251029-185547-marostegui.json [18:58:20] 06SRE, 06Traffic, 05FY2025-26 WE3.3 Engaging core audiences, 06Reader Experience Team (REx Sprint 8 [Q2 Oct 21-Nov 3]): [Reading Lists] Monitor potential performance impact of Reading Lists for Web - https://phabricator.wikimedia.org/T397526#11324749 (10SToyofuku-WMF) DB config: https://noc.wikimedia.org/d... 
[19:00:18] (03CR) 10BCornwall: [V:03+2 C:03+2] varnish: Remove temporary enable_m_redir flag [puppet] - 10https://gerrit.wikimedia.org/r/1198430 (https://phabricator.wikimedia.org/T405931) (owner: 10Krinkle) [19:00:24] !log dancy@deploy2002 Installing scap version "4.220.0" for 2 host(s) [19:02:17] !log dancy@deploy2002 Installation of scap version "4.220.0" completed for 2 hosts [19:03:56] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:07:24] !log dancy@deploy2002 Started scap sync-world: Testing scap 4.22.0 [19:08:43] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:08:44] FIRING: [3x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:09:37] !log rolling run-puppet-agent on A:cp hosts for haproxy config change [19:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:56] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf group for jpchev - https://phabricator.wikimedia.org/T408636#11324782 (10Jpchev) ok thank you, I'll have a look at https://superset.wmcloud.org/login/ but I can't connect with a mediawiki account [19:10:54] !log dancy@deploy2002 Finished scap sync-world: Testing scap 4.22.0 (duration: 03m 30s) [19:10:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P84387 and previous config saved 
to /var/cache/conftool/dbconfig/20251029-191055-marostegui.json [19:12:09] !log 'homer on multiple lsw1-*-codfw* 'T390859'' [19:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:14] T390859: wikikube-worker2[248-331] implementation tracking - https://phabricator.wikimedia.org/T390859 [19:16:32] !log Deployed refinery-source [19:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:27] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf group for jpchev - https://phabricator.wikimedia.org/T408636#11324807 (10taavi) 05Open→03Declined [19:20:40] (03CR) 10Jasmine: [C:03+2] wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [19:22:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission kafka-jumbo100[6-7].eqiad.wmnet - https://phabricator.wikimedia.org/T404413#11324822 (10Jclark-ctr) [19:23:13] (03PS1) 10Marco Fossati: Style adjustments [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199844 (https://phabricator.wikimedia.org/T408618) [19:23:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199844 (https://phabricator.wikimedia.org/T408618) (owner: 10Marco Fossati) [19:26:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T407997)', diff saved to https://phabricator.wikimedia.org/P84388 and previous config saved to /var/cache/conftool/dbconfig/20251029-192603-marostegui.json [19:26:09] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [19:26:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [19:26:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2193 (T407997)', diff saved to https://phabricator.wikimedia.org/P84389 and previous config saved to /var/cache/conftool/dbconfig/20251029-192627-marostegui.json [19:27:45] FIRING: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [19:28:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T407997)', diff saved to https://phabricator.wikimedia.org/P84390 and previous config saved to /var/cache/conftool/dbconfig/20251029-192839-marostegui.json [19:30:03] (03CR) 10Eric Gardner: [C:03+1] Localisation updates from https://translatewiki.net. [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199814 (owner: 10Marco Fossati) [19:30:36] (03CR) 10Eric Gardner: [C:03+1] Style adjustments [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199844 (https://phabricator.wikimedia.org/T408618) (owner: 10Marco Fossati) [19:33:44] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:43:17] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf group for jpchev - https://phabricator.wikimedia.org/T408636#11324963 (10Dzahn) @Jpchev Hello, what you'd want is to request volunteer access under NDA. 
Please see https://wikitech.wikimedia.org/wiki/Volunteer_NDA [19:43:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P84391 and previous config saved to /var/cache/conftool/dbconfig/20251029-194347-marostegui.json [19:47:04] (03PS1) 10Marco Fossati: Capture more captions [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199847 [19:48:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199847 (owner: 10Marco Fossati) [19:52:48] (03CR) 10Eric Gardner: [C:03+1] Capture more captions [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199847 (owner: 10Marco Fossati) [19:53:47] (03PS1) 10Kamila Součková: k8s::cluster_config: Update max number of hosts [puppet] - 10https://gerrit.wikimedia.org/r/1199848 (https://phabricator.wikimedia.org/T375845) [19:55:39] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199848 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [19:58:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P84392 and previous config saved to /var/cache/conftool/dbconfig/20251029-195855-marostegui.json [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T2000). [20:00:05] anzx and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] o/ [20:00:35] o/ [20:01:08] i have some patches that i still need to add to the queue - just waiting for master change to merge [20:01:57] anzx: do you need a deployer? [20:02:07] cjming: yes [20:02:23] ok - i can deploy those for you - any that can go out together? [20:02:57] yes it can be deployed together [20:03:06] all 4? [20:03:10] yes [20:03:16] alrighty [20:03:36] (03PS4) 10Anzx: minwikisource: add portal namespace, set sitename, timezone and project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199281 [20:03:43] (03PS5) 10Anzx: pcmwikiquote: set timezone, sitename and projectnamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199298 (https://phabricator.wikimedia.org/T408351) [20:03:47] (03PS3) 10Anzx: pcmwikiquote: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199768 (https://phabricator.wikimedia.org/T408351) [20:04:32] (03PS2) 10Anzx: minwikisource: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199774 (https://phabricator.wikimedia.org/T408343) [20:05:53] (03PS1) 10Gehel: WDQS: remove ferm rule for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1199849 (https://phabricator.wikimedia.org/T408736) [20:07:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199281 (owner: 10Anzx) [20:07:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199298 (https://phabricator.wikimedia.org/T408351) (owner: 10Anzx) [20:07:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199768 (https://phabricator.wikimedia.org/T408351) (owner: 10Anzx) [20:07:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199774 (https://phabricator.wikimedia.org/T408343) (owner: 10Anzx) [20:07:53] (03Merged) 10jenkins-bot: minwikisource: add portal namespace, set sitename, timezone and project namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199281 (owner: 10Anzx) [20:08:05] (03Merged) 10jenkins-bot: pcmwikiquote: set timezone, sitename and projectnamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199298 (https://phabricator.wikimedia.org/T408351) (owner: 10Anzx) [20:08:09] (03Merged) 10jenkins-bot: pcmwikiquote: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199768 (https://phabricator.wikimedia.org/T408351) (owner: 10Anzx) [20:08:12] (03Merged) 10jenkins-bot: minwikisource: add logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199774 (https://phabricator.wikimedia.org/T408343) (owner: 10Anzx) [20:08:43] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1199281|minwikisource: add portal namespace, set sitename, timezone and project namespace]], [[gerrit:1199298|pcmwikiquote: set timezone, sitename and projectnamespace (T408351)]], [[gerrit:1199768|pcmwikiquote: add logos (T408351)]], [[gerrit:1199774|minwikisource: add logos (T408343)]] [20:08:56] T408351: Post-creation work for pcmwikiquote - https://phabricator.wikimedia.org/T408351 [20:08:58] T408343: Post-creation work for minwikisource - https://phabricator.wikimedia.org/T408343 [20:11:08] (03PS1) 10Andrew Bogott: dbutils::statement: add option to --skip-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1199850 [20:11:08] (03PS1) 10Andrew Bogott: pdns_server::db_backups: --skip-ssl for db setup commands [puppet] - 10https://gerrit.wikimedia.org/r/1199851 [20:11:31] !log cjming@deploy2002 anzx, cjming: Backport for [[gerrit:1199281|minwikisource: add portal namespace, set sitename, timezone and project namespace]], [[gerrit:1199298|pcmwikiquote: set timezone, sitename and projectnamespace 
(T408351)]], [[gerrit:1199768|pcmwikiquote: add logos (T408351)]], [[gerrit:1199774|minwikisource: add logos (T408343)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). [20:11:31] Changes can now be verified there. [20:11:37] (03CR) 10CI reject: [V:04-1] dbutils::statement: add option to --skip-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1199850 (owner: 10Andrew Bogott) [20:11:56] anzx: lmk when to sync [20:12:10] (03PS2) 10Andrew Bogott: pdns_server::db_backups: --skip-ssl for db setup commands [puppet] - 10https://gerrit.wikimedia.org/r/1199851 [20:12:51] (03PS2) 10Andrew Bogott: dbutils::statement: add option to --skip-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1199850 [20:12:51] (03PS3) 10Andrew Bogott: pdns_server::db_backups: --skip-ssl for db setup commands [puppet] - 10https://gerrit.wikimedia.org/r/1199851 [20:12:54] (03CR) 10Scott French: [C:03+1] k8s::cluster_config: Update max number of hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199848 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [20:13:02] cjming: all changes look good to sync [20:13:10] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199851 (owner: 10Andrew Bogott) [20:13:17] great !
syncing [20:13:22] !log cjming@deploy2002 anzx, cjming: Continuing with sync [20:13:24] (03CR) 10CI reject: [V:04-1] dbutils::statement: add option to --skip-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1199850 (owner: 10Andrew Bogott) [20:14:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T407997)', diff saved to https://phabricator.wikimedia.org/P84393 and previous config saved to /var/cache/conftool/dbconfig/20251029-201406-marostegui.json [20:14:15] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:14:23] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [20:15:02] (03PS3) 10Andrew Bogott: dbutils::statement: add option to --skip-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1199850 [20:15:02] (03PS4) 10Andrew Bogott: pdns_server::db_backups: --skip-ssl for db setup commands [puppet] - 10https://gerrit.wikimedia.org/r/1199851 [20:15:14] (03PS2) 10Kamila Součková: k8s::cluster_config: Update max number of hosts [puppet] - 10https://gerrit.wikimedia.org/r/1199848 (https://phabricator.wikimedia.org/T375845) [20:15:32] (03CR) 10CI reject: [V:04-1] dbutils::statement: add option to --skip-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1199850 (owner: 10Andrew Bogott) [20:16:03] (03CR) 10Kamila Součková: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1199848 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [20:16:18] (03CR) 10Kamila Součková: [C:03+2] k8s::cluster_config: Update max number of hosts [puppet] - 10https://gerrit.wikimedia.org/r/1199848 (https://phabricator.wikimedia.org/T375845) (owner: 10Kamila Součková) [20:16:32] (03PS4) 10Andrew Bogott: dbutils::statement: add option to --skip-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1199850 [20:16:32] (03PS5) 10Andrew Bogott: pdns_server::db_backups: --skip-ssl for db setup commands [puppet] - 10https://gerrit.wikimedia.org/r/1199851 [20:17:10] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199851 (owner: 10Andrew Bogott) [20:17:40] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199281|minwikisource: add portal namespace, set sitename, timezone and project namespace]], [[gerrit:1199298|pcmwikiquote: set timezone, sitename and projectnamespace (T408351)]], [[gerrit:1199768|pcmwikiquote: add logos (T408351)]], [[gerrit:1199774|minwikisource: add logos (T408343)]] (duration: 08m 57s) [20:17:47] T408351: Post-creation work for pcmwikiquote - https://phabricator.wikimedia.org/T408351 [20:17:47] T408343: Post-creation work for minwikisource - https://phabricator.wikimedia.org/T408343 [20:17:49] (03CR) 10Brouberol: [C:03+1] WDQS: remove ferm rule for port 80 [puppet] - 10https://gerrit.wikimedia.org/r/1199849 (https://phabricator.wikimedia.org/T408736) (owner: 10Gehel) [20:18:20] anzx: should be live! 
i need to run namespaces dupes i think [20:18:46] cjming: yes [20:19:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2217.codfw.wmnet with reason: Maintenance [20:19:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2217 (T407997)', diff saved to https://phabricator.wikimedia.org/P84394 and previous config saved to /var/cache/conftool/dbconfig/20251029-201958-marostegui.json [20:20:03] ok 1 sec [20:20:04] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:23:07] !log cjming@deploy2002 mwscript-k8s job started: namespaceDupes minwikisource --fix # T408343 [20:23:12] T408343: Post-creation work for minwikisource - https://phabricator.wikimedia.org/T408343 [20:23:16] cjming: I'm around to do my deploy [20:23:56] arlolra: great ! let me run one more script and i'll pass over to you - 2 secs [20:24:04] no rush [20:24:21] FIRING: [7x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:25:00] !log cjming@deploy2002 mwscript-k8s job started: namespaceDupes pcmwikiquote --fix # T408351 [20:25:05] T408351: Post-creation work for pcmwikiquote - https://phabricator.wikimedia.org/T408351 [20:25:29] arlolra: ok - all yours - can you lmk when you're done? i still need to add my backports to the queue [20:25:31] cjming: thank you for deploying [20:25:41] will do [20:25:53] anzx: yw! 
i ran the scripts too - so you should be all good [20:26:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T407997)', diff saved to https://phabricator.wikimedia.org/P84395 and previous config saved to /var/cache/conftool/dbconfig/20251029-202605-marostegui.json [20:26:10] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [20:26:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: 10Arlolra) [20:27:31] (03Merged) 10jenkins-bot: ExtensionDistributor: Mark 1.45 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: 10Arlolra) [20:28:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199849 (https://phabricator.wikimedia.org/T408736) (owner: 10Gehel) [20:28:04] !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1199113|ExtensionDistributor: Mark 1.45 as beta (T408466)]] [20:28:09] T408466: Add REL1_45 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T408466 [20:28:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2258:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2258 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:28:43] FIRING: [20x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:31:07] !log arlolra@deploy2002 arlolra: Backport for 
[[gerrit:1199113|ExtensionDistributor: Mark 1.45 as beta (T408466)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:31:31] !log arlolra@deploy2002 arlolra: Continuing with sync [20:31:38] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11325146 (10Dzahn) [20:32:04] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11325149 (10Dzahn) All working now except 3002. [20:32:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host tcp-proxy3002.esams.wmnet with OS trixie [20:32:44] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11325151 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host... 
[20:33:15] jouncebot: now and next [20:33:16] For the next 0 hour(s) and 26 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T2000) [20:33:40] FIRING: [15x] KubernetesRsyslogDown: rsyslog on wikikube-worker2253:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:33:43] FIRING: [35x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:21] FIRING: [38x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:36] (03PS7) 10Brouberol: Deploy airflow images from airflow-dags repository build [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: 10Stevemunene) [20:35:24] RECOVERY - Host ms-be1090 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [20:36:00] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11325158 (10VRiley-WMF) After reseating the RAM, it seems like everything has come back up and it's showing a healthy status. @MatthewVernon Can you please...
[20:36:56] !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199113|ExtensionDistributor: Mark 1.45 as beta (T408466)]] (duration: 08m 51s) [20:37:00] T408466: Add REL1_45 to ExtensionDistributor as the development snapshot - https://phabricator.wikimedia.org/T408466 [20:37:18] cjming: back to you [20:37:33] ty! [20:38:40] FIRING: [32x] KubernetesRsyslogDown: rsyslog on wikikube-worker2249:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:38:48] FIRING: [47x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:39:33] (03PS1) 10D3r1ck01: Stats: add getLabels() function [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199854 (https://phabricator.wikimedia.org/T406170) [20:39:36] (03PS1) 10Dduvall: EntitySourceDefinitions: use false as DB name if pointing to current wiki [extensions/Wikibase] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199855 (https://phabricator.wikimedia.org/T408525) [20:40:09] cjming: are you still running the backport window or is it all clear? [20:40:09] (03PS1) 10D3r1ck01: Stats: have RunningTimer manage the initial label set [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199856 (https://phabricator.wikimedia.org/T406170) [20:40:49] there's a fix for a train blocker that i would like to backport/deploy for wmf.25 if possible [20:41:11] dduvall: yes - but please go ahead - can i still deploy my backports after you? 
[20:41:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P84397 and previous config saved to /var/cache/conftool/dbconfig/20251029-204113-marostegui.json [20:41:45] cjming: ah ok. i'm not quite ready so please continue and i'll find a time for mine. ty [20:42:21] ok - thanks! [20:43:40] FIRING: [43x] KubernetesRsyslogDown: rsyslog on wikikube-worker2248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:43:44] FIRING: [60x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:44:21] FIRING: [61x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:46:00] i'm also waiting for a master change to finish merging so i can create the backports - should be any minute now [20:47:30] everyone waits on CI :) [20:48:40] FIRING: [57x] KubernetesRsyslogDown: rsyslog on wikikube-worker2248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:48:44] FIRING: [65x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:21] FIRING: [77x] SystemdUnitFailed: 
docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:50:57] 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Requesting Kerberos access for Jmoore111 - https://phabricator.wikimedia.org/T408165#11325192 (10Dzahn) [20:51:39] 22 minutes 😵‍💫 [20:51:46] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11325194 (10Dzahn) a:03SherryYang-WMF [20:52:25] (03PS1) 10Santiago Faci: PHP client library: Fixed spelling for `mediawiki_database` [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199857 (https://phabricator.wikimedia.org/T408717) [20:52:32] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [20:52:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on wikikube-worker2264:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:52:50] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [20:52:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199857 (https://phabricator.wikimedia.org/T408717) (owner: 10Santiago Faci) [20:53:05] (03PS1) 10Santiago Faci: PHP client library: Fixed spelling for `mediawiki_database` [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 
10https://gerrit.wikimedia.org/r/1199858 (https://phabricator.wikimedia.org/T408717) [20:53:23] (03CR) 10Clare Ming: [C:03+1] PHP client library: Fixed spelling for `mediawiki_database` [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199857 (https://phabricator.wikimedia.org/T408717) (owner: 10Santiago Faci) [20:53:33] (03CR) 10Clare Ming: [C:03+1] PHP client library: Fixed spelling for `mediawiki_database` [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199858 (https://phabricator.wikimedia.org/T408717) (owner: 10Santiago Faci) [20:53:40] FIRING: [72x] KubernetesRsyslogDown: rsyslog on wikikube-worker2248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:53:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199858 (https://phabricator.wikimedia.org/T408717) (owner: 10Santiago Faci) [20:53:43] FIRING: [86x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, October 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199858 (https://phabricator.wikimedia.org/T408717) (owner: 10Santiago Faci) [20:54:21] FIRING: [88x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:55:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199857 (https://phabricator.wikimedia.org/T408717) (owner: 10Santiago Faci) [20:55:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199858 (https://phabricator.wikimedia.org/T408717) (owner: 10Santiago Faci) [20:56:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P84398 and previous config saved to /var/cache/conftool/dbconfig/20251029-205621-marostegui.json [20:57:40] FIRING: [41x] KubernetesRsyslogDown: rsyslog on wikikube-worker2248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:57:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [20:57:57] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [20:58:21] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on tcp-proxy3002.esams.wmnet with reason: host reimage [20:58:40] FIRING: [83x] KubernetesRsyslogDown: rsyslog on wikikube-worker2248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:58:43] FIRING: [88x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:21] FIRING: [88x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:36] (03PS1) 10TChin: [eventgate] Split alerts into global and per-site alerts [alerts] - 10https://gerrit.wikimedia.org/r/1199859 (https://phabricator.wikimedia.org/T405952) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T2100) [21:01:37] (03CR) 10Dzahn: [C:03+2] zookeeper: add support for TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [21:02:40] FIRING: [41x] KubernetesRsyslogDown: rsyslog on wikikube-worker2248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:03:40] FIRING: [83x] KubernetesRsyslogDown: rsyslog on wikikube-worker2248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:03:43] FIRING: [88x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:21] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [21:04:21] FIRING: [88x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:23] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on tcp-proxy3002.esams.wmnet with reason: host reimage [21:04:27] just finishing up late backport window [21:05:04] !log adding TLS support to zookeeper as a feature flag - no existing zookeeper server will change [21:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:06:16] (03PS2) 10TChin: [eventgate] Split alerts into global and per-site alerts [alerts] - 10https://gerrit.wikimedia.org/r/1199859 (https://phabricator.wikimedia.org/T405952) [21:07:00] (03Merged) 10jenkins-bot: PHP client library: Fixed spelling for `mediawiki_database` [extensions/EventLogging] (wmf/1.45.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1199857 (https://phabricator.wikimedia.org/T408717) (owner: 10Santiago Faci) [21:07:03] (03Merged) 10jenkins-bot: PHP client library: Fixed spelling for `mediawiki_database` [extensions/EventLogging] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199858 (https://phabricator.wikimedia.org/T408717) (owner: 10Santiago Faci) [21:07:36] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1199857|PHP 
client library: Fixed spelling for `mediawiki_database` (T408717)]], [[gerrit:1199858|PHP client library: Fixed spelling for `mediawiki_database` (T408717)]] [21:07:40] RESOLVED: [41x] KubernetesRsyslogDown: rsyslog on wikikube-worker2248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:07:42] T408717: [PHP client library] Fill mediawiki_database contextual attribute - https://phabricator.wikimedia.org/T408717 [21:08:40] FIRING: [83x] KubernetesRsyslogDown: rsyslog on wikikube-worker2248:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:08:43] FIRING: [88x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:55] (03PS2) 10Dzahn: zookeeper: replace legacy facts, fix lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/1197342 [21:09:21] FIRING: [87x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:10:12] !log cjming@deploy2002 cjming, sfaci: Backport for [[gerrit:1199857|PHP client library: Fixed spelling for `mediawiki_database` (T408717)]], [[gerrit:1199858|PHP client library: Fixed spelling for `mediawiki_database` (T408717)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:10:34] !log cjming@deploy2002 cjming, sfaci: Continuing with sync [21:10:56] (03PS3) 10Dzahn: zookeeper: replace legacy facts, fix lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/1197342 [21:11:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T407997)', diff saved to https://phabricator.wikimedia.org/P84399 and previous config saved to /var/cache/conftool/dbconfig/20251029-211129-marostegui.json [21:11:35] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:11:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2224.codfw.wmnet with reason: Maintenance [21:11:48] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:11:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2224 (T407997)', diff saved to https://phabricator.wikimedia.org/P84400 and previous config saved to /var/cache/conftool/dbconfig/20251029-211153-marostegui.json [21:11:59] (03PS1) 10Ryan Kemper: wdqs: fix bg exporter typo in geospatial reqs [puppet] - 10https://gerrit.wikimedia.org/r/1199861 [21:12:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:12:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:13:40] FIRING: [33x] KubernetesRsyslogDown: rsyslog on wikikube-worker2251:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:13:44] FIRING: [73x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 
- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:21] FIRING: [73x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:26] (03CR) 10Gehel: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1199861 (owner: 10Ryan Kemper) [21:14:43] (03CR) 10Bking: [C:03+2] wdqs: fix bg exporter typo in geospatial reqs [puppet] - 10https://gerrit.wikimedia.org/r/1199861 (owner: 10Ryan Kemper) [21:15:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749 (10Jhancock.wm) 03NEW [21:15:17] (03PS1) 10Tim Starling: recentchanges API result contains wrong entries with redirect: False [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199862 (https://phabricator.wikimedia.org/T408667) [21:16:06] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199857|PHP client library: Fixed spelling for `mediawiki_database` (T408717)]], [[gerrit:1199858|PHP client library: Fixed spelling for `mediawiki_database` (T408717)]] (duration: 08m 30s) [21:16:12] T408717: [PHP client library] Fill mediawiki_database contextual attribute - https://phabricator.wikimedia.org/T408717 [21:16:13] !log end of UTC late backport window [21:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749#11325280 (10Jhancock.wm) @Clement_Goubert could you (or someone on your team) please fill out the Hostname/racking details section and update the puppet files if 
needed? Thank you! [21:17:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749#11325282 (10Jhancock.wm) [21:17:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T407997)', diff saved to https://phabricator.wikimedia.org/P84401 and previous config saved to /var/cache/conftool/dbconfig/20251029-211752-marostegui.json [21:17:59] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [21:18:40] FIRING: [33x] KubernetesRsyslogDown: rsyslog on wikikube-worker2251:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:18:43] FIRING: [60x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:07] thanks for running backports, cjming. anyone using the wikifunctions services window or can i deploy a fix to help unblock train? 
[21:19:21] FIRING: [58x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:19:41] * dduvall gives folks a few minutes to answer before proceeding [21:22:06] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host tcp-proxy3002.esams.wmnet with OS trixie [21:22:29] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11325298 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-... [21:23:16] dduvall: can you backport 1199862 while you're at it? [21:23:40] RESOLVED: [33x] KubernetesRsyslogDown: rsyslog on wikikube-worker2251:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:23:43] FIRING: [51x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:05] TimStarling: sure thing [21:24:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dduvall@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199855 (https://phabricator.wikimedia.org/T408525) (owner: 10Dduvall) [21:24:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dduvall@deploy2002 using scap backport" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199862 (https://phabricator.wikimedia.org/T408667) (owner: 10Tim Starling) [21:25:54] RESOLVED: 
KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:26:33] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11325321 (10Dzahn) [21:27:04] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11325322 (10Dzahn) 05In progress→03Resolved verified all 14 VMs are up and can SSH to them [21:29:54] (03PS3) 10Jdlrobson: Deploy dark mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184564 (https://phabricator.wikimedia.org/T395628) [21:30:22] (03PS4) 10Jdlrobson: Deploy dark mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184564 (https://phabricator.wikimedia.org/T395628) [21:31:02] (03CR) 10Cwhite: [C:03+1] Stats: add getLabels() function [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199854 (https://phabricator.wikimedia.org/T406170) (owner: 10D3r1ck01) [21:31:06] (03CR) 10Cwhite: [C:03+1] Stats: have RunningTimer manage the initial label set [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199856 (https://phabricator.wikimedia.org/T406170) (owner: 10D3r1ck01) [21:33:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P84402 and previous config saved to /var/cache/conftool/dbconfig/20251029-213300-marostegui.json [21:34:27] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [21:35:17] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30029 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [21:36:30] (03Merged) 10jenkins-bot: EntitySourceDefinitions: use false as DB name if pointing to current wiki [extensions/Wikibase] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199855 (https://phabricator.wikimedia.org/T408525) (owner: 10Dduvall) [21:37:11] 10SRE-SLO, 10observability, 06Data-Platform-SRE (2025.10.17 - 2025.11.07), 07Essential-Work: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11325355 (10RKemper) @dcausse In this updated version of the SLI we don't want to count throttled requests as e... [21:38:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752 (10Jhancock.wm) 03NEW [21:39:54] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [21:39:57] (03Merged) 10jenkins-bot: recentchanges API result contains wrong entries with redirect: False [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199862 (https://phabricator.wikimedia.org/T408667) (owner: 10Tim Starling) [21:40:32] !log dduvall@deploy2002 Started scap sync-world: Backport for [[gerrit:1199855|EntitySourceDefinitions: use false as DB name if pointing to current wiki (T408525)]], [[gerrit:1199862|recentchanges API result contains wrong entries with redirect: False (T408667)]] [21:40:40] T408525: PHP Deprecated: Deprecated cross-wiki access to MediaWiki\Revision\RevisionRecord. Expected: 'testwikidatawiki', Actual: the local wiki. Pass expected $wikiId. 
[Called from MediaWiki\Revision\RevisionRecord::getId] - https://phabricator.wikimedia.org/T408525 [21:40:41] T408667: recentchanges API result contains wrong entries with redirect: False - https://phabricator.wikimedia.org/T408667 [21:42:12] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1197342/7514/" [puppet] - 10https://gerrit.wikimedia.org/r/1197342 (owner: 10Dzahn) [21:43:05] !log dduvall@deploy2002 tstarling, dduvall: Backport for [[gerrit:1199855|EntitySourceDefinitions: use false as DB name if pointing to current wiki (T408525)]], [[gerrit:1199862|recentchanges API result contains wrong entries with redirect: False (T408667)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:43:43] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:44:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:44:24] !log dduvall@deploy2002 tstarling, dduvall: Continuing with sync [21:44:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:45:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:47:16] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:47:17] confirmed our recentchanges bug fix with X-Wikimedia-Debug [21:47:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [21:48:10] !log marostegui@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P84403 and previous config saved to /var/cache/conftool/dbconfig/20251029-214808-marostegui.json [21:48:35] (03PS2) 10Əkrəm: azwiktionary: use new wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) [21:48:47] !log dduvall@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199855|EntitySourceDefinitions: use false as DB name if pointing to current wiki (T408525)]], [[gerrit:1199862|recentchanges API result contains wrong entries with redirect: False (T408667)]] (duration: 08m 15s) [21:48:57] T408525: PHP Deprecated: Deprecated cross-wiki access to MediaWiki\Revision\RevisionRecord. Expected: 'testwikidatawiki', Actual: the local wiki. Pass expected $wikiId. [Called from MediaWiki\Revision\RevisionRecord::getId] - https://phabricator.wikimedia.org/T408525 [21:48:57] T408667: recentchanges API result contains wrong entries with redirect: False - https://phabricator.wikimedia.org/T408667 [21:56:46] o/ i need to do a deploy shortly as part of web team window. Is anyone doing any deploy related stuff right now? 
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251029T2200) [22:02:10] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:02:36] (03CR) 10Anzx: azwiktionary: use new wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [22:02:49] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:02:57] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:02:59] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:03:01] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [22:03:09] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:03:15] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:03:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T407997)', diff saved to https://phabricator.wikimedia.org/P84404 and previous config saved to /var/cache/conftool/dbconfig/20251029-220317-marostegui.json [22:03:24] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [22:03:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2229.codfw.wmnet with reason: Maintenance [22:03:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2229 (T407997)', diff saved to https://phabricator.wikimedia.org/P84405 and previous config saved to /var/cache/conftool/dbconfig/20251029-220341-marostegui.json 
[22:04:02] okay starting deploy now [22:04:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184564 (https://phabricator.wikimedia.org/T395628) (owner: 10Jdlrobson) [22:05:16] (03Merged) 10jenkins-bot: Deploy dark mode everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1184564 (https://phabricator.wikimedia.org/T395628) (owner: 10Jdlrobson) [22:05:48] !log jdlrobson@deploy2002 Started scap sync-world: Backport for [[gerrit:1184564|Deploy dark mode everywhere (T395628)]] [22:05:53] T395628: Enable dark mode on all Wikimedia projects - https://phabricator.wikimedia.org/T395628 [22:07:57] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:08:01] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:08:01] (03PS3) 10Əkrəm: azwiktionary: use new wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) [22:08:18] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:08:20] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:08:35] !log jdlrobson@deploy2002 jdlrobson: Backport for [[gerrit:1184564|Deploy dark mode everywhere (T395628)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[22:08:48] (03CR) 10Əkrəm: azwiktionary: use new wordmark and tagline (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [22:08:50] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:08:58] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:09:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T407997)', diff saved to https://phabricator.wikimedia.org/P84406 and previous config saved to /var/cache/conftool/dbconfig/20251029-220937-marostegui.json [22:09:42] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997 [22:11:23] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:11:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:11:59] !log jdlrobson@deploy2002 jdlrobson: Continuing with sync [22:12:55] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:14:57] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:15:53] (03CR) 10Superpes15: azwiktionary: use new wordmark and tagline (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: 10Əkrəm) [22:16:19] !log jdlrobson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1184564|Deploy dark mode everywhere (T395628)]] (duration: 10m 30s) [22:16:24] T395628: Enable dark mode on all Wikimedia projects - https://phabricator.wikimedia.org/T395628 [22:16:52] (03PS2) 10Jdlrobson: Update QuickSurvey platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199482 
[22:17:21] (CR) Jdlrobson: [C:-2] "I will deploy this Week of 3rd." [mediawiki-config] - https://gerrit.wikimedia.org/r/1199482 (owner: Jdlrobson)
[22:18:00] ok done with window. thanks!
[22:21:19] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757 (Jhancock.wm) NEW
[22:23:13] (CR) Əkrəm: azwiktionary: use new wordmark and tagline (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm)
[22:23:39] ops-codfw, SRE, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker2332-56 - https://phabricator.wikimedia.org/T408757#11325585 (Jhancock.wm) @Clement_Goubert while i was making this racking task, i noticed that we can spread the servers out a little more. we do have 2 racks in 2 rows (4 to...
[22:24:08] (CR) Əkrəm: azwiktionary: use new wordmark and tagline (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm)
[22:24:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P84407 and previous config saved to /var/cache/conftool/dbconfig/20251029-222445-marostegui.json
[22:29:21] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:35:27] (PS4) Əkrəm: azwiktionary: use new wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147)
[22:37:48] (CR) Əkrəm: azwiktionary: use new wordmark and tagline (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm)
[22:38:19] (CR) Əkrəm: azwiktionary: use new wordmark and tagline (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm)
[22:38:43] FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:39:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229', diff saved to https://phabricator.wikimedia.org/P84408 and previous config saved to /var/cache/conftool/dbconfig/20251029-223952-marostegui.json
[22:44:49] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:54:32] ops-eqiad, SRE, DC-Ops, procurement, serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760 (Jhancock.wm) NEW
[22:55:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2229 (T407997)', diff saved to https://phabricator.wikimedia.org/P84409 and previous config saved to /var/cache/conftool/dbconfig/20251029-225501-marostegui.json
[22:55:14] T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
[23:09:21] FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:15:39] (CR) Aaron Schulz: Revert "Route "/api/rest_v1/?spec" requests to the rest gateway" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1199830 (owner: Clément Goubert)
[23:34:21] FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:42:37] SRE, Data-Engineering (Q2 FY25/26 October 1st - December 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#11325805 (amastilovic) > Do we want only Druid realtime configs its own repo? Perhaps we want the batch ones in the...