[00:00:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:25] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host tcp-proxy3002.esams.wmnet with OS trixie [00:06:37] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11316600 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host tcp-proxy3002.es... 
[00:12:17] (03PS1) 10Zabe: Initial configuration for minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199089 (https://phabricator.wikimedia.org/T408317) [00:12:58] (03PS1) 10Zabe: Initial configuration for pcmwikiqoute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408317) [00:13:23] (03PS2) 10Zabe: Initial configuration for pcmwikiqoute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408318) [00:13:51] jouncebot: nowandnext [00:13:52] No deployments scheduled for the next 1 hour(s) and 46 minute(s) [00:13:52] In 1 hour(s) and 46 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0200) [00:13:55] (03CR) 10Zabe: [C:03+2] Initial configuration for minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199089 (https://phabricator.wikimedia.org/T408317) (owner: 10Zabe) [00:14:21] (03CR) 10Zabe: [C:03+2] Initial configuration for pcmwikiqoute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408318) (owner: 10Zabe) [00:14:48] (03Merged) 10jenkins-bot: Initial configuration for minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199089 (https://phabricator.wikimedia.org/T408317) (owner: 10Zabe) [00:15:11] (03Merged) 10jenkins-bot: Initial configuration for pcmwikiqoute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199090 (https://phabricator.wikimedia.org/T408318) (owner: 10Zabe) [00:16:44] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1199090|Initial configuration for pcmwikiqoute (T408318)]], [[gerrit:1199089|Initial configuration for minwikisource (T408317)]] [00:16:53] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [00:16:54] T408317: Create Wikisource 
Minangkabau - https://phabricator.wikimedia.org/T408317 [00:20:04] (03PS1) 10Zabe: Activate minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199091 (https://phabricator.wikimedia.org/T408317) [00:20:33] (03PS1) 10Zabe: Activate pcmwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199092 (https://phabricator.wikimedia.org/T408318) [00:24:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:25:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:56] (03PS6) 10Scott French: P:cache::varnish::frontend: render known-client rate limit VCL [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) [00:34:02] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1198182 (https://phabricator.wikimedia.org/T403220) (owner: 10Scott French) [00:37:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:39:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1199093 [00:39:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1199093 (owner: 10TrainBranchBot) [00:42:42] !log zabe@deploy2002 zabe: Backport for [[gerrit:1199090|Initial configuration for pcmwikiqoute (T408318)]], 
[[gerrit:1199089|Initial configuration for minwikisource (T408317)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [00:42:48] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [00:42:48] T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317 [00:43:00] !log zabe@deploy2002 zabe: Continuing with sync [00:44:01] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 9.117 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:52:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [00:53:55] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 3.562 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [00:55:28] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:56:00] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1199093 (owner: 10TrainBranchBot) [00:57:20] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199090|Initial configuration for pcmwikiqoute (T408318)]], [[gerrit:1199089|Initial configuration for minwikisource (T408317)]] (duration: 40m 37s) [00:57:26] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [00:57:27] T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317 [00:58:47] (03CR) 10Zabe: [C:03+2] Activate 
minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199091 (https://phabricator.wikimedia.org/T408317) (owner: 10Zabe) [00:59:21] (03CR) 10Zabe: [C:03+2] Activate pcmwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199092 (https://phabricator.wikimedia.org/T408318) (owner: 10Zabe) [00:59:40] (03Merged) 10jenkins-bot: Activate minwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199091 (https://phabricator.wikimedia.org/T408317) (owner: 10Zabe) [01:00:09] (03Merged) 10jenkins-bot: Activate pcmwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199092 (https://phabricator.wikimedia.org/T408318) (owner: 10Zabe) [01:00:54] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:04:01] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199096 [01:04:01] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199096 (owner: 10Zabe) [01:04:55] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199096 (owner: 10Zabe) [01:08:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1199097 [01:08:13] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1199097 (owner: 10TrainBranchBot) [01:14:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:14] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 19s) [01:14:28] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1199092|Activate pcmwikisource (T408318)]], [[gerrit:1199091|Activate 
minwikisource (T408317)]], [[gerrit:1199096|Update interwiki cache]] [01:14:34] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [01:14:34] T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317 [01:15:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:16:31] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408446#11316916 (10Jclark-ctr) →14Duplicate dup:03T408359 [01:16:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11316918 (10Jclark-ctr) [01:18:42] !log zabe@deploy2002 zabe: Backport for [[gerrit:1199092|Activate pcmwikisource (T408318)]], [[gerrit:1199091|Activate minwikisource (T408317)]], [[gerrit:1199096|Update interwiki cache]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[01:22:32] !log zabe@deploy2002 zabe: Continuing with sync [01:23:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [01:23:57] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 4.728 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [01:30:52] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1199097 (owner: 10TrainBranchBot) [01:32:35] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199092|Activate pcmwikisource (T408318)]], [[gerrit:1199091|Activate minwikisource (T408317)]], [[gerrit:1199096|Update interwiki cache]] (duration: 18m 07s) [01:32:41] T408318: Create Wikiquote Nigerian Pidgin - https://phabricator.wikimedia.org/T408318 [01:32:41] T408317: Create Wikisource Minangkabau - https://phabricator.wikimedia.org/T408317 [01:33:39] zabe, pcmwikisource?? 
[01:33:57] no worries [01:34:03] I know it's pcmwikiquote [01:34:04] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:34:11] it's just the commit message that is wrong [01:34:13] ah, ok, good :) [01:39:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:49:04] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:50:37] (03PS1) 10Andrew Bogott: rabbitmq: rename config file on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1199100 (https://phabricator.wikimedia.org/T406516) [01:50:47] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199100 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [01:53:22] (03CR) 10Andrew Bogott: [C:03+2] rabbitmq: rename config file on Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1199100 (https://phabricator.wikimedia.org/T406516) (owner: 10Andrew Bogott) [01:54:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:55:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0200) [02:07:56] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.25 [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199103 (https://phabricator.wikimedia.org/T405681) [02:07:58] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.25 [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199103 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [02:17:59] PROBLEM - Host cloudrabbit2002-dev is DOWN: PING CRITICAL - Packet loss = 100% [02:19:29] RECOVERY - Host cloudrabbit2002-dev is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [02:20:28] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:23:43] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.25 [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199103 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [02:24:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:25:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0300) [03:02:39] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199109 (https://phabricator.wikimedia.org/T405681) [03:02:41] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199109 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [03:03:33] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199109 (https://phabricator.wikimedia.org/T405681) (owner: 10TrainBranchBot) [03:04:01] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.45.0-wmf.25 refs T405681 [03:04:06] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [03:14:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:15:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:20:57] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for gerrit-ssh-proxy - https://phabricator.wikimedia.org/T408064#11317298 (10Dzahn) [03:24:01] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request 
for tcp-proxy (gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11317299 (10Dzahn) [03:29:06] (03PS1) 10Arlolra: ExtensionDistributor: Mark 1.45 as beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) [03:30:28] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:37:53] (03PS1) 10C. Scott Ananian: Forward-compatibility: allow output flags to be serialized in `OutputFlags` [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199114 (https://phabricator.wikimedia.org/T292868) [03:38:26] (03CR) 10C. Scott Ananian: [C:03+2] "Backport patch to wmf.25 which just missed the cut." [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199114 (https://phabricator.wikimedia.org/T292868) (owner: 10C. Scott Ananian) [03:39:02] (03PS1) 10C. Scott Ananian: ParserOutput: Add deprecation warnings for ParserOutput::getLanguageLinks() [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199115 [03:39:12] (03CR) 10C. Scott Ananian: [C:03+2] "Backport patch to wmf.25 which just missed the cut." [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199115 (owner: 10C. Scott Ananian) [03:39:45] (03PS1) 10C. Scott Ananian: Implement a DOM version of the DeduplicateStyles pass [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199116 (https://phabricator.wikimedia.org/T405929) [03:39:56] (03CR) 10C. Scott Ananian: [C:03+2] "Backport patch to wmf.25 which just missed the cut." [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199116 (https://phabricator.wikimedia.org/T405929) (owner: 10C. 
Scott Ananian) [03:44:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:45:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:51:51] !log mwpresync@deploy2002 Finished scap sync-world: testwikis to 1.45.0-wmf.25 refs T405681 (duration: 47m 50s) [03:51:55] T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681 [03:53:15] (03Merged) 10jenkins-bot: Forward-compatibility: allow output flags to be serialized in `OutputFlags` [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199114 (https://phabricator.wikimedia.org/T292868) (owner: 10C. Scott Ananian) [03:55:43] (03Merged) 10jenkins-bot: ParserOutput: Add deprecation warnings for ParserOutput::getLanguageLinks() [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199115 (owner: 10C. Scott Ananian) [03:55:47] (03Merged) 10jenkins-bot: Implement a DOM version of the DeduplicateStyles pass [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199116 (https://phabricator.wikimedia.org/T405929) (owner: 10C. Scott Ananian) [04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0400) [04:02:40] !log mwpresync@deploy2002 Pruned MediaWiki: 1.45.0-wmf.22 (duration: 02m 38s) [04:29:08] (03PS1) 10C. 
Scott Ananian: ParserOutput: 'ParseUsedOptions' need not be present in serialized form [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199117 [04:29:49] (03CR) 10C. Scott Ananian: [C:03+2] "Pull late patch into the branch cut." [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199117 (owner: 10C. Scott Ananian) [04:30:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:38:26] (03PS1) 10C. Scott Ananian: Expose the list of behavior switch magic words to Parsoid [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199118 (https://phabricator.wikimedia.org/T407290) [04:39:15] (03CR) 10C. Scott Ananian: [C:03+2] "Late patch onto the train" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199118 (https://phabricator.wikimedia.org/T407290) (owner: 10C. Scott Ananian) [04:43:39] (03Merged) 10jenkins-bot: ParserOutput: 'ParseUsedOptions' need not be present in serialized form [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199117 (owner: 10C. 
Scott Ananian) [04:45:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:49:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:54:38] (03Merged) 10jenkins-bot: Expose the list of behavior switch magic words to Parsoid [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199118 (https://phabricator.wikimedia.org/T407290) (owner: 10C. Scott Ananian) [04:55:28] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:57:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:04:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:04:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:05:53] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30030 bytes in 0.587 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:09:04] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:15:01] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [05:18:53] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 30031 bytes in 1.421 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [05:34:04] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:39:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:50:28] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0600). [06:03:42] (03CR) 10Krinkle: ExtensionDistributor: Mark 1.45 as beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199113 (https://phabricator.wikimedia.org/T408466) (owner: 10Arlolra) [06:05:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:09:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:16:48] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510 (10Papaul) 03NEW [06:20:28] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:43:12] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511 (10Papaul) 03NEW [06:43:42] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: switch refresh - https://phabricator.wikimedia.org/T408510#11317386 (10Papaul) p:05Triage→03Medium [06:43:54] 10ops-ulsfo, 06DC-Ops, 06Infrastructure-Foundations, 
10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11317387 (10Papaul) p:05Triage→03Medium [06:44:56] !log marostegui@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis pcmwikiquote in section s5 [06:53:41] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis pcmwikiquote in section s5 [06:54:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:54:43] !log marostegui@cumin1003 START - Cookbook sre.mysql.sanitize-wiki Managing sanitization for wikis minwikisource in section s5 [06:55:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0700). nyaa~ [07:00:05] sefehpisikler: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:01:32] marostegui@cumin1003 sanitize-wiki (PID 343895) is awaiting input [07:10:45] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.sanitize-wiki (exit_code=0) Managing sanitization for wikis minwikisource in section s5 [07:30:28] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:43:11] !log Deploy schema change on the master x1 T407587 [07:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:15] T407587: Apply ce_event_contributions schema changes in production (x1) - https://phabricator.wikimedia.org/T407587 [07:43:35] (03PS1) 10Muehlenhoff: Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1199225 [07:44:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:47:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [07:47:54] marostegui: I'd like to create database tables in x1 for two wikis for the above config patch, can you check the command I am going to run? 
[07:49:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:28] jouncebot: nowandnext [07:50:28] For the next 0 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T0700) [07:50:28] In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1000) [07:50:45] also, marostegui are you done deploying? [07:51:44] I'll take that as a "yes" [07:51:49] kostajh: Yeah, go for anything [07:51:53] You need :) [07:52:07] kostajh: Show me the command [07:52:52] marostegui: `php maintenance/mysql.php --cluster extension1 --wiki loginwiki ./extensions/CheckUser/schema/mysql/tables-virtual-checkuser-generated.sql` [07:53:41] kostajh: I guess that is correct. I guess you'd run another one for metawiki [07:54:21] yeah [07:54:26] ok, I will try it [07:55:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:56:00] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11317482 (10cmooney) @papaul looks good! Nothing jumping out at me as problematic in terms of the connectivity plan. I don't think it makes sense to use 40G tho... [07:56:02] marostegui: hm, mwscript sql.php has a `--wiki` and a `--wikidb` flag [07:56:12] should I specify both as `loginwiki`? 
[07:56:23] kostajh: I am not sure, I am not familiar with this procedure :( [07:56:27] just reading over `mwscript sql.php --help` [07:56:31] As we don't use it [07:56:39] (DBAs do not create tables in prod) [07:58:00] ok [07:58:10] it seems to have worked [07:58:41] I will deploy my config patch now [07:58:45] (03PS1) 10Brouberol: opensearch-operator: watch the 3 opensearch namespaces in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199226 (https://phabricator.wikimedia.org/T404874) [07:59:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [07:59:20] (03PS2) 10Brouberol: opensearch-operator: watch the 3 opensearch namespaces in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199226 (https://phabricator.wikimedia.org/T404874) [08:00:01] (03Merged) 10jenkins-bot: CheckUser: Enable SI on metawiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199026 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [08:01:04] (03CR) 10Slyngshede: [C:03+1] Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1199225 (owner: 10Muehlenhoff) [08:02:10] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199026|CheckUser: Enable SI on metawiki and loginwiki (T408428)]] [08:02:15] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [08:02:40] (03CR) 10Kosta Harlan: "For next time: could you please schedule this as a backport? 
It was unexpected to see this when I went to deploy a config patch this morni" [core] (wmf/1.45.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1199117 (owner: 10C. Scott Ananian) [08:02:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:04:16] (03CR) 10Muehlenhoff: [C:03+2] Failover idp.w.o [dns] - 10https://gerrit.wikimedia.org/r/1199225 (owner: 10Muehlenhoff) [08:04:24] !log jmm@dns1004 START - running authdns-update [08:05:11] !log jmm@dns1004 END - running authdns-update [08:07:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1019:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:11:12] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve2001 [08:11:14] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.powercycle (exit_code=99) for host ml-serve2001 [08:13:13] !log restarting blazegraph on wdqs1019 - free allocator decreasing - `sudo depool; sleep 30; sudo systemctl restart wdqs-blazegraph.service; sleep 30; sudo pool` [08:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:39] waiting on image building, which will probably take ~30 minutes [08:17:13] (03PS18) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [08:18:20] !log elukey@cumin2002
START - Cookbook sre.hosts.powercycle for host ml-serve2001 [08:18:27] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.powercycle (exit_code=99) for host ml-serve2001 [08:19:22] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7480/co" [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [08:21:56] (03PS19) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) [08:23:33] (03CR) 10Brouberol: [C:03+2] opensearch-operator: watch the 3 opensearch namespaces in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199226 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [08:23:56] (03CR) 10Jelto: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [08:24:55] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [08:25:54] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:26:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:27:48] (03PS7) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [08:28:07] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1199026|CheckUser: Enable SI on metawiki and loginwiki (T408428)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[08:28:12] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [08:28:38] !log installing openjdk-11 security updates [08:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:04] RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:29:38] testing [08:29:55] !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node depool for host ml-serve2001.codfw.wmnet [08:29:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host ml-serve2001.codfw.wmnet [08:33:09] !log kharlan@deploy2002 kharlan: Continuing with sync [08:34:06] (03PS1) 10Santiago Faci: xLab: Deploying v1.1.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) [08:34:53] (03PS1) 10Brouberol: opensearch-operator: add a separator between tenant role and rolebinding resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199230 (https://phabricator.wikimedia.org/T404874) [08:35:30] (03PS2) 10Santiago Faci: xLab: Deploying v1.1.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) [08:36:31] (03PS3) 10Santiago Faci: xLab: Deploying v1.1.0 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199228 (https://phabricator.wikimedia.org/T406729) [08:46:15] (03PS1) 10Kosta Harlan: hCaptcha: Enable on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) [08:49:07] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199026|CheckUser: 
Enable SI on metawiki and loginwiki (T408428)]] (duration: 46m 57s) [08:49:16] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [08:49:30] I'm going to sync another patch, unless someone else needs to deploy [08:49:36] jouncebot: nowandnext [08:49:36] No deployments scheduled for the next 1 hour(s) and 10 minute(s) [08:49:36] In 1 hour(s) and 10 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1000) [08:50:13] (03CR) 10Mszwarc: [C:03+1] hCaptcha: Enable on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [08:50:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [08:51:21] (03PS3) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) [08:51:33] (03PS8) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [08:51:38] (03Merged) 10jenkins-bot: hCaptcha: Enable on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199231 (https://phabricator.wikimedia.org/T408428) (owner: 10Kosta Harlan) [08:52:06] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1199231|hCaptcha: Enable on loginwiki (T408428)]] [08:53:11] (03PS9) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [08:53:38] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve2001 [08:53:52] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve2001 [08:54:47] (03CR) 10DCausse: [C:03+1] cirrus: Start near 
match A/B test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199054 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [08:55:27] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:55:28] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:56:31] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1199231|hCaptcha: Enable on loginwiki (T408428)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:56:50] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [08:56:55] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [08:57:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:58:26] (03CR) 10Brouberol: [C:03+2] opensearch-operator: add a separator between tenant role and rolebinding resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199230 (https://phabricator.wikimedia.org/T404874) (owner: 10Brouberol) [08:58:45] !log kharlan@deploy2002 kharlan: Continuing with sync [08:59:55] !log jmm@cumin2002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: OpenJDK security updates - jmm@cumin2002 [08:59:58] (03PS1) 10Gehel: Hadoop: Introduce tmpreaper to cleanup /tmp [puppet] - 10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) [09:02:01] (03CR) 10CI reject: [V:04-1] Hadoop: Introduce tmpreaper to cleanup /tmp [puppet] - 
10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:02:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [09:05:17] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:06:59] (03CR) 10Clément Goubert: [C:03+1] Route /page/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1199032 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [09:07:15] (03CR) 10Filippo Giunchedi: "> > Nice find! Yes I think that ought to work and cater for module unload too. And yes I think there shouldn't be too many modules." [puppet] - 10https://gerrit.wikimedia.org/r/1198155 (https://phabricator.wikimedia.org/T407726) (owner: 10JHathaway) [09:08:40] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199231|hCaptcha: Enable on loginwiki (T408428)]] (duration: 16m 35s) [09:08:45] T408428: Suggested investigations: Enable on Metawiki and Loginwiki - https://phabricator.wikimedia.org/T408428 [09:14:40] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:14:44] (03CR) 10Brouberol: Hadoop: Introduce tmpreaper to cleanup /tmp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:15:50] gehel: FYI these days systemd-tmpfiles has replaced tmpreaper, check out e.g. modules/icinga/manifests/init.pp [09:20:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: OpenJDK security updates - jmm@cumin2002 [09:20:28] godog: Oh, nice! I'm too old school! 
[09:21:56] nice indeed, one line config file and you're done [09:22:41] (03CR) 10Elukey: [C:03+2] Use Thanos rules for Pyrra error metrics for xLab [puppet] - 10https://gerrit.wikimedia.org/r/1199023 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [09:29:06] (03Abandoned) 10Gehel: Hadoop: Introduce tmpreaper to cleanup /tmp [puppet] - 10https://gerrit.wikimedia.org/r/1199233 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:30:52] (03PS1) 10Majavah: P:toolforge::k8s::haproxy: Use hourly logrotate [puppet] - 10https://gerrit.wikimedia.org/r/1199238 (https://phabricator.wikimedia.org/T408457) [09:30:56] (03CR) 10Elukey: LVS: Add druid-public-coordinator to service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [09:31:32] (03CR) 10Elukey: LVS: etcd data for druid-public-coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [09:34:13] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Roll-restart for Java security updates - klausman@cumin1003 [09:36:43] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [09:36:45] 06SRE, 10envoy, 06serviceops, 13Patch-For-Review: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11317841 (10LSobanski) Untagging #collaboration-services based on https://phabricator.wikimedia.org/T403663#11196043 [09:37:12] (03PS1) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [09:38:07] (03CR) 10Stevemunene: LVS: Add druid-public-coordinator to service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [09:38:27] (03CR) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on 
testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [09:39:32] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:39:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:47] !log cgoubert@cumin1003 START - Cookbook sre.dns.netbox [09:39:54] (03PS2) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [09:40:07] (03CR) 10Gehel: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:40:13] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:41:00] 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar), 05WMF-NDA: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532 (10LSobanski) 03NEW [09:41:29] (03CR) 10FNegri: [C:03+1] P:toolforge::k8s::haproxy: Use hourly logrotate [puppet] - 10https://gerrit.wikimedia.org/r/1199238 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [09:41:49] (03PS1) 10Majavah: aptrepo: Retire kubeadm/1.29 components [puppet] - 10https://gerrit.wikimedia.org/r/1199240 [09:41:50] (03PS1) 10Majavah: aptrepo: Import Kubeadm/1.31 packages [puppet] - 10https://gerrit.wikimedia.org/r/1199241 (https://phabricator.wikimedia.org/T372697) [09:41:58] (03CR) 10CI reject: [V:04-1] Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:42:05] (03CR) 10Majavah: [C:03+2] 
P:toolforge::k8s::haproxy: Use hourly logrotate [puppet] - 10https://gerrit.wikimedia.org/r/1199238 (https://phabricator.wikimedia.org/T408457) (owner: 10Majavah) [09:42:32] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:42:54] (03PS3) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [09:42:58] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1014.eqiad.wmnet [09:43:07] (03CR) 10Gehel: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:43:20] (03CR) 10Brouberol: Hadoop: cleanup /tmp with systemd::tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:43:35] (03CR) 10Brouberol: Hadoop: cleanup /tmp with systemd::tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:43:42] 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar), 05WMF-NDA: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11317892 (10LSobanski) p:05Triage→03High [09:43:59] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:44:21] (03PS1) 10Jelto: aptrepo::staging: add job to clear incoming folder [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) [09:44:21] 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar), 05WMF-NDA: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11317895 (10LSobanski) [09:44:22] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10vm-requests, 13Patch-For-Review: Site: 14 VMs request for tcp-proxy 
(gerrit-ssh-proxy) - https://phabricator.wikimedia.org/T408064#11317894 (10LSobanski) [09:44:27] (03CR) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:45:01] (03Abandoned) 10Brouberol: growthbook: remove all traces of mongoDB from the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1197589 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:45:30] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, two nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:45:48] (03CR) 10Stevemunene: [C:03+1] Definition of a ferretdb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:46:25] (03CR) 10Stevemunene: [C:03+1] ferretdb-growthbook: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:48:52] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1014.eqiad.wmnet [09:49:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:13] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:49:16] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198975 
(https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:49:17] (03CR) 10Brouberol: [C:03+2] postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:49:24] !log cgoubert@cumin1003 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet [09:49:25] (03CR) 10Brouberol: [C:03+2] Definition of a ferretdb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:49:27] (03CR) 10Brouberol: [C:03+2] ferretdb-growthbook: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:50:11] (03PS4) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [09:50:18] (03CR) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:50:28] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:51:14] (03Merged) 10jenkins-bot: cloudnative-pg-cluster: allow direct access to the DB when pooling is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198974 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:51:28] (03Merged) 10jenkins-bot: cloudnative-pg-cluster: set env vars disabling s3 security feature not implemented in radosgw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198975 (https://phabricator.wikimedia.org/T406578) (owner: 
10Brouberol) [09:51:42] (03Merged) 10jenkins-bot: postgresql-growthbook: define a custom PG image, libraries and post init SQL [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198514 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [09:51:52] (03Merged) 10jenkins-bot: Definition of a ferretdb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198977 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:51:54] (03Merged) 10jenkins-bot: ferretdb-growthbook: define helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1198978 (https://phabricator.wikimedia.org/T406579) (owner: 10Brouberol) [09:51:57] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Roll-restart for Java security updates - klausman@cumin1003 [09:52:15] !log klausman@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Roll-restart for Java security updates - klausman@cumin1003 [09:53:20] (03CR) 10Mark Bergsma: [C:03+1] admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [09:54:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:05] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops-limited for dpogorzelski - https://phabricator.wikimedia.org/T407955#11317933 (10mark) Approved in Gerrit! 
[09:54:07] (03PS2) 10Tiziano Fogli: nrpe2nodexp: use service description as alertname [puppet] - 10https://gerrit.wikimedia.org/r/1199242 (https://phabricator.wikimedia.org/T395446) [09:54:18] looking at that alert [09:55:27] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet [09:55:59] (03CR) 10Brouberol: [C:03+1] Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [09:59:57] (03CR) 10Elukey: LVS: Add druid-public-coordinator to service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1000) [10:01:34] (03CR) 10Stevemunene: LVS: etcd data for druid-public-coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:02:53] (03CR) 10Clément Goubert: wikikube: Add wikikube-worker2[248-330] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [10:03:44] (03PS2) 10Jelto: aptrepo::staging: add job to clear incoming folder [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) [10:03:53] (03CR) 10Clément Goubert: [C:03+2] taskgen: Update calico IPPool check [puppet] - 10https://gerrit.wikimedia.org/r/1191671 (https://phabricator.wikimedia.org/T375845) (owner: 10Clément Goubert) [10:05:20] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7482/co" [puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: 10Jelto) [10:05:28] FIRING: [2x]
SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:05:32] (03PS2) 10Daniel Kinzler: rest-gateway: Create metrics mapping for ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199008 (https://phabricator.wikimedia.org/T408183) [10:09:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:09:22] (03PS1) 10JavierMonton: Disable default user-agent collection. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) [10:09:37] FIRING: Failing Rate (Dashboard - Desktop & Mobile): - https://alerts.wikimedia.org/?q=alertname%3DFailing+Rate+%28Dashboard+-+Desktop+%26+Mobile%29 [10:10:00] !log klausman@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Roll-restart for Java security updates - klausman@cumin1003 [10:10:32] (03PS1) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [10:13:06] (03PS1) 10Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) [10:14:14] (03PS2) 10Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1199248 (https://phabricator.wikimedia.org/T376535) [10:14:21] (03PS3) 10Huei Tan: alertmanager: route Language and Product Localization team alerts [puppet] - 10https://gerrit.wikimedia.org/r/1199248 
(https://phabricator.wikimedia.org/T376535) [10:14:25] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407833#11318022 (10cmooney) 05Open→03Resolved I removed these additional sessions last week but got distracted and didn't come back to edi... [10:20:28] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:22:05] (03CR) 10Klausman: [C:03+1] admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [10:26:59] (03CR) 10Elukey: LVS: etcd data for druid-public-coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:28:47] (03CR) 10Hnowlan: [C:03+1] Route /page/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1199032 (https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [10:29:37] RESOLVED: Failing Rate (Dashboard - Desktop & Mobile): - https://alerts.wikimedia.org/?q=alertname%3DFailing+Rate+%28Dashboard+-+Desktop+%26+Mobile%29 [10:29:41] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group0 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198929 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:30:23] (03CR) 10Stevemunene: LVS: etcd data for druid-public-coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:30:51] (03CR) 10Clément Goubert: [C:03+2] Route /page/lint(.*) to the gateway on test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1199032 
(https://phabricator.wikimedia.org/T384216) (owner: 10Aaron Schulz) [10:32:14] (03CR) 10Fabfur: "as @Elukey correctly pointed out, the procedure needs to be followed here, happy to review it again later" [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [10:34:27] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [10:37:02] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11318126 (10elukey) [10:37:46] (03CR) 10Dpogorzelski: [C:03+1] admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [10:38:01] 10SRE-SLO, 06Experimentation Lab (Experiment Platform Sprint 14), 07OKR-Work: Create Pyrra SLOs for xLab - https://phabricator.wikimedia.org/T398869#11318132 (10elukey) We finally have all three SLO published in Pyrra: https://slo.wikimedia.org/?search=xlab Let's wait a couple of weeks to observe the new SL... [10:41:58] (03CR) 10Clément Goubert: [C:03+2] trafficserver: action api to rest-gateway group0 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198929 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:43:27] (03CR) 10Muehlenhoff: "That would work, alternative proposal inline (which doesn't interfere with people working late in the American timezones)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: 10Jelto) [10:44:32] (03PS1) 10Fabfur: P:cache:haproxy: don't repeat contact validation regex [puppet] - 10https://gerrit.wikimedia.org/r/1199251 (https://phabricator.wikimedia.org/T408060) [10:44:52] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [10:45:33] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group0 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198931 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:45:57] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group1 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198932 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:46:11] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group1 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198933 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:46:22] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group1 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198934 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:46:47] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group2 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198935 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:47:02] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group2 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198936 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:47:11] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group2 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198937 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:47:24] (03CR) 10Hnowlan: [C:03+1] 
trafficserver: action api to rest-gateway enwiki 10% [puppet] - 10https://gerrit.wikimedia.org/r/1198938 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:50:03] (03PS2) 10Clément Goubert: trafficserver: action api to rest-gateway group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198930 (https://phabricator.wikimedia.org/T408223) [10:50:37] !log installing openjdk-17 security updates [10:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:07] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway enwiki 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198939 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:51:17] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway enwiki 100% [puppet] - 10https://gerrit.wikimedia.org/r/1198940 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:51:35] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1198941 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [10:57:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:58:50] !log zabe@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply [11:00:03] !log zabe@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply [11:11:50] (03PS1) 10Stevemunene: druid: add druid-coordinator to druid public worker role [puppet] - 10https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222) [11:14:51] (03CR) 10Mahmoud-abdelsattar: [C:03+1] Enable the MEX / wbui2025 beta feature on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 
(https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [11:14:54] (03PS2) 10Stevemunene: druid: add druid-coordinator to druid public worker role [puppet] - 10https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222) [11:20:08] (03PS3) 10Stevemunene: LVS: etcd data for druid-public-coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) [11:20:12] (03PS4) 10Stevemunene: LVS: Add druid-public-coordinator to service list [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) [11:21:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 05 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1197613 (https://phabricator.wikimedia.org/T407737) (owner: 10Arthur taylor) [11:24:04] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:25:28] FIRING: [2x] SystemdUnitFailed: prometheus_amd_rocm_stats.service on ml-serve1012:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:48] (03PS1) 10Muehlenhoff: osm: Remove obsolete spec files [puppet] - 10https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) [11:29:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:29:26] (03PS10) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [11:30:32] !log elukey@cumin2002 
START - Cookbook sre.hosts.powercycle for host ml-serve2001 [11:31:39] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:31:48] I'm going to do a deployment to private code, related to Suggested Investigations [11:32:03] (03CR) 10Elukey: [C:03+1] osm: Remove obsolete spec files [puppet] - 10https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:33:55] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [11:35:59] (03CR) 10Muehlenhoff: [C:03+2] osm: Remove obsolete spec files [puppet] - 10https://gerrit.wikimedia.org/r/1199260 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [11:37:33] (03PS1) 10Brouberol: cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) [11:37:56] (03PS1) 10Brouberol: postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) [11:40:35] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve2001 [11:41:07] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host sretest2010 [11:42:12] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408478#11318289 (10Jclark-ctr) [11:42:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11318292 (10Jclark-ctr) →14Duplicate dup:03T408478 [11:42:50] (03PS1) 10Mvolz: Update Zotero to node22 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199263 (https://phabricator.wikimedia.org/T393434) [11:42:53] !log fceratto@cumin1003 START - Cookbook sre.hosts.decommission for hosts es2026.codfw.wmnet [11:42:53] !log 
elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host sretest2010 [11:43:31] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11318295 (10Jclark-ctr) 05Duplicate→03Open Closed by mistake [11:44:07] 10ops-eqiad, 06SRE, 06DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408478#11318299 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Down due to work with card install T400877 [11:44:34] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on A:swift-fe-codfw [11:45:40] (03CR) 10Slyngshede: [C:03+1] admin: add dpogorzelski to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [11:47:44] (03PS1) 10Muehlenhoff: osm_sync_lag.sh: Fix default to current directory [puppet] - 10https://gerrit.wikimedia.org/r/1199265 (https://phabricator.wikimedia.org/T381565) [11:47:57] (03CR) 10Stevemunene: [C:03+1] cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:48:04] (03CR) 10Stevemunene: [C:03+1] postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:48:52] !log fceratto@cumin1003 START - Cookbook sre.dns.netbox [11:49:06] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:49:08] (03CR) 10Brouberol: [C:03+2] postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:49:19] (03PS2) 10Brouberol: postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) [11:50:43] (03CR) 10Brouberol: [V:03+2 C:03+2] postgresql-growthbook: allow IPv4/6 remote TCP connections for the app user/db [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199262 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:50:47] (03CR) 10Brouberol: [V:03+2 C:03+2] cloudnative-pg-cluster: allow release values to override the pg_hba field [deployment-charts] - 10https://gerrit.wikimedia.org/r/1199261 (https://phabricator.wikimedia.org/T406578) (owner: 10Brouberol) [11:54:33] (03PS2) 10Fabfur: P:cache:haproxy: introduce ua classes [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) [11:54:35] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [11:54:36] fceratto@cumin1003 decommission (PID 372416) is awaiting input [11:59:27] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for neslihanturan - https://phabricator.wikimedia.org/T406590#11318342 (10Neslihan_Turan_WMDE) Hi, sorry for the delay. I had a problem accessing Slack but now I managed to sent my public key to Amir. My public key is already... 
[12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1200) [12:00:36] Noting that I'll finish my deployment to private code in 2-3 minutes [12:01:16] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11318344 (10Jclark-ctr) @VRiley-WMF Hey, just a heads up — the fiber was installed with RX-to-RX and TX-to-TX, so the polarity wasn’t verified. Make sure to check polarity next time to avoid c... [12:04:38] !log Deployed changes to Suggested Investigations [12:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:44] I'm finished with deploying [12:08:08] 10ops-eqiad, 06SRE, 06DC-Ops: Eqiad: row C/D switch refresh cabling task - https://phabricator.wikimedia.org/T396065#11318379 (10cmooney) >>! In T396065#11318344, @Jclark-ctr wrote: > @cmooney link is up Ok great yep BGP looking good I've added it now. ` cmooney@ssw1-e1-eqiad> show bgp summary group core |... 
[12:08:51] (03PS1) 10Muehlenhoff: maps: Stop installing osm2pgsql and osmborder [puppet] - 10https://gerrit.wikimedia.org/r/1199271 (https://phabricator.wikimedia.org/T381565) [12:09:14] (03PS1) 10Cathal Mooney: ssw1-e1-eqiad: Add BGP peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1199272 (https://phabricator.wikimedia.org/T396065) [12:12:05] (03CR) 10Vgutierrez: [C:04-1] P:cache:haproxy: introduce ua classes (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [12:16:35] (03CR) 10Dpogorzelski: [C:03+1] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1198343 (https://phabricator.wikimedia.org/T407955) (owner: 10Kamila Součková) [12:19:43] (03CR) 10Hnowlan: [C:03+1] trafficserver: action api to rest-gateway group0 50% [puppet] - 10https://gerrit.wikimedia.org/r/1198930 (https://phabricator.wikimedia.org/T408223) (owner: 10Clément Goubert) [12:19:57] (03CR) 10Cathal Mooney: [C:03+2] ssw1-e1-eqiad: Add BGP peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1199272 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [12:21:15] (03Merged) 10jenkins-bot: ssw1-e1-eqiad: Add BGP peering to ssw1-d8-eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/1199272 (https://phabricator.wikimedia.org/T396065) (owner: 10Cathal Mooney) [12:24:09] !log fceratto@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2026.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [12:26:28] Msz2001: is deploying a follow up [12:27:14] fceratto@cumin1003 decommission (PID 372416) is awaiting input [12:27:27] these issues appeared after the previous deploy https://logstash.wikimedia.org/goto/d13b6c9cd8e42929d855b4c081e43484 [12:35:20] Deployed [12:44:45] (03PS1) 10Stevemunene: druid: Increase the size of the Druid broker cache size to 4GB 
[puppet] - 10https://gerrit.wikimedia.org/r/1199280 (https://phabricator.wikimedia.org/T408189) [12:45:22] !log sukhe@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: reboot [12:46:03] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet [12:49:18] 10ops-eqiad, 06SRE, 06DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11318475 (10Jclark-ctr) a:05Jclark-ctr→03None [12:49:48] !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet [12:53:07] !log fceratto@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es2026.codfw.wmnet decommissioned, removing all IPs except the asset tag one - fceratto@cumin1003" [12:53:07] !log fceratto@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:53:08] !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es2026.codfw.wmnet [12:55:28] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:00:05] Urbanecm and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1300). [13:00:06] Bunnypranav and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:53] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on A:swift-fe-codfw [13:01:15] hi [13:03:07] anyone deploying? [13:04:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:06:09] !log sukhe@cumin1003 START - Cookbook sre.hosts.reboot-single for host lvs2011.codfw.wmnet [13:06:13] (03PS5) 10Gehel: Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) [13:07:15] (03PS2) 10Muehlenhoff: Shift tile eqiad invalidation to the bookworm master [puppet] - 10https://gerrit.wikimedia.org/r/1195717 (https://phabricator.wikimedia.org/T381565) [13:08:08] (03CR) 10CDanis: git_ssh_proxy: add role::git_ssh_proxy for Gerrit and GitLab ssh proxies (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1198281 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [13:08:23] (03CR) 10Gehel: [C:03+2] Hadoop: cleanup /tmp with systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/1199239 (https://phabricator.wikimedia.org/T396582) (owner: 10Gehel) [13:10:29] (03Abandoned) 10Muehlenhoff: Shift tile eqiad invalidation to the bookworm master [puppet] - 10https://gerrit.wikimedia.org/r/1195717 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:11:13] (03CR) 10Muehlenhoff: "The mwdebug servers are gone" [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) (owner: 10Muehlenhoff) [13:11:20] (03PS2) 10Muehlenhoff: Remove obsolete appserver cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) [13:14:04] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 
'apply'. [13:14:54] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [13:17:38] MatmaRex, I can help if you'll assist with testing :) [13:17:46] !log sukhe@cumin1003 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host lvs2011.codfw.wmnet [13:17:50] Are you still around? [13:17:58] hi :) thanks [13:18:28] 10ops-codfw, 06DC-Ops, 06Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549 (10ssingh) 03NEW [13:18:29] Seems like Bunnypranav is not around [13:18:36] 10ops-codfw, 06DC-Ops, 06Traffic: lvs2011 hardware issue after reboot - https://phabricator.wikimedia.org/T408549#11318574 (10ssingh) p:05Triage→03High [13:18:37] So I'll just quickly do MatmaRex's [13:18:50] Hi! [13:19:07] Bit late, apologies. I'm fine with waiting [13:19:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199074 (https://phabricator.wikimedia.org/T408447) (owner: 10Bartosz Dziewoński) [13:20:03] bunnypranav, okay! Will signal you once I'm done, thanks! 
[13:20:13] Sure :) [13:20:39] (03Merged) 10jenkins-bot: Make wgVectorMaxWidthOptions specify Special:Userlogin correctly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199074 (https://phabricator.wikimedia.org/T408447) (owner: 10Bartosz Dziewoński) [13:21:13] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1199074|Make wgVectorMaxWidthOptions specify Special:Userlogin correctly (T408447)]] [13:21:19] T408447: Under Vector 2022 on Wikimedia wikis, page width is different between Special:UserLogin and Special:CreateAccount - https://phabricator.wikimedia.org/T408447 [13:23:23] (03PS1) 10Mszwarc: Remove hCaptcha site key from private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199291 [13:23:50] (03CR) 10Kosta Harlan: [C:03+1] Remove hCaptcha site key from private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199291 (owner: 10Mszwarc) [13:24:14] xSavitar MatmaRex we need to sync the above patch ^ [13:24:15] (03PS14) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [13:25:04] are either of you able to sync that? it should be a no-op. if not, either me or Msz2001 can do it [13:25:08] !log derick@deploy2002 derick, matmarex: Backport for [[gerrit:1199074|Make wgVectorMaxWidthOptions specify Special:Userlogin correctly (T408447)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:25:12] kostajh, sure! After bunnypranav or now? [13:25:25] MatmaRex, you can test [13:25:26] as soon as possible, I'd say [13:25:49] my change looks good [13:25:53] Okay, once MatmaRex is done testing, maybe you can take over before bunnypranav (just an idea). That is if bunnypranav is up for it. [13:26:05] MatmaRex, okay will sync now. [13:26:06] I'm fine, can wait if needed. 
[13:26:12] !log derick@deploy2002 derick, matmarex: Continuing with sync [13:26:38] kostajh, okay bunnypranav agrees. I'll poke you once MatmaRex's patch is done syncing. [13:27:39] kostajh, I can also help in doing it. [13:28:22] (03CR) 10Ottomata: Disable default user-agent collection. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: 10JavierMonton) [13:29:02] thank you! [13:29:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:29:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:29:39] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:29:46] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [13:29:49] (03PS15) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [13:29:49] (03CR) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [13:32:10] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199074|Make wgVectorMaxWidthOptions specify Special:Userlogin correctly (T408447)]] (duration: 10m 56s) [13:32:14] T408447: Under Vector 2022 on Wikimedia wikis, page width is different between Special:UserLogin and Special:CreateAccount - https://phabricator.wikimedia.org/T408447 [13:33:05] (03CR) 10Muehlenhoff: "Looks good to me!" 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: 10Muehlenhoff) [13:33:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199291 (owner: 10Mszwarc) [13:33:36] kostajh, so nothing to test I suppose? [13:33:45] xSavitar: nothing to test [13:33:57] Ack! Will just sync it when it's time then, thanks~ [13:34:01] *! [13:34:16] (03Merged) 10jenkins-bot: Remove hCaptcha site key from private/readme.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199291 (owner: 10Mszwarc) [13:34:48] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1199291|Remove hCaptcha site key from private/readme.php]] [13:35:35] thanks for deploying xSavitar [13:35:59] 06SRE, 06collaboration-services, 06Traffic, 06Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11318699 (10LSobanski) [13:36:22] MatmaRex, thank you :) [13:38:53] !log derick@deploy2002 mszwarc, derick: Backport for [[gerrit:1199291|Remove hCaptcha site key from private/readme.php]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:39:16] !log derick@deploy2002 mszwarc, derick: Continuing with sync [13:39:42] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO:Switch refresh diagram - https://phabricator.wikimedia.org/T408511#11318700 (10Papaul) @cmooney thanks for the feedback, I will upgrade the diagram to match the 100G links between the core routers and the switches and the type of... [13:42:43] bunnypranav, 64% done, will hand over to you in a few mins. [13:42:56] sure! [13:43:46] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199291|Remove hCaptcha site key from private/readme.php]] (duration: 08m 58s) [13:43:55] bunnypranav over to you. 
[13:44:18] and thank you for your patience. 🙏🏽 [13:44:27] No worries [13:45:21] I need some help of yours as well, the patch is a creation of a namespace; do we need to run any maintenance scripts? [13:46:17] btw, the namespace is "R:", and they already use that prefix, technically in the mainspace, so I assume the former. [13:46:25] xSavitar: ^^^ [13:46:38] bunnypranav: run namespacedupes [13:46:49] anzx beat me to it. [13:47:23] I assume the pages won't be lost, right? [13:49:30] (03PS2) 10JavierMonton: Disable default user-agent collection. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) [13:49:32] bunnypranav, I think everything should be fine. [13:49:36] bunnypranav: https://www.mediawiki.org/wiki/Manual:NamespaceDupes.php add a prefix to check whether any pages that are lost/unmoved/need to be moved manually can be retrieved [13:49:53] Are there any pages that are already in that namespace? In the past? [13:50:12] I guess I shouldn't say namespace but prefixed by R: [13:50:28] FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:37] After running that script, everything should work correctly and they should be part of the R: and R_talk: namespaces I suppose. [13:51:14] Okay! [13:51:19] * xSavitar runs for a meeting... [13:51:28] xSavitar: BTW I need you to deploy it for me, I am just a volunteer. [13:51:57] (03CR) 10Giuseppe Lavagetto: "I think the patch goes in the right direction, but is overcomplicated and misses a couple things:" [puppet] - 10https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: 10Fabfur) [13:52:12] bunnypranav, Oh, I could do that, but I'm having a meeting now. Will you be fine doing the next backport window? 
That is if another deployer isn't around to help. [13:52:15] (03CR) 10JavierMonton: Disable default user-agent collection. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: 10JavierMonton) [13:52:27] I thought you would be the one deploying, apologies, I would have asked. [13:52:31] The next window is 1:30 am for me [13:52:49] It's fine [13:53:22] Oops :(, I'll ping you here in a few hours (later this evening). If there is an open window, we can deploy your patch. [13:53:39] Otherwise, we can do it tomorrow afternoon (that's when I'll be available). [13:53:54] Is that okay by you? [13:54:21] (03CR) 10Clément Goubert: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) (owner: 10Pmiazga) [13:54:28] Fine, I'll see if I am available tomorrow. [13:54:46] These deploy windows are pretty tough for Asian timezones [13:55:10] bunnypranav, FYI - these are the docs for adding a new namespace: https://wikitech.wikimedia.org/wiki/Adding_namespaces [13:55:15] I hope it's still up to date. [13:55:19] Can I ping you in a few hours once I am available as well? [13:55:34] bunnypranav, yes ping me please. I want to help. [13:55:48] Thank you so much! [13:56:01] bunnypranav, no thank you for all the work. 🙏🏽 [13:56:12] :D [13:56:31] Re tz friendliness, maybe you can ask on #wikimedia-releng about it. [13:56:52] But we have multiple of these windows per day so I'm pretty sure one of them is friendly to your TZ [13:57:11] * xSavitar goes AFK to attend a meeting. [13:57:28] Checked the wikitech page earlier, commit is fine; just needed confirmation on the maintenance scripts [13:58:07] yeah, the afternoon one was fine, today I was busy for the morning one, so couldn't schedule for it. 
[14:00:05] Deploy window Metrics Platform Experimentation Lab Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251028T1400) [14:01:51] (03CR) 10Elukey: [C:03+1] osm_sync_lag.sh: Fix default to current directory [puppet] - 10https://gerrit.wikimedia.org/r/1199265 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:02:12] (03CR) 10Elukey: [C:03+1] maps: Stop installing osm2pgsql and osmborder [puppet] - 10https://gerrit.wikimedia.org/r/1199271 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:02:41] (03CR) 10Elukey: [C:03+1] LVS: etcd data for druid-public-coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1198498 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:02:58] (03CR) 10Elukey: [C:03+1] LVS: Add druid-public-coordinator to service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1198499 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:03:12] (03CR) 10Elukey: [C:03+1] druid: add druid-coordinator to druid public worker role [puppet] - 10https://gerrit.wikimedia.org/r/1199256 (https://phabricator.wikimedia.org/T406222) (owner: 10Stevemunene) [14:05:32] (03PS16) 10Pmiazga: api-gateway: rest gw should call ratelimit only when x-wmf-user-class header is present [deployment-charts] - 10https://gerrit.wikimedia.org/r/1191318 (https://phabricator.wikimedia.org/T405574) [14:05:51] (03PS1) 10Brouberol: global_config: add an urldownloader external service [puppet] - 10https://gerrit.wikimedia.org/r/1199297 (https://phabricator.wikimedia.org/T408012)