[00:08:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1151817 [00:08:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1151817 (owner: 10TrainBranchBot) [00:12:24] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd200[567] - https://phabricator.wikimedia.org/T393614#10866283 (10Jhancock.wm) a:03Jhancock.wm [00:15:25] !log dreamyjazz Deployed security patch for T394692 [00:17:32] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-5d889dc5fc-4l2v5:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [00:27:28] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install db2244 - https://phabricator.wikimedia.org/T393195#10866298 (10Jhancock.wm) a:03Jhancock.wm [00:31:24] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1151817 (owner: 10TrainBranchBot) [00:33:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [00:34:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [00:39:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [00:39:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc-misc2002.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [00:42:53] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: 
Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10866312 (10Jhancock.wm) 05Open→03Resolved this looks like it's a server failure. the server won't boot. I'm gonna open a troubleshooting ticket rather than reopen the instal... [00:47:14] 10ops-codfw, 06SRE, 06DC-Ops: mc-misc2001 won't power up - https://phabricator.wikimedia.org/T395526 (10Jhancock.wm) 03NEW [00:57:48] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10866331 (10KFrancis) Hello all, the NDA has been sent for signatures. I'll confirm when it's complete. [01:00:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye [01:00:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-be200[6-9] - https://phabricator.wikimedia.org/T392908#10866334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-be2006.codfw.wmnet with OS bu... 
[01:25:45] !log restructured core patches in /srv/patches/1.45.0-wmf.2 and /srv/patches/1.45.0-wmf.3 (T395528) [01:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:27:33] RESOLVED: CalicoHighMemoryUsage: Calico container calico-kube-controllers-5d889dc5fc-4l2v5:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [01:27:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [01:37:08] !log Re-deployed security fix for T394396 to 1.45.0-wmf.2 [01:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:44:32] FIRING: CalicoHighMemoryUsage: Calico container calico-kube-controllers-5d889dc5fc-4l2v5:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [01:48:35] !log Re-deployed security fix for T394396 to 1.45.0-wmf.3 [01:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frnetmon1002, pay-lb1001, pay-lb1002 - https://phabricator.wikimedia.org/T369565#10866399 (10Dwisehaupt) 
Need to reopen this on the DC-Ops side. Other projects have pushed this aside but I went today to try build these hosts. I enc... [02:43:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:44:33] RESOLVED: CalicoHighMemoryUsage: Calico container calico-kube-controllers-5d889dc5fc-4l2v5:calico-kube-controllers is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://grafana.wikimedia.org/d/2AfU0X_Mz?var-site=eqiad&var-prometheus=k8s&var-container_name=calico-kube-controllers - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [03:15:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:17:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [03:25:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [04:11:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [04:21:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [04:29:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [04:34:59] !log oblivian@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "T394511 - oblivian@cumin2002" [04:35:01] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: T394511 - oblivian@cumin2002 [04:35:04] T394511: Consolidate development of requestctl in one repository - https://phabricator.wikimedia.org/T394511 [04:35:29] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: T394511 - oblivian@cumin2002 [04:35:31] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "T394511 - oblivian@cumin2002" [04:39:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from 
flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [05:10:12] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:20:27] (03PS1) 10Marostegui: site.pp: Move x3 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1151829 (https://phabricator.wikimedia.org/T351820) [05:21:45] (03PS2) 10Marostegui: site.pp: Move x3 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1151829 (https://phabricator.wikimedia.org/T351820) [05:22:30] (03CR) 10Marostegui: "This should match the topology at https://orchestrator.wikimedia.org/web/cluster/alias/x3" [puppet] - 10https://gerrit.wikimedia.org/r/1151829 (https://phabricator.wikimedia.org/T351820) (owner: 10Marostegui) [05:24:28] (03PS1) 10Marostegui: wmnet: Change m2-master [dns] - 10https://gerrit.wikimedia.org/r/1151830 (https://phabricator.wikimedia.org/T395241) [05:24:59] !log failover m2 master eqiad dbmaint T395241 [05:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:09] (03CR) 10Marostegui: [C:03+2] wmnet: Change m2-master [dns] - 10https://gerrit.wikimedia.org/r/1151830 (https://phabricator.wikimedia.org/T395241) (owner: 10Marostegui) [05:25:12] !log marostegui@dns1006 START - running authdns-update [05:25:58] !log marostegui@dns1006 END - running authdns-update [05:29:06] (03PS1) 10Samwilson: iInitialiseSettings: wgTemplateDataEnableDiscovery on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151831 (https://phabricator.wikimedia.org/T377975) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T0600) [06:00:05] marostegui, Amir1, and federico3: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T0600). [06:00:12] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:16:34] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts cirrussearch[1060-1062].eqiad.wmnet [06:20:17] ryankemper@cumin2002 decommission (PID 18488) is awaiting input [06:30:20] ryankemper@cumin2002 decommission (PID 18488) is awaiting input [06:34:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [06:39:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [06:40:07] (03CR) 10Elukey: [C:03+2] sre.hosts.provision: allow more Drive types for Dell's NVMe settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1151717 (https://phabricator.wikimedia.org/T392844) (owner: 10Elukey) [06:46:33] ryankemper@cumin2002 decommission (PID 18488) is awaiting input [06:47:06] hmm, shouldn't be awaiting input. 
checking process list [06:47:25] (03CR) 10Ecarg: [C:03+2] functions-orchestrator: add mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [06:47:30] ah my tmux isn't updating properly, had to exit and reattach [06:49:09] (03Merged) 10jenkins-bot: functions-orchestrator: add mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149633 (https://phabricator.wikimedia.org/T391986) (owner: 10Effie Mouzeli) [06:52:44] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [06:55:56] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1060-1062].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [06:56:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [06:59:02] ryankemper@cumin2002 decommission (PID 18488) is awaiting input [07:00:05] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:01:04] (03PS7) 10Winston Sung: Make Wikifunctions $wgTranslateDisabledTargetLanguages align with the translate target languages of ZObjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143697 (https://phabricator.wikimedia.org/T328838) [07:01:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [07:02:49] (03PS8) 10Winston Sung: Make Wikifunctions $wgTranslateDisabledTargetLanguages use the translatewiki-model translate target languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143697 (https://phabricator.wikimedia.org/T328838) [07:05:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1060-1062].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [07:05:11] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:05:12] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1060-1062].eqiad.wmnet [07:18:55] Anyone around for backport window deployments? [07:22:14] Anyone available for backport window deployments? [07:24:58] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10866620 (10elukey) To keep archives happy - I merged the change, please let me know how it goes for the next hosts @Jclark-ctr! 
[07:25:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [07:27:09] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:27:18] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:27:35] (03PS5) 10Elukey: kserve-inference: set seccomp defaults in the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) [07:27:35] (03PS5) 10Elukey: admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) [07:29:56] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:30:02] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:32:55] I've added https://gerrit.wikimedia.org/r/c/1143697 to the current UTC morning backport window, would be good if anyone can help with it. [07:36:54] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Nice, everything is much cleaner now!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [07:37:46] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:37:51] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:38:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143697 (https://phabricator.wikimedia.org/T328838) (owner: 10Winston Sung) [07:40:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [07:40:51] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [07:40:57] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:44:52] Winston_Sung: hey, if you're still around I can deploy [07:45:05] Thanks. 
[07:46:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143697 (https://phabricator.wikimedia.org/T328838) (owner: 10Winston Sung) [07:47:16] (03Merged) 10jenkins-bot: Make Wikifunctions $wgTranslateDisabledTargetLanguages use the translatewiki-model translate target languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1143697 (https://phabricator.wikimedia.org/T328838) (owner: 10Winston Sung) [07:47:53] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1143697|Make Wikifunctions $wgTranslateDisabledTargetLanguages use the translatewiki-model translate target languages (T328838)]] [07:47:58] T328838: Multidirectional language conversion for content pages using LanguageConverter should be prevented on multilingual wikis - https://phabricator.wikimedia.org/T328838 [07:50:16] !log dcausse@deploy1003 wsung, dcausse: Backport for [[gerrit:1143697|Make Wikifunctions $wgTranslateDisabledTargetLanguages use the translatewiki-model translate target languages (T328838)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:50:37] Winston_Sung: your change is available on test servers, please let me know if everything's fine [07:50:47] Verifying... [07:53:01] (03CR) 10Volans: "@rkemper@wikimedia.org When removing cookbooks please make sure to follow also [1], thanks." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1151797 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [07:53:32] (03PS2) 10Winston Sung: Fix disabled description for Wikifunctions $wgDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152004 [07:53:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152004 (owner: 10Winston Sung) [07:54:13] A follow-up fix: https://gerrit.wikimedia.org/r/1152004 . [07:55:05] (03PS3) 10Winston Sung: Fix "disabled target language" message for Wikifunctions $wgDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152004 [07:55:17] (03PS4) 10Winston Sung: Fix "disabled target language" message for Wikifunctions $wgDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152004 (https://phabricator.wikimedia.org/T328838) [07:56:04] dcausse: Could you help applying the above patch? Thanks. [07:56:23] Winston_Sung: can the current change go out? [07:56:41] I don't think I can stack another patch during the same deploy [07:56:52] Yes. It's just a message fix. 
[07:56:56] ack [07:57:18] ok shipping the first one first and reviewing your fixup [07:57:35] !log dcausse@deploy1003 wsung, dcausse: Continuing with sync [07:57:51] jouncebot: next [07:57:51] In 0 hour(s) and 2 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T0800) [07:58:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [08:00:05] dancy and andre: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T0800). 
[08:00:49] (03CR) 10DCausse: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152004 (https://phabricator.wikimedia.org/T328838) (owner: 10Winston Sung) [08:02:00] still running the backport window sorry (cc andre, dancy) [08:03:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [08:03:53] (03CR) 10Stevemunene: [C:03+1] airflow: Stop the airflow services on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1151769 (https://phabricator.wikimedia.org/T395495) (owner: 10Btullis) [08:04:41] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1143697|Make Wikifunctions $wgTranslateDisabledTargetLanguages use the translatewiki-model translate target languages (T328838)]] (duration: 16m 47s) [08:04:47] T328838: Multidirectional language conversion for content pages using LanguageConverter should be prevented on multilingual wikis - https://phabricator.wikimedia.org/T328838 [08:05:12] Winston_Sung: first one deployed, shipping the followup [08:05:46] Acknowledged. Thanks. 
[08:06:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152004 (https://phabricator.wikimedia.org/T328838) (owner: 10Winston Sung) [08:07:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [08:08:04] (03Merged) 10jenkins-bot: Fix "disabled target language" message for Wikifunctions $wgDisabledTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152004 (https://phabricator.wikimedia.org/T328838) (owner: 10Winston Sung) [08:08:28] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1152004|Fix "disabled target language" message for Wikifunctions $wgDisabledTargetLanguages (T328838)]] [08:10:37] !log dcausse@deploy1003 wsung, dcausse: Backport for [[gerrit:1152004|Fix "disabled target language" message for Wikifunctions $wgDisabledTargetLanguages (T328838)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:10:42] T328838: Multidirectional language conversion for content pages using LanguageConverter should be prevented on multilingual wikis - https://phabricator.wikimedia.org/T328838 [08:11:04] Verifying... [08:11:07] thanks [08:11:26] Verified. 
[08:11:32] ok shipping [08:11:36] !log dcausse@deploy1003 wsung, dcausse: Continuing with sync [08:11:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [08:12:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [08:13:53] (03PS1) 10Effie Mouzeli: hieradata: Make wikikube-worker2300 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [08:18:37] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152004|Fix "disabled target language" message for Wikifunctions $wgDisabledTargetLanguages (T328838)]] (duration: 10m 08s) [08:18:41] T328838: Multidirectional language conversion for content pages using LanguageConverter should be prevented on multilingual wikis - https://phabricator.wikimedia.org/T328838 [08:18:57] Winston_Sung: should be deployed :) [08:20:08] Confirmed. Thanks for the help. [08:20:23] you're welcome! 
[08:20:44] !log closing the UTC morning backport window [08:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [08:25:36] (03CR) 10Effie Mouzeli: [C:03+2] deployment:fix-staging-perm: update fix-staging-perms [puppet] - 10https://gerrit.wikimedia.org/r/1151753 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [08:26:28] (03PS1) 10Volans: Automatic reformat: noop change [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152007 [08:26:28] (03PS1) 10Volans: Automatic reformat: move to double quote strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152008 [08:26:28] (03PS1) 10Volans: doc: small improvements in the config file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152009 [08:26:29] (03PS1) 10Volans: tests: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152010 [08:26:29] (03PS1) 10Volans: wmflib: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152011 [08:26:31] (03PS1) 10Volans: dns: alias DnsNotFound to DnsNotFoundError [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152012 [08:26:35] (03PS1) 10Volans: config: make the raises argument keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152013 [08:26:39] (03PS1) 10Volans: phabricator: make secondary arguments keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152014 [08:26:43] (03PS1) 10Volans: Automatic reformat: reorder imports [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152015 [08:26:47] (03PS1) 10Volans: tox: completely refactor static checkers/linters [software/pywmflib] - 
10https://gerrit.wikimedia.org/r/1152016 [08:27:19] * elukey runs away [08:30:17] <3 [08:31:36] (03CR) 10CI reject: [V:04-1] Automatic reformat: noop change [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152007 (owner: 10Volans) [08:31:56] oh nooo [08:32:32] (03CR) 10CI reject: [V:04-1] wmflib: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152011 (owner: 10Volans) [08:32:32] (03CR) 10CI reject: [V:04-1] dns: alias DnsNotFound to DnsNotFoundError [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152012 (owner: 10Volans) [08:32:44] (03CR) 10Volans: "recheck" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152007 (owner: 10Volans) [08:32:56] (03CR) 10CI reject: [V:04-1] doc: small improvements in the config file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152009 (owner: 10Volans) [08:33:01] (03CR) 10CI reject: [V:04-1] phabricator: make secondary arguments keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152014 (owner: 10Volans) [08:33:02] (03CR) 10CI reject: [V:04-1] Automatic reformat: move to double quote strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152008 (owner: 10Volans) [08:33:05] (03CR) 10CI reject: [V:04-1] config: make the raises argument keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152013 (owner: 10Volans) [08:33:08] (03CR) 10CI reject: [V:04-1] tests: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152010 (owner: 10Volans) [08:33:09] (03CR) 10CI reject: [V:04-1] Automatic reformat: reorder imports [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152015 (owner: 10Volans) [08:34:01] sigh, sorry for the spam, ofc worked fine locally [08:34:09] (03CR) 10CI reject: [V:04-1] tox: completely refactor static checkers/linters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152016 (owner: 10Volans) [08:35:11] yes yes usual excuse [08:35:13] :D [08:42:45] RESOLVED: 
CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [08:43:46] (03PS1) 10Volans: setup.py: pin prospector [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152019 [08:51:56] (03CR) 10Volans: [C:03+2] "Unblocking CI" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152019 (owner: 10Volans) [08:54:22] (03CR) 10Btullis: [V:03+1 C:03+2] airflow: Stop the airflow services on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1151769 (https://phabricator.wikimedia.org/T395495) (owner: 10Btullis) [08:56:02] (03PS1) 10DCausse: search: cirrussearch1110 is in the psi cluster not omega [puppet] - 10https://gerrit.wikimedia.org/r/1152020 [08:56:43] (03Merged) 10jenkins-bot: setup.py: pin prospector [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152019 (owner: 10Volans) [08:57:03] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage [08:57:09] (03PS2) 10Volans: Automatic reformat: noop change [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152007 [08:57:09] (03PS2) 10Volans: Automatic reformat: move to double quote strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152008 [08:57:09] (03PS2) 10Volans: doc: small improvements in the config file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152009 [08:57:09] (03PS2) 10Volans: tests: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152010 [08:57:10] (03PS2) 10Volans: wmflib: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152011 [08:57:12] (03PS2) 10Volans: dns: alias DnsNotFound to DnsNotFoundError [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152012 [08:57:16] (03PS2) 
10Volans: config: make the raises argument keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152013 [08:57:20] (03PS2) 10Volans: phabricator: make secondary arguments keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152014 [08:57:24] (03PS2) 10Volans: Automatic reformat: reorder imports [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152015 [08:57:28] (03PS2) 10Volans: tox: completely refactor static checkers/linters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152016 [08:58:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [09:01:23] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1003.eqiad.wmnet with reason: host reimage [09:03:16] (03CR) 10Btullis: [C:03+2] search: cirrussearch1110 is in the psi cluster not omega [puppet] - 10https://gerrit.wikimedia.org/r/1152020 (owner: 10DCausse) [09:03:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [09:11:54] (03CR) 10Cathal Mooney: [C:03+1] Add alerting for long peering BGP down [alerts] - 10https://gerrit.wikimedia.org/r/1151551 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:19:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [09:20:05] (03PS2) 10Effie Mouzeli: hieradata: Make wikikube-worker2300 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [09:22:17] FIRING: [2x] ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:27:17] RESOLVED: [2x] ProbeDown: Service wdqs2015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:31:10] !log btullis@cumin1002 conftool action : set/weight=10; selector: name=cirrussearch1110.eqiad.wmnet,service=elasticsearch-psi-ss [09:31:20] !log btullis@cumin1002 conftool action : set/pooled=yes; selector: name=cirrussearch1110.eqiad.wmnet,service=elasticsearch-psi-ss [09:31:35] !log btullis@cumin1002 conftool action : set/weight=10; selector: name=cirrussearch1110.eqiad.wmnet,service=elasticsearch-psi-ssl [09:31:42] !log btullis@cumin1002 conftool action : set/pooled=yes; selector: name=cirrussearch1110.eqiad.wmnet,service=elasticsearch-psi-ssl [09:32:06] (03CR) 10Tiziano Fogli: centrallog: Add a temporary rsyslog debug config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [09:32:58] !log deleting 4 red indices in cirrussearch-psi@eqiad [09:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:13] !log deleting 55 red indices in cirrussearch-omega@eqiad [09:34:16] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:08] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1152022 (https://phabricator.wikimedia.org/T395544) [09:35:12] (03PS1) 10Gerrit maintenance bot: wmnet: Update es7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1152023 (https://phabricator.wikimedia.org/T395544) [09:35:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148299 (https://phabricator.wikimedia.org/T394752) (owner: 10SimmeD) [09:36:15] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2039 to es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1152024 (https://phabricator.wikimedia.org/T395545) [09:37:33] (03CR) 10Elukey: [C:03+2] kserve-inference: set seccomp defaults in the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:39:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [09:41:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [09:43:45] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
an-druid1003.eqiad.wmnet with OS bullseye [09:46:37] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10866824 (10cmooney) @Jhancock.wm can you have a look at running the above connections when you get a chance? No particular rush from my side so whenever you get to it. Once it's done we sh... [09:53:12] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [09:55:25] FIRING: ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1000) [10:00:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:02:42] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [10:06:02] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [10:06:48] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [10:07:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool es2039 T395294', diff saved to https://phabricator.wikimedia.org/P76665 and previous config saved to /var/cache/conftool/dbconfig/20250529-100704-fceratto.json [10:07:10] T395294: High MariaDB memory usage on es1035, es2038 and es2039 - https://phabricator.wikimedia.org/T395294 [10:07:39] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[10:08:44] (03PS1) 10Ilias Sarantopoulos: ml-services: fix articlequality model dir in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152027 (https://phabricator.wikimedia.org/T393865) [10:08:55] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [10:09:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:09:08] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:09:18] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [10:09:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for es2039.codfw.wmnet [10:09:46] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:10:01] (03CR) 10Elukey: [C:03+1] ml-services: fix articlequality model dir in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152027 (https://phabricator.wikimedia.org/T393865) (owner: 10Ilias Sarantopoulos) [10:10:03] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es2039 - Upgrading es2039.codfw.wmnet [10:10:11] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2039 - Upgrading es2039.codfw.wmnet [10:10:11] !log T395546: creating empty content indices in eqiad for afwiktionary biwiktionary bowikibooks collabwiki cywikiquote fawikibooks iewikibooks kywiktionary mnwiktionary sgwiki svwikiquote trwikisource wikimania2007wiki [10:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:18] T395546: opensearch psi and omega clusters red in eqiad - https://phabricator.wikimedia.org/T395546 [10:10:22] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: fix articlequality model dir in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152027 
(https://phabricator.wikimedia.org/T393865) (owner: 10Ilias Sarantopoulos) [10:10:29] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [10:11:50] (03Merged) 10jenkins-bot: ml-services: fix articlequality model dir in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152027 (https://phabricator.wikimedia.org/T393865) (owner: 10Ilias Sarantopoulos) [10:12:07] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [10:12:23] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [10:13:10] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [10:13:19] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:13:26] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:13:33] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:13:40] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:13:48] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:13:57] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[10:15:17] !log T395546: creating empty general indices in eqiad for akwikibooks avkwiki azwikibooks bmwikiquote cswikiversity ladwiki liwiktionary nnwiktionary nowikibooks pnbwiktionary ptwikiquote vowikibooks wikimania2018wiki xhwikibooks zhwikiversity [10:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:21] T395546: opensearch psi and omega clusters red in eqiad - https://phabricator.wikimedia.org/T395546 [10:18:44] (03CR) 10Federico Ceratto: [C:03+1] "I see db2187 db2241 db2242 db2243 being moved from s8 to x3. They are all in x3 on orchestrator, with 2241 being the DC master." [puppet] - 10https://gerrit.wikimedia.org/r/1151829 (https://phabricator.wikimedia.org/T351820) (owner: 10Marostegui) [10:18:53] (03PS1) 10Giuseppe Lavagetto: cache::haproxy: remove unused if stanzas [puppet] - 10https://gerrit.wikimedia.org/r/1152028 [10:18:53] (03PS1) 10Giuseppe Lavagetto: cache::haproxy: start optimizing for readability [puppet] - 10https://gerrit.wikimedia.org/r/1152029 [10:19:20] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/1152030 [10:19:51] (03CR) 10Marostegui: [C:03+2] site.pp: Move x3 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1151829 (https://phabricator.wikimedia.org/T351820) (owner: 10Marostegui) [10:22:17] (03PS6) 10Elukey: admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) [10:22:17] (03PS1) 10Elukey: ml-services: fix articletopic-outlink config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152031 [10:22:22] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): hw troubleshooting: hard drive for an-worker1119 - https://phabricator.wikimedia.org/T395549 (10Stevemunene) 03NEW [10:23:01] (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152029 (owner: 10Giuseppe Lavagetto) 
[10:24:00] (03Abandoned) 10Effie Mouzeli: WIP: profile::kubernetes::node: Add script to pull and mount latest mw [puppet] - 10https://gerrit.wikimedia.org/r/1148905 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [10:24:10] (03CR) 10Elukey: [C:03+2] ml-services: fix articletopic-outlink config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152031 (owner: 10Elukey) [10:25:24] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:26:38] * Emperor here [10:27:01] marostegui is doing something, waiting for him [10:27:52] (03PS1) 10Marostegui: db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152032 [10:28:12] (03PS2) 10Giuseppe Lavagetto: cache::haproxy: start optimizing for readability [puppet] - 10https://gerrit.wikimedia.org/r/1152029 [10:28:23] (03CR) 10Jcrespo: [C:03+1] db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152032 (owner: 10Marostegui) [10:29:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by marostegui@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152032 (owner: 10Marostegui) [10:29:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): hw troubleshooting: hard drive for an-worker1119 - https://phabricator.wikimedia.org/T395549#10866935 (10Stevemunene) a:03Jclark-ctr [10:29:30] (03PS7) 10Elukey: admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) [10:29:30] (03PS1) 10Elukey: ml-services: fix revscoring-editquality-reverted config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152034 [10:29:57] (03Merged) 10jenkins-bot: db-production.php: Disable writes on es7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152032 (owner: 10Marostegui) 
[10:30:21] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1152032|db-production.php: Disable writes on es7]] [10:30:53] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5716/co" [puppet] - 10https://gerrit.wikimedia.org/r/1152029 (owner: 10Giuseppe Lavagetto) [10:31:31] (03CR) 10Elukey: [C:03+2] ml-services: fix revscoring-editquality-reverted config in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152034 (owner: 10Elukey) [10:31:51] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [10:32:32] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1152032|db-production.php: Disable writes on es7]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:33:20] !log marostegui@deploy1003 marostegui: Continuing with sync [10:33:26] !log elukey@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:38:01] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for es2039.codfw.wmnet [10:39:34] (03CR) 10Zoe: [C:03+1] "looks reasonable" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139808 (https://phabricator.wikimedia.org/T361576) (owner: 10Mvolz) [10:39:35] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . 
[10:40:18] (03PS1) 10Marostegui: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152037 [10:40:18] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152032|db-production.php: Disable writes on es7]] (duration: 09m 57s) [10:40:23] (03CR) 10Marostegui: [C:04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152037 (owner: 10Marostegui) [10:40:53] !log restarting ircecho on alert1002 [10:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:20] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:43:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): hw troubleshooting: hard drive for an-worker1119 - https://phabricator.wikimedia.org/T395549#10866964 (10Stevemunene) [10:44:08] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): hw troubleshooting: hard drive and MegaRaid for an-worker1135 - https://phabricator.wikimedia.org/T395552 (10Stevemunene) 03NEW [10:48:55] (03CR) 10Marostegui: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152037 (owner: 10Marostegui) [10:49:04] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2039 gradually with 4 steps - Ready [10:49:33] !log fceratto@cumin1002 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) es2039 gradually with 4 steps - Ready [10:50:00] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es2039 gradually with 4 steps - Ready [10:50:44] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:50:45] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229#10866999 (10BTullis) a:05Jclark-ctr→03BTullis I had to reimage this host, because the failed drive was `/dev/sda` which was the only one with grub installed. It'... [10:51:11] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: Replace failed disk in an-druid1003 - https://phabricator.wikimedia.org/T395450#10867013 (10BTullis) 05Open→03Resolved Thanks very much @Jclark-ctr - That's all fixed now. [10:51:32] (03PS1) 10Btullis: Revert "Allow an-druid1003 to reformat its data drives" [puppet] - 10https://gerrit.wikimedia.org/r/1152038 [10:51:51] (03CR) 10Marostegui: [C:03+2] Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152037 (owner: 10Marostegui) [10:52:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by marostegui@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152037 (owner: 10Marostegui) [10:52:34] (03Merged) 10jenkins-bot: Revert "db-production.php: Disable writes on es7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152037 (owner: 10Marostegui) [10:52:55] !log marostegui@deploy1003 Started scap sync-world: Backport for [[gerrit:1152037|Revert "db-production.php: Disable writes on es7"]] [10:55:07] !log marostegui@deploy1003 marostegui: Backport for [[gerrit:1152037|Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[10:55:44] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:55:57] !log marostegui@deploy1003 marostegui: Continuing with sync [10:56:14] (03CR) 10Btullis: [C:03+2] Revert "Allow an-druid1003 to reformat its data drives" [puppet] - 10https://gerrit.wikimedia.org/r/1152038 (owner: 10Btullis) [10:58:29] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229#10867022 (10BTullis) 05Open→03Resolved [11:01:49] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for es2039.codfw.wmnet [11:01:50] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2039.codfw.wmnet [11:02:54] !log marostegui@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152037|Revert "db-production.php: Disable writes on es7"]] (duration: 09m 58s) [11:07:19] (03PS3) 10Effie Mouzeli: mw-experimental: create new service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) [11:07:20] (03PS13) 10Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) [11:12:27] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: stop using the 'section' macro in jinja templates - https://phabricator.wikimedia.org/T395555 (10cmooney) 03NEW p:05Triage→03Low [11:15:25] FIRING: [2x] ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - 
https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:16:26] (03PS3) 10Effie Mouzeli: hieradata: Make wikikube-worker2300 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [11:17:03] (03PS4) 10Effie Mouzeli: hieradata: Make wikikube-worker2300 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [11:22:55] 06SRE, 06Infrastructure-Foundations, 10netops: Homer: stop using the 'section' macro in jinja templates - https://phabricator.wikimedia.org/T395555#10867084 (10cmooney) [11:23:04] (03Abandoned) 10Tchanders: Temp accounts: Remove temporary-account-viewer from labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151220 (https://phabricator.wikimedia.org/T390942) (owner: 10Tchanders) [11:26:07] !log kcvelaga@deploy1003 Started deploy [airflow-dags/analytics_product@6b26a59]: T393560 [11:26:12] T393560: Data pipeline to load cx_corpora to Data Lake, at wmf_product - https://phabricator.wikimedia.org/T393560 [11:27:17] !log kcvelaga@deploy1003 Finished deploy [airflow-dags/analytics_product@6b26a59]: T393560 (duration: 01m 12s) [11:30:25] FIRING: [2x] ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [11:35:25] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2039 gradually with 4 steps - Ready [11:37:54] (03CR) 10Marostegui: [C:03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/1152030 (owner: 10Marostegui) [11:37:58] !log marostegui@dns1006 START - running authdns-update [11:38:21] !log Failover m3-master T395241 [11:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:43] !log marostegui@dns1006 END - running authdns-update [11:39:11] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps: stop installing openstack package 
osbpos on VMs [puppet] - 10https://gerrit.wikimedia.org/r/1147166 (https://phabricator.wikimedia.org/T394438) (owner: 10Andrew Bogott) [11:40:05] (03PS14) 10Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) [11:41:48] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:42:04] (03PS5) 10Effie Mouzeli: hieradata: Make wikikube-worker2300 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [11:42:07] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:42:34] (03PS6) 10Effie Mouzeli: hieradata: Make wikikube-worker2300 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [11:42:55] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:45:14] (03PS15) 10Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) [11:45:21] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:47:38] (03PS1) 10Btullis: Upgrade an-redacteddb1001 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1152042 (https://phabricator.wikimedia.org/T394930) [11:48:04] (03PS16) 10Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) [11:48:06] (03CR) 10Marostegui: 
[C:03+1] Upgrade an-redacteddb1001 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1152042 (https://phabricator.wikimedia.org/T394930) (owner: 10Btullis) [11:48:26] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5717/co" [puppet] - 10https://gerrit.wikimedia.org/r/1152042 (https://phabricator.wikimedia.org/T394930) (owner: 10Btullis) [11:48:31] (03PS1) 10Dr0ptp4kt: xLab (FKA MPIC) v0.6.2 bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152044 [11:49:30] (03PS1) 10Marostegui: wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/1152045 [11:49:50] (03CR) 10Btullis: [V:03+1 C:03+2] Upgrade an-redacteddb1001 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1152042 (https://phabricator.wikimedia.org/T394930) (owner: 10Btullis) [11:50:03] (03CR) 10Marostegui: [C:03+2] wmnet: Failover m5-master [dns] - 10https://gerrit.wikimedia.org/r/1152045 (owner: 10Marostegui) [11:50:06] !log marostegui@dns1006 START - running authdns-update [11:50:11] (03CR) 10Santiago Faci: [C:03+2] "looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1152044 (owner: 10Dr0ptp4kt) [11:50:21] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:50:28] !log Failover m5-master T395241 [11:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:45] !log btullis@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on an-redacteddb1001.eqiad.wmnet with reason: Upgrading MariaDB to 10.11 [11:50:53] !log marostegui@dns1006 END - running authdns-update [11:51:32] (03Merged) 10jenkins-bot: xLab (FKA MPIC) v0.6.2 bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152044 (owner: 10Dr0ptp4kt) [11:52:05] !log T395546: creating empty archive indices in eqiad for bat_smgwiki bmwiktionary crwiktionary cswikinews gorwiktionary hewiktionary iewikibooks kgwiki kowikibooks lldwiki niawiki nowikimedia plwikibooks quwikibooks sahwikisource sswiki vecwikisource wikimania2014wiki [11:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:10] T395546: opensearch psi and omega clusters red in eqiad - https://phabricator.wikimedia.org/T395546 [11:53:19] (03PS1) 10Marostegui: site.pp: Remove codfw sanitarium masters [puppet] - 10https://gerrit.wikimedia.org/r/1152046 (https://phabricator.wikimedia.org/T394884) [11:53:32] (03PS7) 10Effie Mouzeli: hieradata: Make wikikube-worker2300 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [11:55:07] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [11:55:32] !log dr0ptp4kt@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [11:56:01] !log dr0ptp4kt@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/mpic-next: apply [11:57:44] (03PS1) 10Dreamy Jazz: Unset IP reveal rights on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152047 (https://phabricator.wikimedia.org/T395560) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1200) [12:01:40] (03PS1) 10Dr0ptp4kt: xLab (FKA MPIC) v0.6.2 bump PROD [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152048 [12:03:32] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:03:33] (03PS8) 10Effie Mouzeli: hieradata: Make wikikube-worker2300 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [12:03:59] (03PS1) 10Marostegui: osc_host.sh: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1152049 [12:04:28] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:04:36] (03CR) 10Marostegui: [C:03+2] osc_host.sh: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1152049 (owner: 10Marostegui) [12:05:04] (03Merged) 10jenkins-bot: osc_host.sh: Remove from repo [software] - 10https://gerrit.wikimedia.org/r/1152049 (owner: 10Marostegui) [12:05:14] (03CR) 10Santiago Faci: [C:03+2] xLab (FKA MPIC) v0.6.2 bump PROD [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152048 (owner: 10Dr0ptp4kt) [12:05:28] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [12:05:55] (03PS9) 10Effie Mouzeli: hieradata: Make wikikube-worker2100 a mw-experimental worker [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) [12:06:15] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
[12:06:21] (03CR) 10Federico Ceratto: site.pp: Remove codfw sanitarium masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152046 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [12:06:38] (03Merged) 10jenkins-bot: xLab (FKA MPIC) v0.6.2 bump PROD [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152048 (owner: 10Dr0ptp4kt) [12:07:06] (03CR) 10Marostegui: site.pp: Remove codfw sanitarium masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152046 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [12:07:28] (03PS2) 10Marostegui: site.pp: Remove codfw sanitarium masters [puppet] - 10https://gerrit.wikimedia.org/r/1152046 (https://phabricator.wikimedia.org/T394884) [12:07:35] (03CR) 10Marostegui: site.pp: Remove codfw sanitarium masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1152046 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [12:08:40] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . [12:09:28] !log T395546: populating archive indices in eqiad for bat_smgwiki bmwiktionary crwiktionary cswikinews gorwiktionary hewiktionary iewikibooks kgwiki kowikibooks lldwiki niawiki nowikimedia plwikibooks quwikibooks sahwikisource sswiki vecwikisource wikimania2014wiki [12:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:33] T395546: opensearch psi and omega clusters red in eqiad - https://phabricator.wikimedia.org/T395546 [12:10:47] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152005 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [12:11:33] !log dr0ptp4kt@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [12:12:40] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:13:07] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [12:13:10] !log dr0ptp4kt@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [12:14:23] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [12:14:28] jouncebot: nowandnext [12:14:28] For the next 0 hour(s) and 45 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1200) [12:14:28] In 0 hour(s) and 45 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1300) [12:14:45] Want to merge a beta only config change [12:14:48] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:14:52] Which should be fine AFAICS [12:14:52] !log btullis@cumin1002 START - Cookbook sre.hosts.remove-downtime for an-redacteddb1001.eqiad.wmnet [12:14:52] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-redacteddb1001.eqiad.wmnet [12:15:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152047 (https://phabricator.wikimedia.org/T395560) (owner: 10Dreamy Jazz) [12:15:10] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [12:15:19] !log Deploy schema change on s6 eqiad dbmaint with replication T395335 [12:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:23] T395335: Make gbw_id in global_block_whitelist table unsigned on WMF wikis - https://phabricator.wikimedia.org/T395335 [12:15:24] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[12:15:34] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:15:49] (03Merged) 10jenkins-bot: Unset IP reveal rights on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152047 (https://phabricator.wikimedia.org/T395560) (owner: 10Dreamy Jazz) [12:16:18] !log Deploy schema change on s2 eqiad dbmaint with replication T395335 [12:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:33] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:17:32] !log elukey@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:18:38] !log Deploy schema change on s5 eqiad dbmaint with replication T395335 [12:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:20:21] (03CR) 10Giuseppe Lavagetto: "Overall seems fine - I haven't looked into the details of the bash script but that will be tested at a later time I guess." 
[puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [12:22:05] !log Deploy schema change on s8 eqiad dbmaint with replication T395335 [12:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:10] T395335: Make gbw_id in global_block_whitelist table unsigned on WMF wikis - https://phabricator.wikimedia.org/T395335 [12:22:48] !log Deploy schema change on s4 eqiad dbmaint with replication T395335 [12:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:23] !log Deploy schema change on s1 eqiad dbmaint with replication T395335 [12:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:37] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [12:24:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:24:56] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [12:25:11] (03PS1) 10Dreamy Jazz: Unset 'checkuser' group on the beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152055 (https://phabricator.wikimedia.org/T395560) [12:25:17] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [12:25:35] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . 
[12:26:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152055 (https://phabricator.wikimedia.org/T395560) (owner: 10Dreamy Jazz) [12:26:05] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [12:26:25] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [12:26:44] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [12:26:46] (03Merged) 10jenkins-bot: Unset 'checkuser' group on the beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152055 (https://phabricator.wikimedia.org/T395560) (owner: 10Dreamy Jazz) [12:27:03] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [12:27:21] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [12:27:43] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [12:28:04] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [12:28:27] (03PS8) 10Elukey: admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) [12:28:40] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [12:28:49] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:29:13] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[12:29:32] (03CR) 10Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [12:29:35] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:29:54] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:30:59] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [12:32:12] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:33:07] (03PS1) 10Cathal Mooney: Add additional sretest2xxx servers to pattern in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1152057 (https://phabricator.wikimedia.org/T387504) [12:33:27] !log T395546 restoring bowikibooks_content in psi@eqiad from psi@codfw [12:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:31] T395546: opensearch psi and omega clusters red in eqiad - https://phabricator.wikimedia.org/T395546 [12:34:31] cmooney@cumin1002 netbox (PID 588605) is awaiting input [12:34:38] !log T395546 restoring xhwikibooks_general in psi@eqiad from psi@codfw [12:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:28] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1152057 (https://phabricator.wikimedia.org/T387504) (owner: 10Cathal Mooney) [12:35:38] (03CR) 10Elukey: [C:03+2] admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [12:35:58] (03CR) 10Cathal Mooney: [C:03+2] Add additional sretest2xxx servers to pattern in 
site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1152057 (https://phabricator.wikimedia.org/T387504) (owner: 10Cathal Mooney) [12:36:32] !log T395546 restoring collabwiki iewikibooks cywikiquote fawikibooks svwikiquote biwiktionary afwiktionary kywiktionary trwikisource mnwiktionary wikimania2007wiki sgwiki content indices in psi@eqiad from psi@codfw [12:36:36] (03PS17) 10Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) [12:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:37] !log Deploy schema change on s7 eqiad dbmaint with replication T395335 [12:36:40] !log elukey@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:41] T395335: Make gbw_id in global_block_whitelist table unsigned on WMF wikis - https://phabricator.wikimedia.org/T395335 [12:36:49] !log elukey@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[12:37:03] (03CR) 10Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [12:37:06] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002" [12:37:56] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dns entries for cagefive2001 test server - cmooney@cumin1002" [12:37:56] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:38:20] !log Deploy schema change on s3 eqiad dbmaint with replication T395335 [12:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:39] !log T395546 (errata from previous log: s/psi/omega/) restoring collabwiki iewikibooks cywikiquote fawikibooks svwikiquote biwiktionary afwiktionary kywiktionary trwikisource mnwiktionary wikimania2007wiki sgwiki content indices in omega@eqiad from omega@codfw [12:38:43] (03PS1) 10Btullis: dumps: Bump toolbox mediawiki image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152058 (https://phabricator.wikimedia.org/T394389) [12:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:44] T395546: opensearch psi and omega clusters red in eqiad - https://phabricator.wikimedia.org/T395546 [12:41:56] (03CR) 10Btullis: [C:03+2] dumps: Bump toolbox mediawiki image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152058 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [12:42:33] (03PS1) 10AOkoth: Merge branch 'production' into change-1145333 [puppet] - 10https://gerrit.wikimedia.org/r/1152059 [12:42:52] (03CR) 10CI reject: [V:04-1] Merge branch 'production' into change-1145333 [puppet] - 
10https://gerrit.wikimedia.org/r/1152059 (owner: 10AOkoth) [12:43:21] (03Merged) 10jenkins-bot: dumps: Bump toolbox mediawiki image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1152058 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [12:43:26] (03Abandoned) 10AOkoth: Merge branch 'production' into change-1145333 [puppet] - 10https://gerrit.wikimedia.org/r/1152059 (owner: 10AOkoth) [12:45:38] !log T395546: restoring pnbwiktionary avkwiki ladwiki ptwikiquote cswikiversity azwikibooks wikimania2018wiki liwiktionary nnwiktionary vowikibooks bmwikiquote nowikibooks zhwikiversity akwikibooks general indices in omega@eqiad from omega@codfw [12:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:42] T395546: opensearch psi and omega clusters red in eqiad - https://phabricator.wikimedia.org/T395546 [12:49:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:50:25] FIRING: [2x] ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:51:03] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [12:51:05] ACKNOWLEDGEMENT - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.7% Jcrespo https://phabricator.wikimedia.org/T395294 - The acknowledgement expires at: 2025-06-02 12:47:19. 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:51:18] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [12:51:42] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [12:51:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [12:52:03] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [12:52:05] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [12:56:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:57:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10867443 (10Jclark-ctr) @VRiley-WMF an-worker1185 where is this server netbox list F2 U7. no server is in that physical location an-worker1186... [12:59:09] (03CR) 10Tchanders: "These have been made" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1300). [13:00:05] DreamRimmer: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:01:18] i can deploy [13:01:26] although DreamRimmer doesn't seem to be around [13:04:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:05:51] (03CR) 10Dreamy Jazz: [C:03+1] Set $wgCentralAuthAutomaticGlobalGroups for global IP reveal group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [13:10:48] (03CR) 10Tchanders: "Not a hard blocker, but might be nice to deploy this once the UI improvement lands: I92c41e5fe255b6c3fd775ec1d39c0289dc556dda" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127960 (https://phabricator.wikimedia.org/T376315) (owner: 10Tchanders) [13:11:02] taavi: could you help me with a late entry to the backport window? [13:11:17] Jdlrobson: sure, please add it to the calendar [13:11:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151813 (https://phabricator.wikimedia.org/T393943) (owner: 10Jdlrobson) [13:12:07] (03PS2) 10Jdlrobson: Fixes issues with recommendations config in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151813 (https://phabricator.wikimedia.org/T393943) [13:12:25] taavi: thanks! 
[13:12:31] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 144 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:13:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151813 (https://phabricator.wikimedia.org/T393943) (owner: 10Jdlrobson) [13:14:23] (03Merged) 10jenkins-bot: Fixes issues with recommendations config in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151813 (https://phabricator.wikimedia.org/T393943) (owner: 10Jdlrobson) [13:14:43] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1151813|Fixes issues with recommendations config in production (T393943)]] [13:14:48] T393943: Deploy Vector empty search recommendations to pilot wikis - https://phabricator.wikimedia.org/T393943 [13:16:58] !log taavi@deploy1003 jdlrobson, taavi: Backport for [[gerrit:1151813|Fixes issues with recommendations config in production (T393943)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:17:02] Jdlrobson: please test [13:17:07] on it [13:19:22] taavi: yep that's working! please sync! [13:19:45] !log taavi@deploy1003 jdlrobson, taavi: Continuing with sync [13:20:10] Am I late? 
[13:20:25] FIRING: [2x] ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [13:21:17] DreamRimmer: the window started 20 minutes ago, but we can still fit your patch in once the current one finishes syncing [13:21:33] jclark@cumin1002 provision (PID 616542) is awaiting input [13:22:00] thanks :) [13:26:53] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151813|Fixes issues with recommendations config in production (T393943)]] (duration: 12m 09s) [13:26:58] T393943: Deploy Vector empty search recommendations to pilot wikis - https://phabricator.wikimedia.org/T393943 [13:27:08] DreamRimmer: yours is next [13:27:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148299 (https://phabricator.wikimedia.org/T394752) (owner: 10SimmeD) [13:28:18] DreamRimmer: do you have the XWikimediaDebug browser plugin installed already? [13:28:34] yes [13:28:35] (03Merged) 10jenkins-bot: Allow itwiki bureaucrat to remove sysop permission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148299 (https://phabricator.wikimedia.org/T394752) (owner: 10SimmeD) [13:28:55] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1148299|Allow itwiki bureaucrat to remove sysop permission (T394752)]] [13:29:00] T394752: Allow itwiki bureaucrat to remove sysop permission - https://phabricator.wikimedia.org/T394752 [13:29:53] thanks taavi ! change is looking good! [13:31:07] !log taavi@deploy1003 simmed, taavi: Backport for [[gerrit:1148299|Allow itwiki bureaucrat to remove sysop permission (T394752)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:31:21] checking [13:32:06] looks good [13:32:13] !log taavi@deploy1003 simmed, taavi: Continuing with sync [13:34:10] (03PS1) 10Andrew Bogott: Keystone: hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152063 (https://phabricator.wikimedia.org/T395542) [13:34:38] (03CR) 10CI reject: [V:04-1] Keystone: hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152063 (https://phabricator.wikimedia.org/T395542) (owner: 10Andrew Bogott) [13:38:27] (03PS2) 10Andrew Bogott: Keystone: hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152063 (https://phabricator.wikimedia.org/T395542) [13:39:08] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148299|Allow itwiki bureaucrat to remove sysop permission (T394752)]] (duration: 10m 13s) [13:39:13] done [13:39:13] T394752: Allow itwiki bureaucrat to remove sysop permission - https://phabricator.wikimedia.org/T394752 [13:39:24] (03PS1) 10D3r1ck01: SUL3: Enable client hints data on the auth shared domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) [13:39:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy[1022,1025,1028-1029].eqiad.wmnet with reason: Maintenance [13:39:53] !log Reboot dbproxy[1022,1025,1028-1029].eqiad.wmnet T395241 [13:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:20] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152063 (https://phabricator.wikimedia.org/T395542) (owner: 10Andrew Bogott) [13:41:24] thanks taavi :) [13:41:34] (03CR) 10Elukey: [C:03+1] "Little nit - maybe let's specify/mention that we are doing it with/for ruff?" 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152007 (owner: 10Volans) [13:45:13] (03CR) 10Elukey: [C:03+1] Automatic reformat: move to double quote strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152008 (owner: 10Volans) [13:45:46] (03CR) 10Elukey: [C:03+1] doc: small improvements in the config file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152009 (owner: 10Volans) [13:46:20] (03CR) 10Elukey: [C:03+1] tests: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152010 (owner: 10Volans) [13:47:08] (03CR) 10Elukey: [C:03+1] wmflib: small simplification [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152011 (owner: 10Volans) [13:47:29] RECOVERY - MegaRAID on an-worker1135 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:47:29] (03CR) 10Federico Ceratto: [C:03+1] "I see each host removed from sanitarium in codfw being moved into a section to be used there." 
[puppet] - 10https://gerrit.wikimedia.org/r/1152046 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [13:47:40] (03CR) 10Elukey: [C:03+1] dns: alias DnsNotFound to DnsNotFoundError [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152012 (owner: 10Volans) [13:47:58] (03CR) 10Elukey: [C:03+1] config: make the raises argument keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152013 (owner: 10Volans) [13:48:09] (03CR) 10Elukey: [C:03+1] phabricator: make secondary arguments keyword only [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152014 (owner: 10Volans) [13:48:23] (03PS3) 10Andrew Bogott: Keystone: hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152063 (https://phabricator.wikimedia.org/T395542) [13:49:17] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152063 (https://phabricator.wikimedia.org/T395542) (owner: 10Andrew Bogott) [13:52:11] (03CR) 10Andrew Bogott: [C:03+2] Keystone: hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152063 (https://phabricator.wikimedia.org/T395542) (owner: 10Andrew Bogott) [13:54:03] (03CR) 10Marostegui: [C:03+2] site.pp: Remove codfw sanitarium masters [puppet] - 10https://gerrit.wikimedia.org/r/1152046 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [14:00:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:02:37] (03PS1) 10Marostegui: db2211,db2228: Make them SBR [puppet] - 10https://gerrit.wikimedia.org/r/1152070 (https://phabricator.wikimedia.org/T383795) [14:04:02] (03CR) 10Marostegui: [C:03+2] db2211,db2228: Make them SBR [puppet] - 10https://gerrit.wikimedia.org/r/1152070 (https://phabricator.wikimedia.org/T383795) (owner: 10Marostegui) 
[14:06:45] 07Puppet, 06DBA, 13Patch-For-Review: labtestpuppetmaster2001 is failing to backup - https://phabricator.wikimedia.org/T256846#10867604 (10jcrespo) @Andrew should we revert now https://gerrit.wikimedia.org/r/c/operations/puppet/+/612167/6/modules/profile/files/backup/job_monitoring_ignorelist ? [14:08:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:08:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T395241)', diff saved to https://phabricator.wikimedia.org/P76670 and previous config saved to /var/cache/conftool/dbconfig/20250529-140811-fceratto.json [14:13:21] (03PS1) 10Marostegui: sanitarium_multiinstance.pp: Fix comments [puppet] - 10https://gerrit.wikimedia.org/r/1152071 [14:13:49] (03CR) 10Marostegui: [C:03+2] sanitarium_multiinstance.pp: Fix comments [puppet] - 10https://gerrit.wikimedia.org/r/1152071 (owner: 10Marostegui) [14:15:29] (03PS2) 10Santiago Faci: xLab: Reduce staging/production logging level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151232 (https://phabricator.wikimedia.org/T394425) [14:17:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T395241)', diff saved to https://phabricator.wikimedia.org/P76671 and previous config saved to /var/cache/conftool/dbconfig/20250529-141744-fceratto.json [14:29:19] (03CR) 10Elukey: [C:03+1] Automatic reformat: reorder imports [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152015 (owner: 10Volans) [14:31:14] !log T395546: rebuilding eqiad completion indices for cswikiversity pnbwiktionary avkwiki mlwikiquote ladwiki ptwikiquote cswikiversity olowiki azwikibooks wikimania2018wiki rowikibooks rswikimedia liwiktionary oswiki nnwiktionary trwikisource vowikibooks ndswiktionary bmwikiquote ilowiki pawiktionary nowikibooks zhwikiversity tawikiquote hawiktionary akwikibooks udmwiki xhwikibooks [14:31:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:19] T395546: opensearch psi and omega clusters red in eqiad - https://phabricator.wikimedia.org/T395546 [14:32:52] (03CR) 10Elukey: [C:03+1] tox: completely refactor static checkers/linters [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1152016 (owner: 10Volans) [14:32:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P76672 and previous config saved to /var/cache/conftool/dbconfig/20250529-143251-fceratto.json [14:36:24] (03PS1) 10Andrew Bogott: Keystone: update hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152072 (https://phabricator.wikimedia.org/T395542) [14:36:51] (03CR) 10CI reject: [V:04-1] Keystone: update hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152072 (https://phabricator.wikimedia.org/T395542) (owner: 10Andrew Bogott) [14:37:04] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152072 (https://phabricator.wikimedia.org/T395542) (owner: 10Andrew Bogott) [14:38:30] (03PS2) 10Andrew Bogott: Keystone: update hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152072 (https://phabricator.wikimedia.org/T395542) [14:39:50] (03PS3) 10Andrew Bogott: Keystone: update hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152072 (https://phabricator.wikimedia.org/T395542) [14:39:53] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152072 (https://phabricator.wikimedia.org/T395542) (owner: 10Andrew Bogott) [14:42:15] (03CR) 10Andrew Bogott: [C:03+2] Keystone: update hack out user ID validation [puppet] - 10https://gerrit.wikimedia.org/r/1152072 (https://phabricator.wikimedia.org/T395542) (owner: 10Andrew Bogott) [14:47:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2007.mgmt.codfw.wmnet with chassis set 
policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:47:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P76673 and previous config saved to /var/cache/conftool/dbconfig/20250529-144759-fceratto.json [14:48:22] (03PS1) 10Andrew Bogott: Keystone: fix patch filenames [puppet] - 10https://gerrit.wikimedia.org/r/1152076 (https://phabricator.wikimedia.org/T395542) [14:48:29] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:49:20] (03CR) 10Andrew Bogott: [C:03+2] Keystone: fix patch filenames [puppet] - 10https://gerrit.wikimedia.org/r/1152076 (https://phabricator.wikimedia.org/T395542) (owner: 10Andrew Bogott) [14:51:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10867849 (10cmooney) @VRiley-WMF no problem thanks! I'm actually not sure if step 5 is needed (if it is we will also need the additional ports on lvs1016 connected as per the table in T3... 
[14:52:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:53:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:54:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:13] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10867854 (10cmooney) Ok step 5 is not needed, lvs1016 will only require its primary interface connected. I've configured the switch port it's connected on so it should be good to go. B... [15:00:04] dancy and andre: Time to do the Train log triage deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1500). [15:00:20] o/ I'll be in the meeting which starts in 5 minutes. [15:02:10] 07sre-alert-triage, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Alert in need of triage: MegaRAID (instance an-worker1135) - https://phabricator.wikimedia.org/T394632#10867868 (10Stevemunene) 05Open→03Resolved This has been resolved and the host is OK. 
[15:03:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T395241)', diff saved to https://phabricator.wikimedia.org/P76674 and previous config saved to /var/cache/conftool/dbconfig/20250529-150306-fceratto.json [15:03:09] (CR) Elukey: [V:+1 C:+2] role::puppetdb: increase WAL kept segments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1108768 (https://phabricator.wikimedia.org/T383114) (owner: Elukey) [15:03:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:03:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T395241)', diff saved to https://phabricator.wikimedia.org/P76675 and previous config saved to /var/cache/conftool/dbconfig/20250529-150332-fceratto.json [15:09:54] (PS1) Cwhite: logstash: drop request.query field [puppet] - https://gerrit.wikimedia.org/r/1152079 (https://phabricator.wikimedia.org/T390215) [15:10:12] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:58] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): hw troubleshooting: hard drive and MegaRaid for an-worker1135 - https://phabricator.wikimedia.org/T395552#10867888 (Stevemunene) Open→Resolved The failed hard drive was swapped and is back online so we can close this...
[15:11:31] (CR) Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [15:11:36] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): hw troubleshooting: hard drive for an-worker1119 - https://phabricator.wikimedia.org/T395549#10867890 (Stevemunene) Open→Resolved The failed hard drive was swapped and is back online so we can close this, thanks @Jcl... [15:11:45] (PS2) BCornwall: admin: Add skivlehan to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1145333 (https://phabricator.wikimedia.org/T393626) [15:12:08] (CR) CI reject: [V:-1] logstash: drop request.query field [puppet] - https://gerrit.wikimedia.org/r/1152079 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite) [15:12:41] (PS2) Cwhite: logstash: drop request.query field [puppet] - https://gerrit.wikimedia.org/r/1152079 (https://phabricator.wikimedia.org/T390215) [15:12:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T395241)', diff saved to https://phabricator.wikimedia.org/P76676 and previous config saved to /var/cache/conftool/dbconfig/20250529-151255-fceratto.json [15:15:12] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:15:13] (CR) BCornwall: [C:+2] admin: Add skivlehan to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1145333 (https://phabricator.wikimedia.org/T393626) (owner: BCornwall) [15:15:27] (CR) Cwhite: [C:+2] logstash: drop request.query field [puppet] - https://gerrit.wikimedia.org/r/1152079 (https://phabricator.wikimedia.org/T390215) (owner:
Cwhite) [15:15:34] (PS1) C. Scott Ananian: Campaign: Ensure `
` wrapper is removed [extensions/UploadWizard] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152080 (https://phabricator.wikimedia.org/T395023) [15:16:03] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/UploadWizard] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152080 (https://phabricator.wikimedia.org/T395023) (owner: C. Scott Ananian) [15:16:46] I found a spiderpig bug :) [15:17:12] The title of the gerrit patch needs to be HTML-escaped https://usercontent.irccloud-cdn.com/file/bJCCGOj1/image.png [15:18:10] (PS2) Giuseppe Lavagetto: cache::haproxy: remove unused if stanzas [puppet] - https://gerrit.wikimedia.org/r/1152028 [15:18:11] (PS3) Giuseppe Lavagetto: cache::haproxy: start optimizing for readability [puppet] - https://gerrit.wikimedia.org/r/1152029 [15:18:11] (PS1) Giuseppe Lavagetto: cp2027: remove experimental connection-rate limiting [puppet] - https://gerrit.wikimedia.org/r/1152081 [15:18:11] (PS1) Giuseppe Lavagetto: cache::haproxy: remove generic ring definition [puppet] - https://gerrit.wikimedia.org/r/1152082 [15:18:12] (PS1) Giuseppe Lavagetto: cache::haproxy: remove unused variables from configuration [puppet] - https://gerrit.wikimedia.org/r/1152083 [15:25:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:26:13] (CR) BCornwall: [C:+1] wmnet: Update es7-master alias [dns] - https://gerrit.wikimedia.org/r/1152023 (https://phabricator.wikimedia.org/T395544) (owner: Gerrit maintenance bot) [15:28:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P76677 and previous config saved to /var/cache/conftool/dbconfig/20250529-152802-fceratto.json
[15:32:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:34:02] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2007.codfw.wmnet with OS bookworm [15:34:10] SRE, Infrastructure-Foundations, netops: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10868004 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2007.codfw.wmnet with OS bookworm [15:37:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1151236 (https://phabricator.wikimedia.org/T341859) (owner: Krinkle) [15:41:12] (CR) CDanis: [C:+1] cp2027: remove experimental connection-rate limiting [puppet] - https://gerrit.wikimedia.org/r/1152081 (owner: Giuseppe Lavagetto) [15:41:34] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1152028 (owner: Giuseppe Lavagetto) [15:43:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P76678 and previous config saved to /var/cache/conftool/dbconfig/20250529-154309-fceratto.json [15:43:43] (CR) CDanis: "This causes a breaking diff it seems?"
[puppet] - https://gerrit.wikimedia.org/r/1152029 (owner: Giuseppe Lavagetto) [15:44:06] if there's no objection i'm gonna run a few config & patch backports for charts [15:44:25] (CR) CDanis: [C:+1] cache::haproxy: remove generic ring definition [puppet] - https://gerrit.wikimedia.org/r/1152082 (owner: Giuseppe Lavagetto) [15:44:28] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1151750 (https://phabricator.wikimedia.org/T393788) (owner: Jforrester) [15:45:17] (Merged) jenkins-bot: Enable Chart for Phase 4 wikis (all remaining public wikis) [mediawiki-config] - https://gerrit.wikimedia.org/r/1151750 (https://phabricator.wikimedia.org/T393788) (owner: Jforrester) [15:45:36] (CR) CDanis: [C:-1] "from https://puppet-compiler.wmflabs.org/output/1152029/5716/cp4050.ulsfo.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1152029 (owner: Giuseppe Lavagetto) [15:45:40] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1151750|Enable Chart for Phase 4 wikis (all remaining public wikis) (T393788)]] [15:45:44] T393788: Enable Charts for Phase 4 wikis (all remaining wikis) - https://phabricator.wikimedia.org/T393788 [15:45:50] bvibber: 🎉 [15:45:56] :) [15:47:51] !log bvibber@deploy1003 jforrester, bvibber: Backport for [[gerrit:1151750|Enable Chart for Phase 4 wikis (all remaining public wikis) (T393788)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:48:06] (PS1) Cwhite: logstash: remove mw-script large fields [puppet] - https://gerrit.wikimedia.org/r/1152088 (https://phabricator.wikimedia.org/T390215) [15:48:59] !log bvibber@deploy1003 jforrester, bvibber: Continuing with sync [15:52:53] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2007.codfw.wmnet with reason: host reimage [15:53:13] (CR) Cwhite: [C:+2] logstash: remove mw-script large fields [puppet] - https://gerrit.wikimedia.org/r/1152088 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite) [15:55:25] (CR) Volans: "@fceratto@wikimedia.org find reminder." [cookbooks] - https://gerrit.wikimedia.org/r/1136843 (owner: Volans) [15:56:00] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151750|Enable Chart for Phase 4 wikis (all remaining public wikis) (T393788)]] (duration: 10m 20s) [15:56:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2007.codfw.wmnet with reason: host reimage [15:56:04] (CR) Scott French: "Thanks, Effie! I've not had a chance to fully review the script (just given it a quick read through)."
[puppet] - https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [15:56:07] T393788: Enable Charts for Phase 4 wikis (all remaining wikis) - https://phabricator.wikimedia.org/T393788 [15:57:10] now to backport the experimental feature and enable it on beta/test [15:58:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T395241)', diff saved to https://phabricator.wikimedia.org/P76679 and previous config saved to /var/cache/conftool/dbconfig/20250529-155817-fceratto.json [15:58:36] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:58:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T395241)', diff saved to https://phabricator.wikimedia.org/P76680 and previous config saved to /var/cache/conftool/dbconfig/20250529-155843-fceratto.json [15:58:52] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1151787 (https://phabricator.wikimedia.org/T388434) (owner: Bvibber) [15:58:52] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1151788 (https://phabricator.wikimedia.org/T388616) (owner: Bvibber) [16:00:04] jhathaway and moritzm: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS.
[16:03:37] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1151148 (https://phabricator.wikimedia.org/T393918) (owner: Phuedx) [16:05:25] RESOLVED: ErrorBudgetBurn: search-update-lag eqiad - https://slo.wikimedia.org/?search=search-update-lag - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [16:07:08] (PS1) Cwhite: logstash: mw-script remove text field [puppet] - https://gerrit.wikimedia.org/r/1152093 (https://phabricator.wikimedia.org/T390215) [16:07:21] (CR) CI reject: [V:-1] logstash: mw-script remove text field [puppet] - https://gerrit.wikimedia.org/r/1152093 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite) [16:07:54] (PS2) Cwhite: logstash: mw-script remove text field [puppet] - https://gerrit.wikimedia.org/r/1152093 (https://phabricator.wikimedia.org/T390215) [16:07:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T395241)', diff saved to https://phabricator.wikimedia.org/P76681 and previous config saved to /var/cache/conftool/dbconfig/20250529-160758-fceratto.json [16:08:41] (Merged) jenkins-bot: Lua transform backend for JsonConfig Data: pages [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1151787 (https://phabricator.wikimedia.org/T388434) (owner: Bvibber) [16:08:42] (Merged) jenkins-bot: Chart-side support for Lua transforms [extensions/Chart] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1151788 (https://phabricator.wikimedia.org/T388616) (owner: Bvibber) [16:09:05] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1151787|Lua transform backend for JsonConfig Data: pages (T388434)]], [[gerrit:1151788|Chart-side support for Lua transforms (T388616)]] [16:09:11] T388434: JsonConfig remote-data-with-Lua-transform API query -
https://phabricator.wikimedia.org/T388434 [16:09:11] T388616: Expose Data: Lua filter interface to Charts via the .chart format setup - https://phabricator.wikimedia.org/T388616 [16:12:06] (CR) Cwhite: [C:+2] logstash: mw-script remove text field [puppet] - https://gerrit.wikimedia.org/r/1152093 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite) [16:15:14] SRE-tools, DC-Ops, Infrastructure-Foundations, Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10868132 (Volans) @bking as agreed on IRC let me know when another 1~2 hosts are ready for testing so we can complete the change for th... [16:16:45] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:17:09] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1002" [16:17:28] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch2112*,cirrussearch2113* for T394543 - bking@cumin2002 [16:17:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmooney@cumin1002" [16:17:29] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2007.codfw.wmnet with OS bookworm [16:17:30] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch2112*,cirrussearch2113* for T394543 - bking@cumin2002 [16:17:36] T394543: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543 [16:17:40] SRE, Infrastructure-Foundations, netops: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10868139 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host
sretest2007.codfw.wmnet with OS bookworm completed:... [16:18:01] (PS1) Cwhite: logstash: remove heading field [puppet] - https://gerrit.wikimedia.org/r/1152095 (https://phabricator.wikimedia.org/T390215) [16:18:17] (PS2) Cwhite: logstash: mw-script remove heading field [puppet] - https://gerrit.wikimedia.org/r/1152095 (https://phabricator.wikimedia.org/T390215) [16:19:05] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:19:23] (PS3) Cwhite: logstash: mw-script remove heading field [puppet] - https://gerrit.wikimedia.org/r/1152095 (https://phabricator.wikimedia.org/T390215) [16:19:27] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:23:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P76682 and previous config saved to /var/cache/conftool/dbconfig/20250529-162305-fceratto.json [16:23:17] (PS4) Cwhite: logstash: mw-script remove heading and outgoing_link field [puppet] - https://gerrit.wikimedia.org/r/1152095 (https://phabricator.wikimedia.org/T390215) [16:23:33] (PS5) Cwhite: logstash: mw-script remove heading and outgoing_link field [puppet] - https://gerrit.wikimedia.org/r/1152095 (https://phabricator.wikimedia.org/T390215) [16:24:29] !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for cirrussearch[2111-2112].codfw.wmnet [16:24:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cirrussearch[2111-2112].codfw.wmnet [16:25:01] cmooney@cumin1002 netbox (PID 828103) is awaiting input [16:25:06] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cirrussearch[2112-2113].codfw.wmnet with reason: firmware update [16:25:11] ops-codfw, SRE, DC-Ops, Data-Platform-SRE (2025.05.24 - 2025.06.13): SSD firmware update for cirrussearch211[0-5] - https://phabricator.wikimedia.org/T394432#10868196
(ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2f27b7ce-4c25-4aef-bf2f-d42a7e1b8005) set by bking@cumin2002 for... [16:25:44] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2007 - cmooney@cumin1002" [16:25:49] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2007 - cmooney@cumin1002" [16:25:50] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:25:50] SRE-tools, DC-Ops, Infrastructure-Foundations, Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10868209 (bking) @Volans , the hosts cirrussearch211[2-3] are ready for your use. I've set a downtime for the next 7 days. Hit me up w... [16:26:18] jouncebot: nowandnext [16:26:18] For the next 0 hour(s) and 33 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1600) [16:26:18] In 0 hour(s) and 33 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1700) [16:26:18] In 0 hour(s) and 33 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1700) [16:26:21] (CR) Cwhite: [C:+2] logstash: mw-script remove heading and outgoing_link field [puppet] - https://gerrit.wikimedia.org/r/1152095 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite) [16:26:28] (PS1) Reedy: GenerateFancyCaptchas: Explicitly set all limits to 0 [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152096 (https://phabricator.wikimedia.org/T388531) [16:27:28] (PS1) Reedy: GenerateFancyCaptchas:
Explicitly set all limits to 0 [extensions/ConfirmEdit] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1152097 (https://phabricator.wikimedia.org/T388531) [16:29:55] (CR) Reedy: [C:+2] GenerateFancyCaptchas: Explicitly set all limits to 0 [extensions/ConfirmEdit] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1152097 (https://phabricator.wikimedia.org/T388531) (owner: Reedy) [16:29:59] (CR) Reedy: [C:+2] GenerateFancyCaptchas: Explicitly set all limits to 0 [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152096 (https://phabricator.wikimedia.org/T388531) (owner: Reedy) [16:30:34] ops-codfw, SRE, DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10868249 (Jhancock.wm) the ports have been patched. I didn't see them come up but i am detecting light. Please let me know if you need assistance from me getting these online [16:33:13] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1151787|Lua transform backend for JsonConfig Data: pages (T388434)]], [[gerrit:1151788|Chart-side support for Lua transforms (T388616)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:33:19] T388434: JsonConfig remote-data-with-Lua-transform API query - https://phabricator.wikimedia.org/T388434 [16:33:19] T388616: Expose Data: Lua filter interface to Charts via the .chart format setup - https://phabricator.wikimedia.org/T388616 [16:33:32] lookin good so far [16:35:39] !log bvibber@deploy1003 bvibber: Continuing with sync [16:37:30] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2007.codfw.wmnet on all recursors [16:37:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2007.codfw.wmnet on all recursors [16:38:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P76683 and previous config saved to /var/cache/conftool/dbconfig/20250529-163812-fceratto.json [16:40:48] (PS18) Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental [puppet] - https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) [16:40:52] (CR) Effie Mouzeli: kubernetes:mediawiki_runner: introduce mw-experimental (10 comments) [puppet] - https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [16:41:52] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [16:42:02] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[16:42:39] (Merged) jenkins-bot: GenerateFancyCaptchas: Explicitly set all limits to 0 [extensions/ConfirmEdit] (wmf/1.45.0-wmf.2) - https://gerrit.wikimedia.org/r/1152097 (https://phabricator.wikimedia.org/T388531) (owner: Reedy) [16:42:40] (Merged) jenkins-bot: GenerateFancyCaptchas: Explicitly set all limits to 0 [extensions/ConfirmEdit] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152096 (https://phabricator.wikimedia.org/T388531) (owner: Reedy) [16:43:40] ops-codfw, DC-Ops: Alert for device lsw1-b7-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T395588 (phaultfinder) NEW [16:44:37] (CR) Scott French: "Thanks, Effie! I like the idea of basing this on a clone of another service - definitely made review simpler :) So, two thoughts:" [deployment-charts] - https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: Effie Mouzeli) [16:45:19] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1149720 (https://phabricator.wikimedia.org/T262612) (owner: Ebernhardson) [16:49:17] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151787|Lua transform backend for JsonConfig Data: pages (T388434)]], [[gerrit:1151788|Chart-side support for Lua transforms (T388616)]] (duration: 40m 11s) [16:49:23] T388434: JsonConfig remote-data-with-Lua-transform API query - https://phabricator.wikimedia.org/T388434 [16:49:23] T388616: Expose Data: Lua filter interface to Charts via the .chart format setup - https://phabricator.wikimedia.org/T388616 [16:49:42] woot ok one more config for tst: [16:49:44] test [16:50:08] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1151747
(https://phabricator.wikimedia.org/T395516) (owner: Bvibber) [16:50:54] (Merged) jenkins-bot: Enable Lua transform switch for Charts on test and beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1151747 (https://phabricator.wikimedia.org/T395516) (owner: Bvibber) [16:52:45] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1151747|Enable Lua transform switch for Charts on test and beta (T395516)]] [16:52:49] T395516: Beta deploy of Lua transforms for Charts - https://phabricator.wikimedia.org/T395516 [16:52:52] looks like i'm taking someone's rate limiter patch on backports as well :D [16:53:10] Reedy: that's you [16:53:18] bvibber: cheers [16:53:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T395241)', diff saved to https://phabricator.wikimedia.org/P76684 and previous config saved to /var/cache/conftool/dbconfig/20250529-165320-fceratto.json [16:53:22] :D [16:53:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [16:53:28] it's changes to maintenance scripts, so nothing to test [16:53:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T395241)', diff saved to https://phabricator.wikimedia.org/P76685 and previous config saved to /var/cache/conftool/dbconfig/20250529-165334-fceratto.json [16:53:43] perfect [16:53:49] yeah it looked very safe :D [16:54:03] There are a ton of errors in logspam-watch at the moment [16:54:06] Started about 15 minutes ago. [16:54:11] `i/t/TitleValue:191 Bad value for parameter $namespace: must be a int` [16:55:01] !log removing "session-mode automatic" from IBGP config on lsw1-e8-eqiad [16:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:54] bvibber: Any relation to what you're doing? [16:56:03] dancy: that might be mine [16:56:08] maybe [16:56:08] ah [16:56:18] dancy, is it testwiki?
[16:56:21] heh [16:56:51] It's hard to tell. The mediawiki-errors dashboard is not showing the same info as logspam-watch right now. [16:56:55] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1151747|Enable Lua transform switch for Charts on test and beta (T395516)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:57:14] For example, there have been 16000 of these errors in the last 15 minutes, but they're not showing in the dashboard. [16:57:52] dancy: got it, I'll take a look there. so, if it was my maintenance script just now, it would have been a brief spike around 16:55 or so. [16:58:05] !log bvibber@deploy1003 bvibber: Continuing with sync [16:58:06] if it's continuous, then it's definitely not me [16:58:22] It started around 16:39 and is ongoing. [16:58:25] not showing up in mediawiki-errors and my tests are working so far [16:58:35] but if you have a backtrace show me and i'll double-check it :D [16:58:39] dancy: ah, alright ... yeah that's not me then =/ [16:59:10] The mediawiki-errors dashboard is not reliable at the moment: https://phabricator.wikimedia.org/T390215 [16:59:14] (CR) Dreamy Jazz: SUL3: Enable client hints data on the auth shared domain (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1152064 (https://phabricator.wikimedia.org/T395185) (owner: D3r1ck01) [16:59:25] I'll have to do some manual digging. Bleh [17:00:06] bd808: OwO what's this, a deployment window?? Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1700).
nyaa~ [17:00:06] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1700) [17:01:23] bvibber: https://phabricator.wikimedia.org/P76686 [17:01:50] ok that is mine it looks like :D lemme track it down [17:01:51] damn [17:02:30] ah it's hidden in deferred updates [17:02:34] that's why i never saw it [17:03:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T395241)', diff saved to https://phabricator.wikimedia.org/P76687 and previous config saved to /var/cache/conftool/dbconfig/20250529-170259-fceratto.json [17:06:03] PROBLEM - BFD status on ssw1-f1-codfw.mgmt is CRITICAL: Down: 4 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:06:22] bvibber: This is an unbreak now situation. I need to step away for a bit but I hope to come back to a cleaner logspam-watch. :-) [17:06:47] ^^ this bfd alert must be because of me, working on those switches [17:06:47] (21,450 errors logged in the last 15 minutes) [17:06:53] no servers connected so no need to worry [17:06:56] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151747|Enable Lua transform switch for Charts on test and beta (T395516)]] (duration: 14m 10s) [17:07:01] fixing [17:07:01] T395516: Beta deploy of Lua transforms for Charts - https://phabricator.wikimedia.org/T395516 [17:07:10] FIRING: [2x] BFDdown: BFD session down between ssw1-e1-codfw and 10.192.253.141 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-e1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:07:13] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:07:27] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:08:03] (PS1) Bvibber: Fix type error in GlobalJsonLinks processing [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152103 (https://phabricator.wikimedia.org/T395593) [17:08:03] RECOVERY - BFD status on ssw1-f1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:08:26] (CR) TrainBranchBot: [C:+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1152103 (https://phabricator.wikimedia.org/T395593) (owner: Bvibber) [17:08:36] ok attempting to deploy the fix [17:08:46] sorry all, i wish i'd caught that before i pushed the full deploy :D [17:10:08] dancy: ok this should clear it up once it goes through [17:10:25] !log volans@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2112.codfw.wmnet [17:10:57] !log volans@cumin1003 START - Cookbook sre.hosts.reboot-single for host cirrussearch2112.codfw.wmnet [17:11:13] damn i guess i just need to improve my test coverage [17:12:10] FIRING: [14x] BFDdown: BFD session down between lsw1-f3-codfw and 10.192.253.152 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:13:11] (PS3) Jforrester: Drop Chart roll-out dblists, no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/1151751 (https://phabricator.wikimedia.org/T383079) [17:17:10] RESOLVED: [14x] BFDdown: BFD session down between lsw1-f3-codfw and 10.192.253.152 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [17:18:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P76688 and previous config saved to /var/cache/conftool/dbconfig/20250529-171807-fceratto.json [17:21:43] !log volans@cumin1003
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2112.codfw.wmnet [17:21:44] !log volans@cumin1003 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cirrussearch2112.codfw.wmnet [17:22:09] (03Merged) 10jenkins-bot: Fix type error in GlobalJsonLinks processing [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152103 (https://phabricator.wikimedia.org/T395593) (owner: 10Bvibber) [17:22:33] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1152103|Fix type error in GlobalJsonLinks processing (T395593)]] [17:22:39] T395593: Regression in GlobalJsonLinks deferred updates in JsonConfig - https://phabricator.wikimedia.org/T395593 [17:24:48] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1152103|Fix type error in GlobalJsonLinks processing (T395593)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:25:28] !log bvibber@deploy1003 bvibber: Continuing with sync [17:25:50] (03PS1) 10Cwhite: logstash: drop logs from DumpIndex.php [puppet] - 10https://gerrit.wikimedia.org/r/1152108 (https://phabricator.wikimedia.org/T390215) [17:26:19] bvibber: if you could ping me when you're done, I'd like to merge some changes during the infrastructure window [17:26:40] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10868485 (10cmooney) >>! In T387504#10868249, @Jhancock.wm wrote: > the ports have been patched. I didn't see them come up but i am detecting light. Please let me know if you need assistance... [17:27:20] swfrench-wmf: as soon as this finishes you're free to go :D [17:27:22] sorry for running over [17:27:44] no worries! 
I had a conflict during the first part of the window anyway :) [17:27:44] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10868488 (10Volans) @bking great, thanks a lot. I've already done `cirrussearch2112` with my latest version of the patch. I'll do ``cirru... [17:27:57] hehe [17:27:58] (03CR) 10CI reject: [V:04-1] logstash: drop logs from DumpIndex.php [puppet] - 10https://gerrit.wikimedia.org/r/1152108 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [17:31:44] (03PS1) 10Andrew Bogott: nova vendordata: remove /etc/resolv.conf after purging systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/1152110 [17:32:37] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152103|Fix type error in GlobalJsonLinks processing (T395593)]] (duration: 10m 04s) [17:32:42] T395593: Regression in GlobalJsonLinks deferred updates in JsonConfig - https://phabricator.wikimedia.org/T395593 [17:32:47] swfrench-wmf: all yours! [17:32:53] thanks! [17:32:59] dancy: that error log should start clearing up now [17:33:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P76689 and previous config saved to /var/cache/conftool/dbconfig/20250529-173314-fceratto.json [17:33:59] (03CR) 10Scott French: [C:03+2] mediawiki: Remove backwards compatibility path for running php directly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148491 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [17:34:20] (03PS2) 10Andrew Bogott: nova vendordata: remove /etc/resolv.conf after purging systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/1152110 [17:34:44] bvibber: Thanks! I'll check it when I get back. 
[17:35:28] (03CR) 10Cwhite: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1152108 (https://phabricator.wikimedia.org/T390215) (owner: 10Cwhite) [17:35:58] (03CR) 10Andrew Bogott: [C:03+2] nova vendordata: remove /etc/resolv.conf after purging systemd-resolved [puppet] - 10https://gerrit.wikimedia.org/r/1152110 (owner: 10Andrew Bogott) [17:36:31] (03Merged) 10jenkins-bot: mediawiki: Remove backwards compatibility path for running php directly [deployment-charts] - 10https://gerrit.wikimedia.org/r/1148491 (https://phabricator.wikimedia.org/T378479) (owner: 10RLazarus) [17:44:30] !log swfrench@deploy1003 Started scap sync-world: Clear noop helmfile diffs from gerrit change r/1148491 - T378479 [17:44:35] T378479: Allow using helper scripts inside of mwscript-k8s - https://phabricator.wikimedia.org/T378479 [17:45:01] (03PS1) 10BCornwall: varnish: Implement translation analytics vars [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) [17:46:59] !log swfrench@deploy1003 Finished scap sync-world: Clear noop helmfile diffs from gerrit change r/1148491 - T378479 (duration: 02m 29s) [17:47:28] (03PS8) 10BCornwall: varnish: Replace analytics fake headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147912 (https://phabricator.wikimedia.org/T373550) [17:48:07] PROBLEM - Disk space on an-worker1109 is CRITICAL: DISK CRITICAL - free space: / 2084 MB (3% inode=93%): /tmp 2084 MB (3% inode=93%): /var/tmp 2084 MB (3% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1109&var-datasource=eqiad+prometheus/ops [17:48:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T395241)', diff saved to https://phabricator.wikimedia.org/P76690 and previous config saved to /var/cache/conftool/dbconfig/20250529-174821-fceratto.json [17:48:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [17:48:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T395241)', diff saved to https://phabricator.wikimedia.org/P76691 and previous config saved to /var/cache/conftool/dbconfig/20250529-174847-fceratto.json [17:49:13] FYI, I'm all done [17:52:11] (03PS8) 10BCornwall: varnish: Replace date/stamp headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) [17:52:11] (03PS9) 10BCornwall: varnish: Replace analytics fake headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147912 (https://phabricator.wikimedia.org/T373550) [17:52:38] (03CR) 10BCornwall: "Reworded the stdlog message for resets to make them a little more concise." [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:53:08] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:53:19] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[17:54:26] (03CR) 10BCornwall: [V:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:54:33] (03CR) 10BCornwall: [V:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1147912 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [17:57:09] (03PS1) 10Clare Ming: ext.wikimediaEvents: Add XLab PageVisit instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152115 (https://phabricator.wikimedia.org/T393918) [17:58:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152115 (https://phabricator.wikimedia.org/T393918) (owner: 10Clare Ming) [18:00:05] dancy and andre: MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T1800). Please do the needful. [18:01:20] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [18:01:31] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [18:02:34] o/ [18:02:53] bvibber: Confirmed! [18:03:17] (03PS4) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [18:03:59] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: SSD firmware update not working in firmware cookbook - https://phabricator.wikimedia.org/T394543#10868646 (10RobH) Thank you for all the work on this and polishing it up for general SRE use! Current plan: * @volans updates cookbook t... 
[18:09:24] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1151386/5720/" [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [18:11:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T395241)', diff saved to https://phabricator.wikimedia.org/P76692 and previous config saved to /var/cache/conftool/dbconfig/20250529-181118-fceratto.json [18:12:39] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [18:12:52] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [18:13:34] (03CR) 10Scott French: [C:03+1] wikikube: decommission wikikube-worker102[6-8].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1151759 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [18:14:22] (03PS3) 10Dr0ptp4kt: xLab: Reduce staging/production logging level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151232 (https://phabricator.wikimedia.org/T394425) (owner: 10Santiago Faci) [18:15:25] (03CR) 10Scott French: [C:03+1] wikikube: decommission wikikube-worker103[23].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1151808 (https://phabricator.wikimedia.org/T383227) (owner: 10Jasmine) [18:15:39] alright, pressing the train button after getting distracted by logs. 
[18:16:41] (03CR) 10Clare Ming: [C:03+2] xLab: Reduce staging/production logging level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151232 (https://phabricator.wikimedia.org/T394425) (owner: 10Santiago Faci) [18:17:02] (03PS1) 10TrainBranchBot: group2 to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152119 (https://phabricator.wikimedia.org/T392173) [18:17:03] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152119 (https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [18:17:50] (03Merged) 10jenkins-bot: group2 to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152119 (https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [18:18:02] (03Merged) 10jenkins-bot: xLab: Reduce staging/production logging level [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151232 (https://phabricator.wikimedia.org/T394425) (owner: 10Santiago Faci) [18:21:35] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [18:21:38] !log dr0ptp4kt@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [18:21:52] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[18:22:09] !log dr0ptp4kt@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [18:26:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P76693 and previous config saved to /var/cache/conftool/dbconfig/20250529-182624-fceratto.json [18:26:41] !log dr0ptp4kt@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [18:27:34] !log dr0ptp4kt@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [18:27:39] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.3 refs T392173 [18:27:44] T392173: 1.45.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T392173 [18:34:44] (03CR) 10Santiago Faci: [C:03+1] "Looks good!" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152115 (https://phabricator.wikimedia.org/T393918) (owner: 10Clare Ming) [18:41:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P76694 and previous config saved to /var/cache/conftool/dbconfig/20250529-184132-fceratto.json [18:42:28] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10868800 (10Arnoldokoth) @SKivlehan-WMF This was merged and deployed... Kindly test and confirm if it works as expected. [18:46:06] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10868814 (10Arnoldokoth) @DMburugu Kindly approve. 
[18:46:20] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10868816 (10Arnoldokoth) a:03DMburugu [18:47:03] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10868828 (10Arnoldokoth) [18:47:04] (03CR) 10Scott French: "Thanks, Effie! Only one potentially blocking issue (the string substitution in mediawiki_runner.pp). Otherwise I think you're good to give" [puppet] - 10https://gerrit.wikimedia.org/r/1123048 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [18:50:04] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [18:52:28] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10868835 (10Arnoldokoth) Hi @Milimetric Do you approve of this request? [18:55:45] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10868842 (10Arnoldokoth) Hello @Kappakayala Do you approve this request? 
[18:56:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T395241)', diff saved to https://phabricator.wikimedia.org/P76695 and previous config saved to /var/cache/conftool/dbconfig/20250529-185639-fceratto.json [18:56:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [18:57:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:57:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T395241)', diff saved to https://phabricator.wikimedia.org/P76696 and previous config saved to /var/cache/conftool/dbconfig/20250529-185711-fceratto.json [18:58:23] bvibber: Are you still around? [18:58:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151812 (https://phabricator.wikimedia.org/T380510) (owner: 10Jdlrobson) [18:58:29] dancy: yo [18:58:34] how's it looking? [18:58:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 29 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151313 (owner: 10Jdlrobson) [18:59:19] Hi! I was looking at https://gerrit.wikimedia.org/r/q/project:mediawiki/extensions/JsonConfig and I see that you've done stuff in that repo. I'm wondering if you can help with https://phabricator.wikimedia.org/T395604 and/or https://phabricator.wikimedia.org/T395368 [19:00:11] dancy: i think i can help with those, looks like just sloppy code dealing with the json objects [19:00:39] 🙏🏾 Many thanks! 
It's getting spammy [19:05:47] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on relforge[1003-1004,1008-1009].eqiad.wmnet with reason: noisy alerts [19:12:59] (03CR) 10CDanis: [C:03+1] cache::haproxy: remove unused if stanzas [puppet] - 10https://gerrit.wikimedia.org/r/1152028 (owner: 10Giuseppe Lavagetto) [19:15:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:16:49] (03CR) 10Cwhite: centrallog: Add a temporary rsyslog debug config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [19:19:10] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10868893 (10Arnoldokoth) [19:20:57] (03PS1) 10Andrew Bogott: Nova vendordata: prime resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/1152131 [19:30:00] (03CR) 10Scott French: [C:03+1] mediawiki/apache: redirect tj.*.org to tg.*.org for all projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [19:32:57] (03PS2) 10Andrew Bogott: Nova vendordata: prime resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/1152131 [19:43:56] (03PS1) 10Jasmine: Add tj to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1152136 (https://phabricator.wikimedia.org/T393803) [19:47:05] (03CR) 10Dzahn: [C:03+1] "usually for languages added to this file we would link to a decision of the language committee or show that this is a valid ISO 639 langua" [dns] - 10https://gerrit.wikimedia.org/r/1152136 (https://phabricator.wikimedia.org/T393803) (owner: 10Jasmine) 
[19:49:18] (03CR) 10Pppery: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [19:50:33] (03CR) 10Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [19:51:42] (03PS1) 10Ahmon Dancy: Revert "Use buildkit wmf-v0.21.1 on WMCS and trusted runners" [puppet] - 10https://gerrit.wikimedia.org/r/1152138 [19:52:16] (03PS2) 10Ahmon Dancy: Revert "Use buildkit wmf-v0.21.1 on WMCS and trusted runners" [puppet] - 10https://gerrit.wikimedia.org/r/1152138 (https://phabricator.wikimedia.org/T393856) [19:53:59] mutante: Would you be willing to merge and deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152138 for me? [19:54:25] looking [19:54:45] ah, a downgrade [19:54:48] Nod. [19:55:17] ok, i just hope downgrades work just like upgrades, heh [19:55:41] (03CR) 10Andrew Bogott: [C:03+2] Nova vendordata: prime resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/1152131 (owner: 10Andrew Bogott) [19:55:45] I can clean up if there are any remnants causing problems. I'll be testing post-downgrade as well. [19:55:56] great [19:56:02] (03CR) 10Dzahn: [C:03+2] Revert "Use buildkit wmf-v0.21.1 on WMCS and trusted runners" [puppet] - 10https://gerrit.wikimedia.org/r/1152138 (https://phabricator.wikimedia.org/T393856) (owner: 10Ahmon Dancy) [19:56:30] a clean revert makes it easier to merge, yea [19:56:45] one moment, we have another merge happening in parallel [19:57:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T395241)', diff saved to https://phabricator.wikimedia.org/P76697 and previous config saved to /var/cache/conftool/dbconfig/20250529-195729-fceratto.json [19:58:44] dancy: done on prod puppetserver. may have to sync to cloud puppetmaster now. 
[19:59:38] thx [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T2000) [20:00:05] cscott, Krinkle, cjming, ebernhardson, and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] \o [20:00:20] o/ [20:00:37] o/ i would like to try out spiderpig for my 2 changes if that's okay but as i'm new to deploying wouldn't feel comfortable doing all the changes in the window [20:00:38] i can deploy unless someone else is vying to [20:00:56] (i have not used spiderpig before) [20:01:01] me neither! [20:01:16] i'm curious to take it for a spin [20:01:38] Who's going first? [20:02:24] I don't mind if we are okay with me jumping the queue? [20:02:34] fine by me [20:02:51] ok let me try this out. Entering one time password now.. [20:03:00] i can do ebernhardson's and my patches thereafter [20:03:19] I have a beta cluster and test wiki change - I assume it's okay for them to go out together? [20:03:23] i'm here, and proficient in spiderpig. (also I broke spiderpig: https://phabricator.wikimedia.org/T395575 ) [20:03:24] cjming: If you haven't requested spiderpig access yet, you'll need to do that first. [20:03:30] i'm in ! [20:03:36] ah, sweet! [20:03:48] I'm going to hit "start backport" if that's okay on those 2 patches. [20:03:55] any objections? [20:03:56] So you'll be able to monitor Jdlrobson's deployment. 
[20:04:01] Go for it [20:04:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151812 (https://phabricator.wikimedia.org/T380510) (owner: 10Jdlrobson) [20:04:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151313 (owner: 10Jdlrobson) [20:04:42] so cool - the rave reviews are legit [20:05:19] exciting stuff [20:05:22] Jdlrobson: go for it:) [20:05:32] (03Merged) 10jenkins-bot: Enable Minerva typeahead search on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151812 (https://phabricator.wikimedia.org/T380510) (owner: 10Jdlrobson) [20:05:34] (03Merged) 10jenkins-bot: Enable ReadingList special page on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151313 (owner: 10Jdlrobson) [20:05:47] !log jdlrobson@deploy1003 Started scap sync-world: Backport for [[gerrit:1151812|Enable Minerva typeahead search on beta cluster (T380510)]], [[gerrit:1151313|Enable ReadingList special page on test wiki]] [20:05:52] T380510: Update Minerva to use new core TypeaheadSearch - https://phabricator.wikimedia.org/T380510 [20:05:57] also this is must be a record for longest time from WMF hire to backport to production: 13 years and 3 months [20:06:14] lol [20:06:31] cjming: so you can see also the status right? (building container images) [20:06:36] i can! [20:06:40] that's awesome [20:06:44] o/ [20:06:46] love the way I can also see who did the last backport [20:07:45] !log jdlrobson@deploy1003 jdlrobson: Backport for [[gerrit:1151812|Enable Minerva typeahead search on beta cluster (T380510)]], [[gerrit:1151313|Enable ReadingList special page on test wiki]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:07:54] ok checking the test servers now [20:08:16] I can confirm the test wiki change is working as expected. [20:08:49] I am continuing with sync now. [20:08:51] !log jdlrobson@deploy1003 jdlrobson: Continuing with sync [20:08:59] i'm assuming we still have to proceed in a queue? no simultaneous scap backports? [20:09:11] I guess I don't even need to document what I'm doing now as the bot seems pretty good at that. [20:09:12] that is correct [20:09:19] (at least not the fact I'm syncing :)) [20:09:55] yes, you will be able to link with a single URL to the whole thing and use it elsewhere:) https://spiderpig.wikimedia.org/jobs/126 [20:10:14] ("that is correct" in response to cjming and simul-pigs, not w/r/t documenting things) [20:10:23] ty [20:12:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P76698 and previous config saved to /var/cache/conftool/dbconfig/20250529-201236-fceratto.json [20:14:08] (03CR) 10Scott French: [C:03+1] "Thanks, @dzahn@wikimedia.org - I think this is useful context to include in the commit message, together with capturing what the overall g" [dns] - 10https://gerrit.wikimedia.org/r/1152136 (https://phabricator.wikimedia.org/T393803) (owner: 10Jasmine) [20:14:32] ok almost there. [20:14:50] @cjming I'll hand over to you after it's done, right? [20:15:07] sure [20:15:12] I will do more next time, promise.. just want to ease myself in :) [20:15:19] nw! [20:16:01] !log jdlrobson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151812|Enable Minerva typeahead search on beta cluster (T380510)]], [[gerrit:1151313|Enable ReadingList special page on test wiki]] (duration: 10m 13s) [20:16:06] T380510: Update Minerva to use new core TypeaheadSearch - https://phabricator.wikimedia.org/T380510 [20:16:07] ok done! 
over to you cjming [20:16:08] Jdlrobson: congrats [20:16:10] so if it's ok, i'll do in this order: erik's, cscott, timo, then mine [20:16:12] \o/ [20:16:21] kk [20:16:22] that was effortless and amazing. [20:16:27] ^^ [20:16:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149720 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [20:17:52] (03Merged) 10jenkins-bot: Turn on glent m1 AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1149720 (https://phabricator.wikimedia.org/T262612) (owner: 10Ebernhardson) [20:18:06] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1149720|Turn on glent m1 AB test (T262612)]] [20:18:11] T262612: Run an A/B test using suggestions generated using glent Method 1 - https://phabricator.wikimedia.org/T262612 [20:20:00] !log cjming@deploy1003 ebernhardson, cjming: Backport for [[gerrit:1149720|Turn on glent m1 AB test (T262612)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:20:21] ebernhardson: ^^ [20:21:24] cjming: all looks good [20:21:29] !log cjming@deploy1003 ebernhardson, cjming: Continuing with sync [20:22:46] (03PS5) 10Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) [20:24:24] cscott: do you want to self-deploy after current job is done? [20:24:30] Krinkle: same Q ^^ [20:24:33] sure [20:24:53] cjming: I'm happy to have you do it if that's alright. [20:24:57] np! [20:26:24] (i'm still following along) [20:27:14] mutante question: What if I forget to logout after a session? Will my session auto expire ? 
[20:27:43] Jdlrobson: you timed it perfectly (for becoming a bonafide deployer) - spiderpig is a game changer [20:27:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P76699 and previous config saved to /var/cache/conftool/dbconfig/20250529-202743-fceratto.json [20:28:26] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1149720|Turn on glent m1 AB test (T262612)]] (duration: 10m 19s) [20:28:30] T262612: Run an A/B test using suggestions generated using glent Method 1 - https://phabricator.wikimedia.org/T262612 [20:28:34] cscott: all you [20:28:39] Jdlrobson: pretty sure it will. yes [20:28:43] at some point [20:28:45] whee [20:30:19] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy1003 using scap backport" [extensions/UploadWizard] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152080 (https://phabricator.wikimedia.org/T395023) (owner: 10C. Scott Ananian) [20:33:43] Jdlrobson: If you forget that you have a deployment in progress, someone will eventually notice and complain to you. If you don't respond, any other spiderpig user can cancel your deployment. [20:35:50] it was finished. it was just about logout after being done i think. [20:39:43] out of curiosity, for time-consuming backports (10+ minutes), presumably it's still ok to manually merge those and in the meantime run other config patches thru spiderpig? [20:40:31] (03Merged) 10jenkins-bot: Campaign: Ensure `
` wrapper is removed [extensions/UploadWizard] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152080 (https://phabricator.wikimedia.org/T395023) (owner: 10C. Scott Ananian) [20:40:42] ok, here we go! [20:40:45] !log cscott@deploy1003 Started scap sync-world: Backport for [[gerrit:1152080|Campaign: Ensure `
` wrapper is removed (T395023)]] [20:40:50] T395023: UploadWizard campaigns field text parsing broken - https://phabricator.wikimedia.org/T395023 [20:42:38] !log cscott@deploy1003 cscott: Backport for [[gerrit:1152080|Campaign: Ensure `
` wrapper is removed (T395023)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:42:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T395241)', diff saved to https://phabricator.wikimedia.org/P76700 and previous config saved to /var/cache/conftool/dbconfig/20250529-204251-fceratto.json [20:43:11] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [20:45:27] (03CR) 10Scott French: [C:03+1] "Same whitespace nit, but otherwise LGTM." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151208 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [20:45:34] testing [20:48:49] !log cscott@deploy1003 cscott: Continuing with sync [20:52:46] cjming: Yes, use any optimizations that you understand and feel comfortable with. [20:53:00] Really it's just understanding that what is merged is what will be deployed. [20:53:36] sounds good [20:54:12] (03PS2) 10Krinkle: noc: Fix invalid `max-age: 300` syntax to `max-age=300` in fileserve.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151236 (https://phabricator.wikimedia.org/T341859) [20:55:03] (03PS5) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [20:55:46] !log cscott@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152080|Campaign: Ensure `
` wrapper is removed (T395023)]] (duration: 15m 01s) [20:55:51] T395023: UploadWizard campaigns field text parsing broken - https://phabricator.wikimedia.org/T395023 [20:56:11] done! [20:56:24] thx [20:56:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151236 (https://phabricator.wikimedia.org/T341859) (owner: 10Krinkle) [20:56:31] (03CR) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [20:57:01] (03CR) 10Scott French: [C:03+1] "As long as Id0d907e554ee28f096ada2ca325b64a77f55e7af is merged, the superfluous exclude for the geoip policy is removed, and the handful o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147787 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [20:57:03] (03PS1) 10Dzahn: zuul: create profile to setup system user and group [puppet] - 10https://gerrit.wikimedia.org/r/1152145 [20:57:12] (03Merged) 10jenkins-bot: noc: Fix invalid `max-age: 300` syntax to `max-age=300` in fileserve.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151236 (https://phabricator.wikimedia.org/T341859) (owner: 10Krinkle) [20:57:25] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1151236|noc: Fix invalid `max-age: 300` syntax to `max-age=300` in fileserve.php (T341859)]] [20:57:31] T341859: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 [20:57:50] Krinkle: is your patch testable? ok to sync when ready? [20:58:00] Testable, yes. [20:59:20] !log cjming@deploy1003 cjming, krinkle: Backport for [[gerrit:1151236|noc: Fix invalid `max-age: 300` syntax to `max-age=300` in fileserve.php (T341859)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:59:25] then please test [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250529T2100) [21:00:09] apparently not, I thought post-k8s that noc.wikimedia.org supports routing to mw-debug instead of mw-misc, apparently not. [21:00:17] This patch is for https://noc.wikimedia.org/conf/ [21:00:29] so very low risk, feel free to roll out cjming, I'll have to test it afterward. [21:00:35] this domain doesn't support WikimediaDebug yet [21:00:39] ok [21:00:43] !log cjming@deploy1003 cjming, krinkle: Continuing with sync [21:01:32] (03CR) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [21:03:03] (03CR) 10Scott French: [C:03+1] profile::kubernetes::deployment_server: add new mw-experimental release (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148300 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [21:07:12] (03CR) 10Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: 10Dzahn) [21:07:48] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151236|noc: Fix invalid `max-age: 300` syntax to `max-age=300` in fileserve.php (T341859)]] (duration: 10m 22s) [21:07:52] T341859: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 [21:08:10] Krinkle: should be live :) [21:08:16] moving onto my patches [21:08:35] (03PS2) 10Phuedx: EventStreamConfig: Remove xLab development streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151148 (https://phabricator.wikimedia.org/T393918) [21:09:15] ack [21:09:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1151148 (https://phabricator.wikimedia.org/T393918) (owner: 10Phuedx) [21:10:09] (03Merged) 10jenkins-bot: EventStreamConfig: Remove xLab development streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151148 (https://phabricator.wikimedia.org/T393918) (owner: 10Phuedx) [21:10:25] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1151148|EventStreamConfig: Remove xLab development streams (T393918)]] [21:10:30] T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918 [21:10:37] (03PS2) 10Dzahn: zuul: create profile to setup system user and group [puppet] - 10https://gerrit.wikimedia.org/r/1152145 [21:12:11] (03PS3) 10Dzahn: zuul: create profile to setup system user and group [puppet] - 10https://gerrit.wikimedia.org/r/1152145 [21:12:18] !log cjming@deploy1003 cjming, phuedx: Backport for [[gerrit:1151148|EventStreamConfig: Remove xLab development streams (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:12:58] !log cjming@deploy1003 cjming, phuedx: Continuing with sync [21:13:20] (03CR) 10Dzahn: "starting with the part that seems already clear.. 
reducing the size of future changes :)" [puppet] - 10https://gerrit.wikimedia.org/r/1152145 (owner: 10Dzahn) [21:20:00] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151148|EventStreamConfig: Remove xLab development streams (T393918)]] (duration: 09m 34s) [21:20:06] T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918 [21:20:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152115 (https://phabricator.wikimedia.org/T393918) (owner: 10Clare Ming) [21:23:07] (03PS1) 10Andrew Bogott: ldap::client::config: manage /etc/ldap dir [puppet] - 10https://gerrit.wikimedia.org/r/1152155 [21:23:29] (03CR) 10CI reject: [V:04-1] ldap::client::config: manage /etc/ldap dir [puppet] - 10https://gerrit.wikimedia.org/r/1152155 (owner: 10Andrew Bogott) [21:24:06] (03PS2) 10Andrew Bogott: ldap::client::config: manage /etc/ldap dir [puppet] - 10https://gerrit.wikimedia.org/r/1152155 [21:24:28] (03CR) 10CI reject: [V:04-1] ldap::client::config: manage /etc/ldap dir [puppet] - 10https://gerrit.wikimedia.org/r/1152155 (owner: 10Andrew Bogott) [21:24:38] (03CR) 10Scott French: mw-experimental: create new service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150762 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [21:24:44] (03PS3) 10Andrew Bogott: ldap::client::config: manage /etc/ldap dir [puppet] - 10https://gerrit.wikimedia.org/r/1152155 [21:24:58] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1152155 (owner: 10Andrew Bogott) [21:28:18] (03Merged) 10jenkins-bot: ext.wikimediaEvents: Add XLab PageVisit instrument [extensions/WikimediaEvents] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152115 (https://phabricator.wikimedia.org/T393918) (owner: 10Clare Ming) [21:28:32] !log cjming@deploy1003 
Started scap sync-world: Backport for [[gerrit:1152115|ext.wikimediaEvents: Add XLab PageVisit instrument (T393918 T392313)]] [21:28:38] T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918 [21:28:39] T392313: [Epic] SDS 2.4.11 Run a Synthetic A/A Experiment - https://phabricator.wikimedia.org/T392313 [21:29:04] (03CR) 10Andrew Bogott: [C:03+2] ldap::client::config: manage /etc/ldap dir [puppet] - 10https://gerrit.wikimedia.org/r/1152155 (owner: 10Andrew Bogott) [21:30:25] !log cjming@deploy1003 cjming: Backport for [[gerrit:1152115|ext.wikimediaEvents: Add XLab PageVisit instrument (T393918 T392313)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:31:04] !log cjming@deploy1003 cjming: Continuing with sync [21:33:02] (03CR) 10Scott French: "Thanks, Effie!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150769 (https://phabricator.wikimedia.org/T395284) (owner: 10Effie Mouzeli) [21:38:26] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152115|ext.wikimediaEvents: Add XLab PageVisit instrument (T393918 T392313)]] (duration: 09m 54s) [21:38:32] T393918: Instrumentation for Synthetic A/A Test (SDS 2.4.11) - https://phabricator.wikimedia.org/T393918 [21:38:33] T392313: [Epic] SDS 2.4.11 Run a Synthetic A/A Experiment - https://phabricator.wikimedia.org/T392313 [21:41:04] !log end of UTC late backport window [21:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:05] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:28:15] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [22:28:34] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:30:13] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:30:17] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:31:03] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:34:05] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:34:59] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:38:05] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [22:38:15] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 555 bytes in 7.222 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:38:39] (03PS1) 10Arlolra: Remove wgParserEnableLegacyHeadingDOM option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152165 (https://phabricator.wikimedia.org/T371756) [22:39:03] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:39:26] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:42:55] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:43:05] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:43:34] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:49:17] (03PS1) 10Bvibber: Validation fix for saving Data: .chart pages with transforms [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152167 (https://phabricator.wikimedia.org/T395631) [22:50:10] i have a fix for another small regression i introduced with my deploy today, anyone mind giving a +2 so i can shove it out today and not wait til monday? 
:D https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Chart/+/1152166 [22:54:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:55:53] (03PS5) 10BCornwall: varnish: Replace X-Page-ID with variable [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) [22:55:55] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:56:05] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:56:19] (03CR) 10BCornwall: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [22:56:22] (03CR) 10BCornwall: [V:03+1] varnish: Replace X-Page-ID with variable [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [22:59:43] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:04:43] FIRING: [4x] RipeAtlasAnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:09:43] RESOLVED: [4x] RipeAtlasAnchorUnreachable: ipv4 
ping to eqsin RIPE Atlas anchor: failures over threshold for measurement 95145503 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:10:49] (03PS6) 10BCornwall: varnish: Replace X-Page-ID with variable [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) [23:13:20] (03CR) 10Jdlrobson: [C:03+1] Remove wgParserEnableLegacyHeadingDOM option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152165 (https://phabricator.wikimedia.org/T371756) (owner: 10Arlolra) [23:15:01] (03CR) 10BCornwall: [V:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1152118 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [23:15:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:18:21] (03PS3) 10BCornwall: varnish: Implement translation analytics vars [puppet] - 10https://gerrit.wikimedia.org/r/1152114 (https://phabricator.wikimedia.org/T373550) [23:22:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy1003 using scap backport" [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152167 (https://phabricator.wikimedia.org/T395631) (owner: 10Bvibber) [23:23:44] (03Merged) 10jenkins-bot: Validation fix for saving Data: .chart pages with transforms [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1152167 (https://phabricator.wikimedia.org/T395631) (owner: 10Bvibber) [23:23:58] !log bvibber@deploy1003 Started scap sync-world: Backport for [[gerrit:1152167|Validation fix for saving Data: .chart pages with transforms (T395631)]] [23:24:02] T395631: Chart validation goes into infinite loop 
on saving chart descriptions with transform args - https://phabricator.wikimedia.org/T395631 [23:25:54] !log bvibber@deploy1003 bvibber: Backport for [[gerrit:1152167|Validation fix for saving Data: .chart pages with transforms (T395631)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:27:03] !log bvibber@deploy1003 bvibber: Continuing with sync [23:27:06] confirmed good [23:33:58] !log bvibber@deploy1003 Finished scap sync-world: Backport for [[gerrit:1152167|Validation fix for saving Data: .chart pages with transforms (T395631)]] (duration: 10m 00s) [23:34:09] T395631: Chart validation goes into infinite loop on saving chart descriptions with transform args - https://phabricator.wikimedia.org/T395631 [23:34:17] whee [23:39:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152172 [23:39:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152172 (owner: 10TrainBranchBot) [23:49:04] (03CR) 10Bartosz Dziewoński: [C:03+1] Remove wgParserEnableLegacyHeadingDOM option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1152165 (https://phabricator.wikimedia.org/T371756) (owner: 10Arlolra) [23:54:22] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1152172 (owner: 10TrainBranchBot) [23:55:50] (03CR) 10Cwhite: [C:03+1] "The rsyslog config looks right, but I haven't tested it. Maybe worth testing manually if you haven't already. 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse)