[00:00:07] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1133575 (https://phabricator.wikimedia.org/T389734) (owner: Tim Starling)
[00:00:58] (Merged) jenkins-bot: Temporarily disable Lua profiler [mediawiki-config] - https://gerrit.wikimedia.org/r/1133575 (https://phabricator.wikimedia.org/T389734) (owner: Tim Starling)
[00:01:36] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1133575|Temporarily disable Lua profiler (T389734)]]
[00:01:39] T389734: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors - https://phabricator.wikimedia.org/T389734
[00:01:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2060:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2060 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:08:28] !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1133575|Temporarily disable Lua profiler (T389734)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:08:31] T389734: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors - https://phabricator.wikimedia.org/T389734
[00:09:36] !log tstarling@deploy1003 tstarling: Continuing with sync
[00:12:43] (CR) C. Scott Ananian: [C:+1] Enable Parsoid Read Views to incubator and dagwiki mobile frontend [mediawiki-config] - https://gerrit.wikimedia.org/r/1133141 (https://phabricator.wikimedia.org/T380768) (owner: Isabelle Hurbain-Palatin)
[00:15:36] !log zabe@mwmaint1002:~$ cat group2.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php {} --deletedump /home/zabe/afl_text_table_deletedump/{} --dump /home/zabe/afl_text_table_dump/{} --sleep 0.4" # T381599
[00:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:39] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599
[00:16:41] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133575|Temporarily disable Lua profiler (T389734)]] (duration: 15m 04s)
[00:16:43] T389734: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors - https://phabricator.wikimedia.org/T389734
[00:32:12] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[00:39:34] !log starting `nodetool garbagecollect` on Cassandra/sessionstore2006
[00:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:31] (CR) Subramanya Sastry: [C:+1] Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property [core] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1133581 (https://phabricator.wikimedia.org/T390420) (owner: C. Scott Ananian)
[01:00:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2026:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2026 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:56:35] (PS1) Andrew Bogott: backy2: pin 1.x version of sqlalchemy [puppet] - https://gerrit.wikimedia.org/r/1133588
[01:56:43] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1133588 (owner: Andrew Bogott)
[01:56:59] (CR) CI reject: [V:-1] backy2: pin 1.x version of sqlalchemy [puppet] - https://gerrit.wikimedia.org/r/1133588 (owner: Andrew Bogott)
[01:58:25] (PS2) Andrew Bogott: backy2: pin 1.x version of sqlalchemy [puppet] - https://gerrit.wikimedia.org/r/1133588
[01:59:47] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1133588 (owner: Andrew Bogott)
[02:02:29] (CR) Andrew Bogott: [C:+2] backy2: pin 1.x version of sqlalchemy [puppet] - https://gerrit.wikimedia.org/r/1133588 (owner: Andrew Bogott)
[02:04:28] jhathaway: trying a different patch, I get a bunch of password prompts when trying to puppet merge
[02:19:41] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706528 (phaultfinder)
[02:19:41] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10706529 (phaultfinder)
[02:24:33] ops-eqiad, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922 (phaultfinder) NEW
[02:39:43] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10706540 (phaultfinder)
[02:42:12] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:59:43] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706542 (phaultfinder)
[03:10:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2026:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2026 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:28:56] (PS1) JHathaway: test [puppet] - https://gerrit.wikimedia.org/r/1133591
[03:29:19] andrewbogott: strange, running a test merge now
[03:29:38] (CR) JHathaway: [C:+2] test [puppet] - https://gerrit.wikimedia.org/r/1133591 (owner: JHathaway)
[03:29:44] SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10706572 (Krinkle)
[03:34:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706576 (phaultfinder)
[03:36:47] (PS1) HMonroy: Enable Codex and Multiblocks in German wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121)
[03:39:30] jhathaway: I'm about to go to bed, but, do you see it?
[03:40:01] andrewbogott: thanks for catching it, I see what the issue is, should be fairly easy to fix thanks
[03:40:21] great! Will my phantom patches get merged as a side-effect?
[03:42:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[03:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:44:40] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10706579 (phaultfinder)
[03:46:17] good question, I didn't see them when I tried to merge my patch
[03:49:43] (PS1) JHathaway: puppetserver: fix sudo user for deploy [puppet] - https://gerrit.wikimedia.org/r/1133593 (https://phabricator.wikimedia.org/T385995)
[03:49:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: HMonroy)
[03:49:57] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1133593 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[03:52:57] (CR) JHathaway: [C:+2] puppetserver: fix sudo user for deploy [puppet] - https://gerrit.wikimedia.org/r/1133593 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[03:59:22] (PS1) JHathaway: Revert "test" [puppet] - https://gerrit.wikimedia.org/r/1133594
[04:01:22] (CR) JHathaway: [C:+2] Revert "test" [puppet] - https://gerrit.wikimedia.org/r/1133594 (owner: JHathaway)
[04:02:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[04:24:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706601 (phaultfinder)
[04:34:00] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[04:57:12] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:19:00] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:27:12] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:29:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0600)
[06:00:05] marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0600)
[06:07:48] (PS1) Kevin Bazira: EventStreamConfig: Add RRLA prediction_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179)
[06:20:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:29:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706634 (phaultfinder)
[06:40:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:45:08] (CR) Slyngshede: [C:+2] Permission: Prevent request of unconfigured permission [software/bitu] - https://gerrit.wikimedia.org/r/1133365 (https://phabricator.wikimedia.org/T390837) (owner: Slyngshede)
[06:45:56] (PS1) Elukey: role::deployment_server::kubernetes: limit Docker concurrent uploads [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251)
[06:46:20] (PS1) Muehlenhoff: Switch ganeti3007 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133741
[06:47:45] (Merged) jenkins-bot: Permission: Prevent request of unconfigured permission [software/bitu] - https://gerrit.wikimedia.org/r/1133365 (https://phabricator.wikimedia.org/T390837) (owner: Slyngshede)
[06:48:57] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5202/co" [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[06:49:49] (PS2) Elukey: role::deployment_server::kubernetes: limit Docker concurrent uploads [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251)
[06:52:28] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5203/co" [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[06:52:41] (CR) Elukey: role::deployment_server::kubernetes: limit Docker concurrent uploads [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[06:53:47] (CR) Elukey: "I think it is a reasonable test to do, we can easily revert in case it is too slow or not suitable for scap use cases." [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[06:54:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet
[06:55:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:57:05] (PS4) Slyngshede: Release version 0.1.9 [software/bitu] - https://gerrit.wikimedia.org/r/1133357
[06:57:18] (CR) Muehlenhoff: [C:+2] Switch ganeti3007 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133741 (owner: Muehlenhoff)
[07:00:04] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:00:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet
[07:01:14] (PS1) Kevin Bazira: ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179)
[07:02:08] (CR) Slyngshede: [C:+2] Release version 0.1.9 [software/bitu] - https://gerrit.wikimedia.org/r/1133357 (owner: Slyngshede)
[07:02:49] (PS2) Elukey: services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1132039 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:02:49] (PS3) Elukey: services: update codfw changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:02:49] (PS3) Elukey: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:04:58] (Merged) jenkins-bot: Release version 0.1.9 [software/bitu] - https://gerrit.wikimedia.org/r/1133357 (owner: Slyngshede)
[07:05:15] (CR) Elukey: [C:+2] services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1132039 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:05:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:07:10] jouncebot: next
[07:07:10] In 0 hour(s) and 52 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0800)
[07:07:19] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[07:07:55] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[07:08:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet
[07:08:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet
[07:08:49] (CR) Alexandros Kosiaris: [C:+1] "LGTM, worthy of a test" [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[07:09:00] FIRING: [9x] ProbeDown: Service ganeti3007:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:09:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706674 (phaultfinder)
[07:10:07] (CR) Vgutierrez: [C:+1] "it looks like you forgot to push the 9.2.10 tag:" [debs/trafficserver] - https://gerrit.wikimedia.org/r/1133553 (https://phabricator.wikimedia.org/T390912) (owner: Ssingh)
[07:10:37] (PS1) Marostegui: installserver: Do not reimage db2241 [puppet] - https://gerrit.wikimedia.org/r/1133744
[07:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:12:25] (CR) Elukey: [C:+2] role::deployment_server::kubernetes: limit Docker concurrent uploads [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[07:12:56] (CR) Marostegui: [C:+2] installserver: Do not reimage db2241 [puppet] - https://gerrit.wikimedia.org/r/1133744 (owner: Marostegui)
[07:13:39] (PS1) Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854)
[07:14:58] (CR) CI reject: [V:-1] admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: Alexandros Kosiaris)
[07:18:07] (PS1) Muehlenhoff: Switch ganeti3008 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133749
[07:22:27] !log restart docker on deploy1003 to pick up max-concurrent-uploads=1 - T390251
[07:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:30] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251
[07:24:57] (CR) DCausse: [C:+2] "thanks!" [alerts] - https://gerrit.wikimedia.org/r/1133556 (owner: Ryan Kemper)
[07:26:16] (CR) Fabfur: [C:+2] hiera: enable TLS on volatile storage in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1133405 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[07:26:30] (Merged) jenkins-bot: ElevatedMaxLagWDQS: operate only on wdqs traffic [alerts] - https://gerrit.wikimedia.org/r/1133556 (owner: Ryan Kemper)
[07:27:17] !log disabling puppet on A:cp-ulsfo to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133405 (T384227)
[07:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:20] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[07:27:24] (PS2) Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854)
[07:27:48] (CR) Joely Rooke WMDE: [C:+1] "Ready for BACON I think!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) (owner: Seanleong-wmde)
[07:28:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3008.esams.wmnet
[07:31:18] !log applying patch to use TLS on tmpfs on A:cp-ulsfo (T384227)
[07:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:30] (CR) Ilias Sarantopoulos: [C:+1] admin-ng/mlserve: Remove ratelimit in istio sidecar [deployment-charts] - https://gerrit.wikimedia.org/r/1133381 (https://phabricator.wikimedia.org/T388817) (owner: Klausman)
[07:32:59] (CR) CI reject: [V:-1] admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: Alexandros Kosiaris)
[07:33:08] (CR) Muehlenhoff: [C:+2] Switch ganeti3008 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133749 (owner: Muehlenhoff)
[07:36:34] (PS2) Kevin Bazira: ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179)
[07:36:58] !log added spiderpig-access LDAP group T390338
[07:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:00] T390338: Create 'spiderpig-access' ldap group - https://phabricator.wikimedia.org/T390338
[07:37:33] (PS2) Muehlenhoff: Add a canonical list of sensitive LDAP groups [puppet] - https://gerrit.wikimedia.org/r/1133325
[07:38:07] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet
[07:38:14] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet
[07:38:36] (PS1) Volans: dnsdisc: make it compatible with bookworm [software/spicerack] - https://gerrit.wikimedia.org/r/1133808
[07:39:44] (CR) Ilias Sarantopoulos: [C:+1] ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[07:40:54] (PS1) Muehlenhoff: Bitu: Add approval role for spiderpig-access LDAP group [puppet] - https://gerrit.wikimedia.org/r/1133810 (https://phabricator.wikimedia.org/T390338)
[07:41:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet
[07:41:54] (CR) Muehlenhoff: [C:+2] Add a canonical list of sensitive LDAP groups [puppet] - https://gerrit.wikimedia.org/r/1133325 (owner: Muehlenhoff)
[07:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:44:01] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet
[07:44:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet
[07:47:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[07:47:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet
[07:47:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3008.esams.wmnet
[07:49:00] FIRING: [9x] ProbeDown: Service ganeti3008:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:54:07] (PS1) Alexandros Kosiaris: Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812
[07:54:27] (PS2) Alexandros Kosiaris: Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812
[07:54:32] (CR) CI reject: [V:-1] Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812 (owner: Alexandros Kosiaris)
[07:54:49] !log failover ganeti masters in esams to ganeti3007/3008
[07:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:26] (CR) Kevin Bazira: [C:+2] ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[07:57:34] (PS3) Alexandros Kosiaris: Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812 (https://phabricator.wikimedia.org/T384944)
[07:57:53] (Merged) jenkins-bot: ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[08:00:05] dancy and andre: MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0800). Please do the needful.
[08:00:39] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706791 (phaultfinder)
[08:02:16] (CR) Slyngshede: [C:+1] "neat" [puppet] - https://gerrit.wikimedia.org/r/1133810 (https://phabricator.wikimedia.org/T390338) (owner: Muehlenhoff)
[08:04:19] (CR) Vgutierrez: [C:+1] Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[08:04:54] (PS3) Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854)
[08:05:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3005.esams.wmnet
[08:05:15] (PS1) Muehlenhoff: Switch ganeti3005 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133814
[08:06:09] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:07:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[08:09:36] (PS1) Slyngshede: IDM: upgrade to Bitu version 0.1.9 [dns] - https://gerrit.wikimedia.org/r/1133815
[08:10:32] (CR) Elukey: admin_ng: Preserve Server header in ingressgateway (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: Alexandros Kosiaris)
[08:12:05] (CR) Slyngshede: [C:+2] IDM: upgrade to Bitu version 0.1.9 [dns] - https://gerrit.wikimedia.org/r/1133815 (owner: Slyngshede)
[08:12:15] !log slyngshede@dns1004 START - running authdns-update
[08:12:21] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:16:49] SRE: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10706825 (MoritzMuehlenhoff) p:Triage→Medium
[08:18:01] (CR) Muehlenhoff: [C:+2] Switch ganeti3005 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133814 (owner: Muehlenhoff)
[08:18:19] !log slyngshede@dns1004 START - running authdns-update
[08:19:21] (CR) Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: Alexandros Kosiaris)
[08:19:31] (PS4) Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854)
[08:20:06] (CR) Elukey: [C:+2] services: update codfw changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[08:20:08] (CR) Alexandros Kosiaris: [C:+2] Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[08:20:39] jouncebot: next
[08:20:39] In 1 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1000)
[08:20:42] !log slyngshede@dns1004 END - running authdns-update
[08:21:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet
[08:21:49] !log Upgrading CI Jenkins
[08:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:48] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[08:23:54] (PS1) Filippo Giunchedi: pontoon: allocate all role prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/1133817
[08:24:07] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[08:24:48] (PS2) Volans: dnsdisc: make it compatible with bookworm [software/spicerack] - https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380)
[08:24:56] (CR) Filippo Giunchedi: [C:+2] pontoon: allocate all role prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/1133817 (owner: Filippo Giunchedi)
[08:25:31] (CR) Ilias Sarantopoulos: [C:+2] ml-services: enable inference batching for requests in edit-check [deployment-charts] - https://gerrit.wikimedia.org/r/1133364 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[08:26:56] (Merged) jenkins-bot: ml-services: enable inference batching for requests in edit-check [deployment-charts] - https://gerrit.wikimedia.org/r/1133364 (https://phabricator.wikimedia.org/T386100) (owner: Ilias Sarantopoulos)
[08:26:58] (CR) Muehlenhoff: [C:+2] Bitu: Add approval role for spiderpig-access LDAP group [puppet] - https://gerrit.wikimedia.org/r/1133810 (https://phabricator.wikimedia.org/T390338) (owner: Muehlenhoff)
[08:28:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3005.esams.wmnet
[08:29:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3005.esams.wmnet
[08:31:52] (PS1) Alexandros Kosiaris: wikifunctions: Move to lvs_setup, disabling paging [puppet] - https://gerrit.wikimedia.org/r/1133821 (https://phabricator.wikimedia.org/T384944)
[08:32:12] FIRING: [9x] ProbeDown: Service ganeti3005:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch
Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [08:39:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706882 (10phaultfinder) [08:41:50] (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: render a test config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133482 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [08:42:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet [08:43:10] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: render a test config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133482 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis) [08:44:48] (03PS1) 10Muehlenhoff: Switch ganeti3006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133830 [08:45:56] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) [08:46:48] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) (duration: 00m 54s) [08:47:52] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) [08:48:54] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) (duration: 01m 03s) [08:50:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10706910 (10phaultfinder) [08:50:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10706911 (10phaultfinder) [08:52:21] (03CR) 
10Muehlenhoff: [C:03+2] Switch ganeti3006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133830 (owner: 10Muehlenhoff) [08:53:19] !log secure deleting certificates in /etc/ssl/private from A:cp-magru (T384227) [08:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:22] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [08:56:56] (03PS1) 10Joal: Update GobblinLastSuccessfulRunTooLongAgo [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) [08:58:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [08:58:51] (03PS1) 10Elukey: profile::service_proxy::envoy: add data-gateway-staging [puppet] - 10https://gerrit.wikimedia.org/r/1133848 [08:59:29] <_joe_> here [09:00:15] <_joe_> XioNoX / topranks are you doing anything with that switch? [09:00:19] what's happening?
[09:00:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet [09:00:38] <_joe_> fabfur: just excessive network traffic [09:00:42] <_joe_> !incidents [09:00:43] 5939 (ACKED) Host pfw1-eqiad - PING - Packet loss = 100% [09:00:43] 5945 (UNACKED) Primary outbound port utilisation over 80% (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet) [09:00:43] 5944 (RESOLVED) [3x] ProbeDown sre (ip4 ncredir-https:443 probes/service http_ncredir-https_ip4) [09:00:43] 5942 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [09:00:43] 5943 (RESOLVED) [2x] Primary inbound port utilisation over 80% (paged) network noc () [09:00:56] <_joe_> !ack 5945 [09:00:57] 5945 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet) [09:01:02] (03CR) 10Elukey: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: 10Alexandros Kosiaris) [09:01:17] (03CR) 10Brouberol: Update GobblinLastSuccessfulRunTooLongAgo (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [09:01:38] _joe_: no, and ar zel not working today [09:01:43] * topranks looking [09:01:46] <_joe_> oh ok [09:01:48] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5204/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey) [09:01:53] <_joe_> yeah I'm in librenms [09:02:11] (03PS2) 10Joal: Update GobblinLastSuccessfulRunTooLongAgo [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) [09:02:30] (03CR) 10Joal: Update GobblinLastSuccessfulRunTooLongAgo (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) 
[09:02:32] _joe_ it may be a big analytics job running on hadoop, it happened in the past [09:02:41] pfw1-eqiad we had problems with yesterday [09:02:51] <_joe_> yeah I would think that's the case, heh [09:03:24] !log secure deleting certificates in /etc/ssl/private from A:cp-ulsfo (T384227) [09:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:26] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [09:03:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [09:03:45] <_joe_> heh resolved while I was investigating [09:04:31] it's analytics traffic [09:04:32] https://grafana.wikimedia.org/goto/DrnxVR0HR?orgId=1 [09:04:47] I found https://yarn.wikimedia.org/proxy/application_1741864027385_464026/ that could be the culprit, not 100% sure though [09:04:52] the job is really huge [09:05:44] joal: o/ [09:06:00] if you have a moment, we got a page for a switch link almost saturated (10G) [09:06:15] nothing broken atm, but I am wondering if there is a huge job that runs on hadoop [09:06:29] it may also be somebody fetching data from presto, from what Cathal found [09:06:51] I noticed https://yarn.wikimedia.org/proxy/application_1741864027385_464026/ that is big, but you know best :) [09:07:18] that's data being fetched from an-workers (running hadoop), going towards presto afaik [09:08:00] one help is we have that profiled in qos now, so I can see it didn't squeeze out other data on the links where we have stats [09:08:01] https://grafana.wikimedia.org/goto/7RH8VR0NR?orgId=1 [09:08:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet [09:08:10] (no stats from the asw2 devices as they don't 
export this for us) [09:08:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3006.esams.wmnet [09:09:00] FIRING: [9x] ProbeDown: Service ganeti3006:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:09:15] (03CR) 10Elukey: [V:03+1] "Hey folks! Lemme know if it is something that could work, I am not 100% sure, it seems the first of its kind (but it could be useful in th" [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey) [09:10:20] topranks: ah we have qos for hadoop workers now? [09:10:40] yeah we added an iptables rule last week to de-prioritise it [09:11:00] that doesn't mean it can't push the usage on a link to maximum [09:11:10] but it does mean when that happens the other traffic gets priority [09:11:30] via iptables, interesting [09:11:31] so link maxed out, but hopefully traffic for other services unaffected, or at least impact mitigated significantly [09:11:50] iptables just does the marking of the packets on the host, the network then treats them different [09:12:56] okok, totally ignorant about how it is implemented, I'll check it later [09:15:16] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [09:15:25] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [09:18:18] (03PS1) 10MVernon: install-server: also run configure_swift_disks for apus-* [puppet] - 10https://gerrit.wikimedia.org/r/1133849 (https://phabricator.wikimedia.org/T390578) [09:20:20] (03CR) 10Marostegui: [C:03+1] install-server: also run configure_swift_disks for apus-* [puppet] - 10https://gerrit.wikimedia.org/r/1133849 (https://phabricator.wikimedia.org/T390578) (owner: 10MVernon) [09:21:21] (03PS1) 10Fabfur: hiera: enable TLS on volatile storage 
in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1133850 (https://phabricator.wikimedia.org/T384227) [09:21:25] (03CR) 10MVernon: [C:03+2] install-server: also run configure_swift_disks for apus-* [puppet] - 10https://gerrit.wikimedia.org/r/1133849 (https://phabricator.wikimedia.org/T390578) (owner: 10MVernon) [09:22:23] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133850 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [09:24:42] (03CR) 10Btullis: [C:03+2] Update GobblinLastSuccessfulRunTooLongAgo [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [09:25:58] (03Merged) 10jenkins-bot: Update GobblinLastSuccessfulRunTooLongAgo [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [09:26:00] (03PS1) 10Muehlenhoff: Add spiderpig-access to list of sensitive LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1133852 (https://phabricator.wikimedia.org/T390338) [09:27:12] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:37:12] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:37:12] FIRING: [10x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:39:14] (03CR) 10Fabfur: 
[C:03+1] wmflib,liberica: Add support for DNS healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1129326 (https://phabricator.wikimedia.org/T389211) (owner: 10Vgutierrez) [09:42:12] FIRING: [10x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:43:42] (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: 10Alexandros Kosiaris) [09:45:25] (03CR) 10Slyngshede: [C:03+1] Add spiderpig-access to list of sensitive LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1133852 (https://phabricator.wikimedia.org/T390338) (owner: 10Muehlenhoff) [09:45:36] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab_runner: increase job output_limit to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1133316 (https://phabricator.wikimedia.org/T390816) (owner: 10Jelto) [09:48:39] (03Merged) 10jenkins-bot: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: 10Alexandros Kosiaris) [09:49:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10707127 (10phaultfinder) [09:51:10] (03PS1) 10Stevemunene: hdfs: Remove disk space checks for hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/1133853 (https://phabricator.wikimedia.org/T390875) [09:51:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:51:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [09:51:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [09:51:55] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:52:08] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:52:14] !log lvextend --resizefs --size +1TB vg0/srv on mwlog[12]002 [09:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:35] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:52:52] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:52:54] (03CR) 10Vgutierrez: [C:03+1] hiera: enable TLS on volatile storage in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1133850 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [09:54:51] !log deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1133745 in all k8s ingresses to stop ingressgateway from forcefully setting the HTTP server header in the responses to "istio-envoy" [09:54:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [09:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [09:56:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:58:59] !log applying 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133850 to use TLS on tmpfs on A:cp-eqsin (T384227) [09:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:02] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [09:59:40] !log disable puppet on A:cp-eqsin [09:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [09:59:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1000) [10:00:10] (03CR) 10Muehlenhoff: [C:03+2] Add spiderpig-access to list of sensitive LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1133852 (https://phabricator.wikimedia.org/T390338) (owner: 10Muehlenhoff) [10:02:14] !log fabfur@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on cp4047.ulsfo.wmnet with reason: HW errors [10:02:19] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10707144 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=93385548-b505-4318-a69f-9b083dad822a) set by fabfur@cumin1002 for 15 days, 0:00:00 on 1 host(s) and their services with reason... 
[10:02:50] (03PS1) 10Alexandros Kosiaris: admin_ng: Fix indentation of EnvoyFilter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133855 [10:04:12] (03CR) 10Fabfur: [C:03+2] hiera: enable TLS on volatile storage in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1133850 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:05:34] (03PS1) 10Muehlenhoff: Default the ganeti role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133856 (https://phabricator.wikimedia.org/T389178) [10:08:27] (03CR) 10Alexandros Kosiaris: [C:03+2] admin_ng: Fix indentation of EnvoyFilter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133855 (owner: 10Alexandros Kosiaris) [10:10:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host apus-fe2003.codfw.wmnet with OS bookworm [10:10:21] (03CR) 10Superpes15: "Couldn't you add itwiki in the same patch to avoid double work? Being a very light change in the code there shouldn't be any issue imho" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [10:10:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10707179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host apus-fe2003.codfw.wmnet with OS bookworm [10:13:40] (03Merged) 10jenkins-bot: admin_ng: Fix indentation of EnvoyFilter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133855 (owner: 10Alexandros Kosiaris) [10:14:34] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:14:51] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:16:23] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:16:37] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[10:17:07] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [10:17:16] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:17:42] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:17:45] (03CR) 10Muehlenhoff: [C:03+2] Default the ganeti role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133856 (https://phabricator.wikimedia.org/T389178) (owner: 10Muehlenhoff) [10:17:48] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:18:42] !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:18:48] !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:20:27] (03PS1) 10Filippo Giunchedi: pontoon: fix prometheus instances_override [puppet] - 10https://gerrit.wikimedia.org/r/1133859 [10:20:50] !log akosiaris@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [10:20:55] !log akosiaris@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [10:21:17] !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:21:20] !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [10:22:07] !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [10:22:10] !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:22:25] !log akosiaris@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [10:22:30] !log akosiaris@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. 
[10:22:52] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix prometheus instances_override [puppet] - 10https://gerrit.wikimedia.org/r/1133859 (owner: 10Filippo Giunchedi) [10:25:01] 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10707237 (10cmooney) [10:25:55] (03PS1) 10Jelto: gitlab_runner: add profile::gitlab::runner::output_limit to wmcs projects [puppet] - 10https://gerrit.wikimedia.org/r/1133860 (https://phabricator.wikimedia.org/T390816) [10:26:08] (03CR) 10Filippo Giunchedi: hdfs: Remove disk space checks for hadoop worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133853 (https://phabricator.wikimedia.org/T390875) (owner: 10Stevemunene) [10:27:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe2003.codfw.wmnet with reason: host reimage [10:27:55] (03CR) 10Jelto: gitlab_runner: add profile::gitlab::runner::output_limit to wmcs projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133860 (https://phabricator.wikimedia.org/T390816) (owner: 10Jelto) [10:30:59] (03Abandoned) 10Jelto: gitlab_runner: add profile::gitlab::runner::output_limit to wmcs projects [puppet] - 10https://gerrit.wikimedia.org/r/1133860 (https://phabricator.wikimedia.org/T390816) (owner: 10Jelto) [10:32:06] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe2003.codfw.wmnet with reason: host reimage [10:35:03] 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10707267 (10cmooney) >>! In T374614#10147994, @ayounsi wrote: > Short term I think if you add `[4Gbps]` to the interface description, LibreNMS will [[ https://docs... 
[10:38:51] (03PS1) 10Ladsgroup: Bump thumbnail steps to 65% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133862 (https://phabricator.wikimedia.org/T360589) [10:40:10] jouncebot: nowandnext [10:40:10] For the next 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1000) [10:40:10] In 1 hour(s) and 19 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1200) [10:44:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133862 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:45:32] (03Merged) 10jenkins-bot: Bump thumbnail steps to 65% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133862 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:45:50] 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#10707299 (10cmooney) [10:46:10] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1133862|Bump thumbnail steps to 65% (T360589)]] [10:46:13] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:48:41] !log remove nodejs from aqs* hosts, no longer used/needed and spares us needless security rollouts T350143 [10:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:43] T350143: Write AQS 1 deprecation announcement - https://phabricator.wikimedia.org/T350143 [10:50:00] !log drain transport circuits to eqord (Chicago network pop) to prep for Junos upgrade cr2-eqord T364092 [10:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:03] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 
[10:51:20] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [10:51:55] (03PS12) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [10:53:18] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1133862|Bump thumbnail steps to 65% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:53:21] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:54:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002" [10:54:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe2003.codfw.wmnet with OS bookworm [10:54:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10707330 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host apus-fe2003.codfw.wmnet with OS bookworm completed: - apus-fe2003 (**PA... [10:55:03] (03PS5) 10Tiziano Fogli: ripe atlas anchors: icmp to http check [puppet] - 10https://gerrit.wikimedia.org/r/1127552 (https://phabricator.wikimedia.org/T388419) [10:55:07] sorry elukey, I was AFK when you pinged [10:55:19] I read the backlog and it was presto again IIUC [10:55:52] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:56:13] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10707332 (10MatthewVernon) 05Open→03Resolved OK, this is fixed, sorry about that (I'd done most of the necessary preseed changes, but had missed one). 
[10:57:49] (03PS6) 10Tiziano Fogli: ripe atlas anchors: icmp to http check [puppet] - 10https://gerrit.wikimedia.org/r/1127552 (https://phabricator.wikimedia.org/T388419) [10:58:06] joal: np! I was wondering if any big job was ongoing, or if somebody was querying data.. [10:58:58] It's not the first time we have issues with presto. It's mostly due to people querying datasets that are too big. We (DPE) need to be better at not making those datasets available... [11:00:56] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10707339 (10elukey) @Papaul @Jhancock.wm is it worth to perform another swap test like in T388684 to see if the controller does its job... [11:02:26] joal: is there any way to track if a query is being executed? [11:02:30] on presto I mean [11:02:45] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133862|Bump thumbnail steps to 65% (T360589)]] (duration: 16m 34s) [11:02:47] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:03:26] (03PS3) 10Hnowlan: jobrnuner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) [11:03:58] elukey: https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1 [11:04:51] elukey: I also posted a message for the DE team to discuss possible solutions soon (rather than late or never :) [11:05:09] (03PS1) 10Clément Goubert: mw::periodic_jobs: Pass command through untouched [puppet] - 10https://gerrit.wikimedia.org/r/1133864 (https://phabricator.wikimedia.org/T341555) [11:05:29] When there are spikes on the presto graph, you need to tunnel into the presto coord to get access to the UI to monitor what's running (https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Presto/Administration#View_the_Presto_UI) [11:05:31] (03PS4) 10Hnowlan: jobrunner:
reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) [11:05:32] (03PS1) 10Clément Goubert: mediawiki: Fix mwcron command invocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133865 (https://phabricator.wikimedia.org/T341555) [11:05:44] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133864 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [11:06:55] !log pre-pend as paths announced to codfw/eqiad from eqord to prep for JunOS upgrade T364092 [11:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:58] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [11:07:32] !log installing nodejs security updates [11:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:02] (03PS1) 10Bartosz Dziewoński: Temporary debugging code for T389728 [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868 [11:09:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868 (owner: 10Bartosz Dziewoński) [11:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:25] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - 
https://phabricator.wikimedia.org/T390855#10707360 (10Jelto) [11:14:37] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707361 (10Jelto) 05Open→03In progress p:05Triage... [11:16:54] (03PS1) 10Clément Goubert: mwcron: Import all periodic_jobs resources [puppet] - 10https://gerrit.wikimedia.org/r/1133872 (https://phabricator.wikimedia.org/T341555) [11:17:55] (03CR) 10CI reject: [V:04-1] Temporary debugging code for T389728 [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868 (owner: 10Bartosz Dziewoński) [11:23:42] (03PS2) 10Muehlenhoff: Create insetup role for WMCS with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) [11:27:25] (03PS2) 10Clément Goubert: mwcron: Import all periodic_jobs resources [puppet] - 10https://gerrit.wikimedia.org/r/1133872 (https://phabricator.wikimedia.org/T341555) [11:28:01] (03CR) 10Clément Goubert: [C:03+1] jobrunner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [11:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10707422 (10phaultfinder) [11:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10707423 (10phaultfinder) [11:30:34] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr2-codfw,cr2-eqiad,cr2-eqord,cr2-eqord IPv6,cr3-ulsfo with reason: Upgrade cr2-eqord JunOS [11:31:32] !log disable EBGP sessions to internet peers on cr2-eqord to prep for 
JunOS upgrade T364092 [11:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:35] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [11:33:17] !log reboot cr2-eqord to complete JunOS upgrade T364092 [11:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:01] !log installing Python 3.9 security updates [11:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:03] (03PS2) 10Alexandros Kosiaris: wikifunctions: Disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1133821 (https://phabricator.wikimedia.org/T384944) [11:43:39] FIRING: CoreBGPDown: Core BGP session down between cr4-ulsfo and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr4-ulsfo:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:44:00] (03PS1) 10Alexandros Kosiaris: mw-wikifunctions: Switch DNS to use ingress [dns] - 10https://gerrit.wikimedia.org/r/1133878 (https://phabricator.wikimedia.org/T384944) [11:46:34] (03CR) 10Alexandros Kosiaris: jobrunner: reimage the three remaining eqiad in-warranty jobrunners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [11:46:48] !log installing Django security updates on Bullseye [11:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:39] FIRING: [2x] CoreBGPDown: Core BGP 
session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:48:41] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707471 (10Jelto) a:03Jelto This need approval from:... [11:48:56] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707473 (10Jelto) [11:50:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet [11:52:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [11:52:44] (03PS2) 10Bartosz Dziewoński: Temporary debugging code for T389728 [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868 [11:53:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:56:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet [11:58:05] !log installing Intel microcode security updates [11:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1200) [12:04:12] (03PS3) 10Majavah: dynamicproxy: Add dependency on acme-chief cert [puppet] - 10https://gerrit.wikimedia.org/r/1133448 [12:05:35] 10ops-eqiad, 
06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10707558 (10phaultfinder) [12:05:41] (03CR) 10Majavah: [C:03+2] dynamicproxy: Add dependency on acme-chief cert [puppet] - 10https://gerrit.wikimedia.org/r/1133448 (owner: 10Majavah) [12:06:54] (03PS7) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [12:07:08] (03CR) 10Kamila Součková: [C:03+1] mediawiki: Fix mwcron command invocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133865 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:08:29] (03CR) 10Kamila Součková: [C:03+1] mw::periodic_jobs: Pass command through untouched [puppet] - 10https://gerrit.wikimedia.org/r/1133864 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:11:07] (03CR) 10Klausman: [C:03+2] admin-ng/mlserve: Remove ratelimit in istio sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133381 (https://phabricator.wikimedia.org/T388817) (owner: 10Klausman) [12:12:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [12:13:24] (03CR) 10CI reject: [V:04-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [12:13:30] (03PS1) 10Volans: .wmfconfig: add Debian bookworm build [software/cumin] - 10https://gerrit.wikimedia.org/r/1133884 [12:13:31] (03PS1) 10Volans: cli: fine-tune CLI logging [software/cumin] - 10https://gerrit.wikimedia.org/r/1133885 [12:14:04] (03PS1) 10Volans: logging: rotate files [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 [12:16:04] (03Merged) 10jenkins-bot: admin-ng/mlserve: Remove ratelimit in istio sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133381 
(https://phabricator.wikimedia.org/T388817) (owner: 10Klausman) [12:16:47] !log installing libxslt security updates [12:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:51] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707587 (10isarantopoulos) I approve [12:19:19] (03CR) 10Hnowlan: [C:03+1] mediawiki: Fix mwcron command invocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133865 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:21:49] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-test [12:22:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-test [12:23:23] (03PS1) 10Majavah: openstack: wikireplica_dns: Alias upcoming x3 cluster to s8 [puppet] - 10https://gerrit.wikimedia.org/r/1133892 (https://phabricator.wikimedia.org/T390954) [12:24:22] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:25:27] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:28:41] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [12:30:29] (03CR) 10Federico Ceratto: "This should be ready for final review." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [12:33:36] (03CR) 10Marostegui: [C:04-1] "Please see my previous comment about downtime" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [12:34:37] (03PS1) 10Fabfur: hiera: enable TLS on volatile storage in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1133897 (https://phabricator.wikimedia.org/T384227) [12:35:21] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133897 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [12:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [12:42:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [12:43:03] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707701 (10Jelto) [12:43:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10707702 (10Jhancock.wm) All good! thank you for your help! 
[12:45:31] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10707705 (10cmooney) [12:45:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10707706 (10phaultfinder) [12:46:04] (03PS8) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [12:47:49] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:48:54] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:48:59] (03CR) 10Ladsgroup: [C:03+1] openstack: wikireplica_dns: Alias upcoming x3 cluster to s8 [puppet] - 10https://gerrit.wikimedia.org/r/1133892 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [12:49:14] (03PS2) 10Alexandros Kosiaris: mw-wikifunctions: Switch DNS to use ingress [dns] - 10https://gerrit.wikimedia.org/r/1133878 (https://phabricator.wikimedia.org/T384944) [12:49:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10707716 (10phaultfinder) [12:50:28] (03CR) 10Alexandros Kosiaris: [C:03+2] wikifunctions: Disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1133821 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [12:52:40] (03PS1) 10Jelto: admin: add ozge shell user and groups [puppet] - 10https://gerrit.wikimedia.org/r/1133900 (https://phabricator.wikimedia.org/T390855) [12:52:49] (03CR) 10Muehlenhoff: [C:03+2] testreduce: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/1129878 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [12:53:26] jouncebot: now and next [12:53:26] For the next 0 hour(s) and 6 minute(s): 
Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1200) [12:53:38] joal: ack thanks! [12:53:51] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:53:51] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [12:54:25] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:55:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [12:55:42] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: move k8s prometheus1006 -> 1008 [puppet] - 10https://gerrit.wikimedia.org/r/1131302 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [12:55:58] !log move k8s instances from prometheus1006 to prometheus1008 - T383232 [12:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:01] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [12:56:40] (03CR) 10Hnowlan: jobrunner: reimage the three remaining eqiad in-warranty jobrunners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [12:56:50] !log prune now obsolete nginx packages from testreduce1002 T329529 [12:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:53] T329529: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529 [12:57:18] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529#10707755 (10MoritzMuehlenhoff) [12:58:04] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team, 13Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos 
identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707756 (10Jelto) I reached out... [12:58:32] (03CR) 10Jelto: [C:04-1] "approval from @tcipriani is still needed" [puppet] - 10https://gerrit.wikimedia.org/r/1133900 (https://phabricator.wikimedia.org/T390855) (owner: 10Jelto) [12:59:19] (03CR) 10Ozge: "looks great! thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1133900 (https://phabricator.wikimedia.org/T390855) (owner: 10Jelto) [12:59:51] (03CR) 10Ozge: [C:03+1] admin: add ozge shell user and groups [puppet] - 10https://gerrit.wikimedia.org/r/1133900 (https://phabricator.wikimedia.org/T390855) (owner: 10Jelto) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1300) [13:00:05] ihurbain, cscott, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:12] o/ [13:00:16] o/ [13:00:19] o/ [13:20:57] (03CR) 10Majavah: [C:03+2] openstack: wikireplica_dns: Alias upcoming x3 cluster to s8 [puppet] - 10https://gerrit.wikimedia.org/r/1133892 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah) [13:23:19] (03PS2) 10Giuseppe Lavagetto: Add mediawiki-common to mw-cron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133902 [13:23:22] win 5 [13:25:13] (03PS1) 10Filippo Giunchedi: prometheus: cleanup k8s instances from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) [13:25:14] (03PS1) 10Filippo Giunchedi: prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) [13:25:33] (03CR) 10Alexandros Kosiaris: [C:03+2] mw-wikifunctions: Switch DNS to use ingress [dns] - 10https://gerrit.wikimedia.org/r/1133878 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [13:25:46] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [13:25:48] !log akosiaris@dns1004 START - running authdns-update [13:25:49] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [13:27:13] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133113|Enable Parsoid Read Views on 13 wiktionaries (T390680)]], [[gerrit:1133141|Enable Parsoid Read Views to incubator and dagwiki mobile frontend (T380768 T381002)]] (duration: 19m 40s) [13:27:17] T390680: Wiktionary deploy from April ~3rd 2025 - https://phabricator.wikimedia.org/T390680 [13:27:18] T380768: Deploy Parsoid Read Views to incubator (week of ????-??-??) - https://phabricator.wikimedia.org/T380768 [13:27:18] T381002: Turn on Parsoid Read Views for Mobile Front End on dagwiki - https://phabricator.wikimedia.org/T381002 [13:27:39] finally [13:27:50] woo! 
[13:27:52] thanks taavi :) [13:28:10] !log akosiaris@dns1004 END - running authdns-update [13:28:22] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1133581|Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property (T390420)]] [13:28:25] T390420: "indicator" tag not parsed properly - https://phabricator.wikimedia.org/T390420 [13:28:54] i'm up, whee [13:29:00] FIRING: [10x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:12] !log taavi@deploy1003 scap failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.44.0-wmf.22,1.44.0-wmf.23 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug --multiversion-cli-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.148.0 --label vnd.wikimedia.mediawiki.versions=1.44.0-wmf.22,1.44.0-wmf.23 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1.
(scap version: 4.148.0) (duration: 01m 49s) [13:30:24] huh, scap backport crashed [13:30:38] latest_mw_image = mw_images_by_flavour["publish"]["image"] [13:30:38] KeyError: 'publish' [13:31:06] (03CR) 10Clément Goubert: [C:03+1] Add mediawiki-common to mw-cron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133902 (owner: 10Giuseppe Lavagetto) [13:31:12] this is why i leave backporting to the professionals [13:31:18] claime: _joe_: ^ ring any bells? [13:31:33] hmmm [13:32:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:32:15] ah, scrolling up reveals the true error [13:32:15] 13:30:12 [mediawiki-publish-81] Err:1 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/component/php81 amd64 php8.1-tideways amd64 5.0.4-16+wmf11u1 [13:32:15] 13:30:12 [mediawiki-publish-81] Could not connect to webproxy:8080 (208.80.154.74), connection timed out [13:32:18] let me retry [13:32:56] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1133581|Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property (T390420)]] [13:34:00] <_joe_> maybe someone was restarting squid :) [13:34:42] already blaming someone else :-) [13:34:43] !log taavi@deploy1003 scap failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.44.0-wmf.22,1.44.0-wmf.23 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug --multiversion-cli-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.148.0 --label
vnd.wikimedia.mediawiki.versions=1.44.0-wmf.22,1.44.0-wmf.23 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.148.0) (duration: 01m 46s) [13:34:51] it did it again [13:35:56] the active webproxy is install1004 which is maxing its cpu [13:36:44] (03Abandoned) 10Alexandros Kosiaris: Add group{0,1,2} and pretrain releases in mw-api-int staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115889 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [13:38:27] !log install1004: kill a dead `/usr/bin/apt-mark showmanual` process holding puppet runs [13:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:19] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1133581|Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property (T390420)]] [13:39:21] T390420: "indicator" tag not parsed properly - https://phabricator.wikimedia.org/T390420 [13:39:31] third time's the charm [13:40:34] (03CR) 10Ssingh: "Thanks, pushed the tag and updated the commit to fix the typo." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1133553 (https://phabricator.wikimedia.org/T390912) (owner: 10Ssingh) [13:42:11] (03CR) 10Clément Goubert: "One optional nit, otherwise I think that should work."
[puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey) [13:42:12] FIRING: [10x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:42:16] (03CR) 10Clément Goubert: [C:03+1] profile::service_proxy::envoy: add data-gateway-staging [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey) [13:45:01] !log imported imposm3 0.14.1-1 to apt.wikimedia.org for bookworm-wikimedia T389780 T381565 [13:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:05] T389780: Build and import imposm 0.14.1 plus latest bugfix - https://phabricator.wikimedia.org/T389780 [13:45:05] T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565 [13:45:16] (03CR) 10Hnowlan: [C:03+2] jobrunner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [13:46:10] cscott: please test [13:46:21] ok, thanks!
[13:46:46] !log taavi@deploy1003 cscott, taavi: Backport for [[gerrit:1133581|Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property (T390420)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:46:49] T390420: "indicator" tag not parsed properly - https://phabricator.wikimedia.org/T390420 [13:46:51] 07SRE-Unowned, 10Maps: Build and import imposm 0.14.1 plus latest bugfix - https://phabricator.wikimedia.org/T389780#10708012 (10MoritzMuehlenhoff) 05Open→03Resolved The latest imposm release plus a cherrypick of @Jgiannelos' patch has been built as 0.14.1-1 and imported to apt.wikimedia.org [13:47:44] (03PS1) 10Muehlenhoff: Reapply maps_bookworm role to maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1133915 (https://phabricator.wikimedia.org/T381565) [13:49:01] taavi: do you have time for my patch, or should i reschedule? [13:49:05] jouncebot: next [13:49:05] In 1 hour(s) and 10 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500) [13:49:07] (03PS1) 10Jelto: Ceph: add types for S3 credential and account [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) [13:49:12] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 06serviceops: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10708019 (10ABran-WMF) a:03ABran-WMF [13:49:24] MatmaRex: there's nothing after the window so we should be fine [13:49:33] ok. thanks [13:50:09] (03CR) 10Filippo Giunchedi: "To be merged next week" [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [13:50:21] (03CR) 10Filippo Giunchedi: "To be merged next week" [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [13:50:28] (03CR) 10Jelto: "sounds good to me! 
See Id8979165b96d737addc676f3abf3f088a48eda48." [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:50:34] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet [13:50:41] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet [13:51:31] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1420 to wikikube-worker1166 [13:51:51] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [13:51:58] (03PS1) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) [13:52:17] taavi: still testing, thanks! [13:54:10] taavi: ok, looks good, ok to proceed [13:54:28] thanks [13:54:29] !log taavi@deploy1003 cscott, taavi: Continuing with sync [13:55:39] !log taavi@deploy1003 scap failed: Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'write-values', '--output-file-template', '/tmp/tmp1ws3xaaw']' returned non-zero exit status 1. (scap version: 4.148.0) (duration: 16m 20s) [13:55:57] (03CR) 10FNegri: Create insetup role for WMCS with nftables and rename existing one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [13:56:02] claime: now it's failing with a helm values yaml parsing issue [13:56:13] on mwcron? 
[13:56:18] https://phabricator.wikimedia.org/P74591 [13:56:19] yeah [13:56:22] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet [13:56:24] ffs [13:56:38] lemme fix that [13:57:22] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet [13:58:19] (03CR) 10Muehlenhoff: Create insetup role for WMCS with nftables and rename existing one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [13:58:28] (03PS1) 10Clément Goubert: mw::periodic_jobs: Fix serviceops test job [puppet] - 10https://gerrit.wikimedia.org/r/1133919 [13:58:44] (03CR) 10Clément Goubert: [V:03+2 C:03+2] mw::periodic_jobs: Fix serviceops test job [puppet] - 10https://gerrit.wikimedia.org/r/1133919 (owner: 10Clément Goubert) [13:59:34] (03CR) 10Ssingh: [V:03+2 C:03+2] "Merging because no code change since last +1: fixed typo in commit message." 
[debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1133553 (https://phabricator.wikimedia.org/T390912) (owner: 10Ssingh) [14:00:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10708076 (10phaultfinder) [14:00:58] (03PS1) 10Tiziano Fogli: jobrunner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1133922 (https://phabricator.wikimedia.org/T354791) [14:01:24] (03PS1) 10Majavah: P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923 [14:02:03] taavi: oh good catch [14:02:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [14:02:05] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1420 to wikikube-worker1166 - hnowlan@cumin1002" [14:02:09] (03Abandoned) 10Tiziano Fogli: jobrunner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1133922 (https://phabricator.wikimedia.org/T354791) (owner: 10Tiziano Fogli) [14:02:10] (03CR) 10FNegri: [C:03+1] Create insetup role for WMCS with nftables and rename existing one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [14:02:18] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [14:02:28] (03PS6) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [14:02:56] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1420 to wikikube-worker1166 - hnowlan@cumin1002" 
[14:02:56] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:02:57] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1166 [14:03:01] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:03:06] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:03:23] taavi: helmfile applies cleanly with my temp fix, you can proceed with scap [14:03:44] !log taavi@deploy1003 Started scap sync-world: re-syncing 1133581 [14:03:52] thanks! see also https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133923 to maybe avoid that in the future [14:04:04] (03PS2) 10Clément Goubert: P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah) [14:04:04] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah) [14:04:09] already running a pcc [14:04:40] ack [14:04:58] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5205/console" [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah) [14:05:03] taavi: yeah, that patch was the reason for my "good catch" earlier [14:05:09] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1166 [14:05:17] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1420 to wikikube-worker1166 [14:05:36] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw1420 to wikikube-worker1166 completed: - mw1420 (**PASS**) - ✔️ Down... [14:06:13] (03CR) 10Clément Goubert: [C:03+1] "Thanks, good catch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah) [14:06:21] (03CR) 10CI reject: [V:04-1] P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah) [14:06:24] (03CR) 10Bking: [C:03+2] cirrussearch: add second canary for OpenSearch migration [puppet] - 10https://gerrit.wikimedia.org/r/1133551 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:06:36] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [puppet] - 10https://gerrit.wikimedia.org/r/1133551 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:06:38] (03PS1) 10Tiziano Fogli: auth_metrics: add recording rules for grafana widgets [puppet] - 10https://gerrit.wikimedia.org/r/1133924 (https://phabricator.wikimedia.org/T390672) [14:07:18] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [14:07:25] huh why is that failing CI [14:08:47] (03PS3) 10Majavah: P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923 [14:09:17] (03CR) 10Volans: upgrade.py: Depool, repool, update Phabricator (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [14:09:30] (03CR) 10Elukey: [C:03+1] Reapply maps_bookworm role to maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1133915 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:09:53] taavi: Host instead of Hosts, my fault [14:10:00] The CI message is wrong though [14:10:21] (03CR) 10CI reject: [V:04-1] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [14:11:13] (03CR) 10Majavah: [C:03+2] P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 
10Majavah) [14:11:19] FIRING: CloudCoreBGPDown: ... [14:11:19] Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-c8 (2620:0:861:fe0d::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-f4-eqiad:9804&var-bgp_group=prod_ebgp6&var-bgp_neighbor=cloudsw1-c8 - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [14:11:52] (03CR) 10Muehlenhoff: [C:03+2] Create insetup role for WMCS with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [14:12:43] !log taavi@deploy1003 Finished scap sync-world: re-syncing 1133581 (duration: 08m 58s) [14:12:45] cscott: yours is finally live [14:12:48] (03CR) 10Clément Goubert: [C:03+1] Profile::Mediawiki_deployment: remove deprecated debug field [puppet] - 10https://gerrit.wikimedia.org/r/1131060 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French) [14:12:48] MatmaRex: still there? 
[14:13:22] yeah [14:13:43] cool, sorry for the wait [14:13:53] your patch is live on mwdebug1001, lmk when you're done and i'll revert [14:14:08] (03CR) 10Effie Mouzeli: [C:03+1] services: use the kafka svc endpoint for Tegola [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133142 (https://phabricator.wikimedia.org/T373115) (owner: 10Elukey) [14:15:22] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [14:15:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:45] taavi: looking [14:17:12] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [14:17:15] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [14:17:56] (03PS1) 10Muehlenhoff: Create insetup role for ServiceOps with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133927 (https://phabricator.wikimedia.org/T389825) [14:18:19] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10708154 (10Jhancock.wm) i unfortunately cannot find a spare 8 TB drive. So we'd either need to try it with a 4 TB or source a disk. [14:18:41] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610 [14:18:45] taavi: are you sure it's live? 
i'm not seeing the expected logs [14:19:05] MatmaRex: can you ping me when done? [14:19:24] ok [14:20:09] the code is definitely there [14:20:15] let me try manually restarting php-fpm for good measure [14:21:13] !incidents [14:21:14] 5945 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet) [14:21:14] 5944 (RESOLVED) [3x] ProbeDown sre (ip4 ncredir-https:443 probes/service http_ncredir-https_ip4) [14:21:14] 5942 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [14:21:14] 5943 (RESOLVED) [2x] Primary inbound port utilisation over 80% (paged) network noc () [14:21:33] strange, just received a page for pfw1-eqiad [14:22:44] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:22:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610 [14:22:50] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [14:23:37] (03PS1) 10Gergő Tisza: Enable EmailAuth enforcement on group 2 for short test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390437) [14:24:21] <_joe_> jhathaway: I think it was the ack expiring? [14:24:30] hmm, maybe i was testing it wrong. 
let me try something else [14:24:39] _joe_: yes you are correct, now resolved [14:24:57] MatmaRex: i think you're just being bit by caching [14:25:17] i needed to manually invalidate cache for my user for loadFromDatabase to be called [14:25:26] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:26:09] (03PS2) 10Alexandros Kosiaris: wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1132691 (https://phabricator.wikimedia.org/T384944) [14:26:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390437) (owner: 10Gergő Tisza) [14:26:09] (03CR) 10CI reject: [V:04-1] wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1132691 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [14:26:40] taavi: yes. 
okay, i see it now [14:26:49] one second [14:27:31] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610 [14:27:44] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:28:04] (03PS2) 10Gergő Tisza: Enable EmailAuth enforcement on group 2 for short test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390662) [14:28:22] (03PS2) 10Elukey: profile::service_proxy::envoy: add data-gateway-staging [puppet] - 10https://gerrit.wikimedia.org/r/1133848 [14:28:23] jouncebot: nowandnext [14:28:23] No deployments scheduled for the next 0 hour(s) and 31 minute(s) [14:28:23] In 0 hour(s) and 31 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500) [14:28:27] hi [14:29:00] (03CR) 10Elukey: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey) [14:29:00] hnowlan: i think t.gr_ is in the queue first [14:29:36] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10708223 (10Ladsgroup) `ms-be1070` will probably alert this weekend, it's already at 93.7%. How do we depool a backend? I can't find an... [14:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10708224 (10phaultfinder) [14:29:39] okay. 
I have some pending host renames that *shouldn't* impact scap (they're already depooled and decommissioned in confctl) but I'll hold [14:30:31] taavi: i think i have everything i need, thank you [14:30:46] please restore mwdebug to normal state :) [14:30:48] thanks, restoring then [14:30:50] and done [14:30:52] tgr_: your turn! [14:30:54] tgr_: ^ [14:30:58] thx [14:31:06] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5206/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey) [14:31:13] (03CR) 10Muehlenhoff: [C:03+2] Reapply maps_bookworm role to maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1133915 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [14:31:46] hnowlan: I'm just changing a config flag so if you think it's fine to do in parallel with a scap backport, feel free [14:32:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [14:32:23] nah go ahead, there's no huge rush [14:32:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:32:55] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:33:03] (03Abandoned) 10Bartosz Dziewoński: Temporary debugging code for T389728 [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868 (owner: 10Bartosz Dziewoński) [14:33:06] (03Merged) 10jenkins-bot: Enable EmailAuth enforcement on group 2 for short test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [14:33:10] (03PS1) 10Bking: Revert "cirrussearch: add second canary for OpenSearch migration" [puppet] - 10https://gerrit.wikimedia.org/r/1133929 [14:33:20] (03CR) 10Bking: [V:03+2 C:03+2] Revert "cirrussearch: add second canary for OpenSearch migration" [puppet] - 10https://gerrit.wikimedia.org/r/1133929 (owner: 10Bking) [14:33:31] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1133928|Enable EmailAuth enforcement on group 2 for short test (T390662)]] [14:33:34] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [14:36:56] (03PS1) 10Hnowlan: wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791) [14:37:30] (03CR) 10CI reject: [V:04-1] wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [14:38:28] (03PS2) 10Hnowlan: wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791) [14:39:02] !log tgr@deploy1003 tgr: Backport for [[gerrit:1133928|Enable EmailAuth enforcement on group 2 for short test (T390662)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:39:04] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - 
https://phabricator.wikimedia.org/T390662 [14:41:42] (03PS1) 10Alexandros Kosiaris: wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1133932 (https://phabricator.wikimedia.org/T384944) [14:42:13] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2056 for ban node before reimaging - bking@cumin2002 - T388610 [14:42:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2056 for ban node before reimaging - bking@cumin2002 - T388610 [14:42:16] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [14:42:22] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2056* for ban node before reimaging - bking@cumin2002 - T388610 [14:42:27] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2056* for ban node before reimaging - bking@cumin2002 - T388610 [14:42:51] !log tgr@deploy1003 tgr: Continuing with sync [14:42:52] (03PS1) 10Hnowlan: service: remove videoscaler, jobrunner probes [puppet] - 10https://gerrit.wikimedia.org/r/1133934 (https://phabricator.wikimedia.org/T354791) [14:44:15] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1131060 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French) [14:44:20] (03CR) 10Scott French: [C:03+2] Profile::Mediawiki_deployment: remove deprecated debug field [puppet] - 10https://gerrit.wikimedia.org/r/1131060 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French) [14:45:11] (03CR) 10Volans: "Disclaimer: I'm not familiar with the RFCs but Sukhbir told me I could check the script without studying it :)" [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:45:19] (03PS2) 10Alexandros Kosiaris: wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1133932 (https://phabricator.wikimedia.org/T384944) [14:46:23] (03PS2) 10Hnowlan: service: remove videoscaler, jobrunner monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1133934 (https://phabricator.wikimedia.org/T354791) [14:47:14] (03CR) 10Vgutierrez: [C:03+1] hiera: acme_chief: add wikimedia-ech.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:48:00] (03CR) 10Elukey: [C:03+1] "LGTM, is there a reason to switch to log rotation? Ease of grepping logs etc.?" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans) [14:48:22] (03CR) 10Elukey: [C:03+1] .wmfconfig: add Debian bookworm build [software/cumin] - 10https://gerrit.wikimedia.org/r/1133884 (owner: 10Volans) [14:49:37] (03CR) 10Elukey: [C:03+1] cli: fine-tune CLI logging [software/cumin] - 10https://gerrit.wikimedia.org/r/1133885 (owner: 10Volans) [14:49:49] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133928|Enable EmailAuth enforcement on group 2 for short test (T390662)]] (duration: 16m 18s) [14:49:52] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [14:50:10] hnowlan: ^ [14:51:14] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10708353 (10elukey) >>! In T384003#10708154, @Jhancock.wm wrote: > i unfortunately cannot find a spare 8 TB drive. So we'd either need t... 
[14:51:19] FIRING: CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-c8 (172.31.255.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-f4-eqiad:9804&var-bgp_group=cloud_ebgp&var-bgp_neighbor=cloudsw1-c8 - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGP [14:52:49] (03PS1) 10Gergő Tisza: End EmailAuth enforcement group 2 test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) [14:53:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [14:54:27] (03CR) 10Kosta Harlan: [C:03+1] End EmailAuth enforcement group 2 test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [14:56:22] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:57:12] (03CR) 10Volans: "Ahem..." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans) [14:57:35] (03CR) 10Alexandros Kosiaris: [C:03+2] wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1133932 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [14:58:49] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: cleanup k8s instances from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [15:00:04] dancy and andre: Time to snap out of that daydream and deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500). [15:01:44] (03PS1) 10Alexandros Kosiaris: service: Cleanup of wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) [15:02:38] tgr_: thanks [15:02:58] (03CR) 10Elukey: [C:03+1] "/me hides" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans) [15:03:24] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1437 to wikikube-worker1167 [15:03:54] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:04:19] !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1438 to wikikube-worker1168 [15:06:30] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1166.eqiad.wmnet with OS bookworm [15:06:33] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1166 [15:06:33] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1166 [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:49] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim 
jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker1166.eqiad.wmnet with OS bookworm [15:08:07] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1437 to wikikube-worker1167 - hnowlan@cumin1002" [15:08:50] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:09:12] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1437 to wikikube-worker1167 - hnowlan@cumin1002" [15:09:12] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:09:13] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1167 [15:09:14] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [15:09:16] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:09:18] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10708490 (10RobH) Confirmed engineer visit for Monday, April 7th and opened ticket 01044010. [15:09:39] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10708491 (10MatthewVernon) We don't, there's no equivalent context in swift. I can do a bulk-vacuum on that host, either tomorrow or M... 
[15:09:43] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:10:35] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:10:36] (03CR) 10Volans: [C:03+2] logging: rotate files [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans) [15:10:40] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1167 [15:10:45] (03CR) 10Volans: [C:03+2] .wmfconfig: add Debian bookworm build [software/cumin] - 10https://gerrit.wikimedia.org/r/1133884 (owner: 10Volans) [15:10:48] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1437 to wikikube-worker1167 [15:10:53] (03CR) 10Volans: [C:03+2] cli: fine-tune CLI logging [software/cumin] - 10https://gerrit.wikimedia.org/r/1133885 (owner: 10Volans) [15:11:00] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw1437 to wikikube-worker1167 completed: - mw1437 (**PASS**) - ✔️ Down... [15:11:19] FIRING: CloudCoreBGPDown: ... 
[15:11:19] Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-d5 (2620:0:861:fe0f::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-f4-eqiad:9804&var-bgp_group=prod_ebgp6&var-bgp_neighbor=cloudsw1-d5 - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [15:11:24] (03PS2) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) [15:11:26] (03CR) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH (037 comments) [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:13:26] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1438 to wikikube-worker1168 - hnowlan@cumin1002" [15:13:32] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1438 to wikikube-worker1168 - hnowlan@cumin1002" [15:13:32] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:33] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1168 [15:13:41] jouncebot: nowandnext [15:13:41] For the next 0 hour(s) and 46 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500) 
[15:13:41] In 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1600) [15:14:32] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:14:39] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1168 [15:14:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1438 to wikikube-worker1168 [15:14:56] Seems like the scap for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133928 somehow didn't work. Is there such a thing as a scap log? [15:15:01] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708601 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw1438 to wikikube-worker1168 completed: - mw1438 (**PASS**) - ✔️ Down... [15:15:08] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:16:28] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1166.eqiad.wmnet wikikube-worker1167.eqiad.wmnet wikikube-worker1168.eqiad.wmnet on all recursors [15:16:32] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1166.eqiad.wmnet wikikube-worker1167.eqiad.wmnet wikikube-worker1168.eqiad.wmnet on all recursors [15:16:39] (03PS1) 10Reedy: Remove catching of db exception [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956) [15:16:52] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1167.eqiad.wmnet with OS bookworm [15:16:53] (03CR) 10Reedy: "not in wmf_deploy; will deal with later" [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956)
(owner: 10Reedy) [15:16:56] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1167 [15:16:56] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1167 [15:17:07] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708611 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker1167.eqiad.wmnet with OS bookworm [15:17:15] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1168.eqiad.wmnet with OS bookworm [15:17:18] !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1168 [15:17:19] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1168 [15:17:23] (03CR) 10Reedy: [C:03+2] Remove catching of db exception [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy) [15:17:28] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708613 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker1168.eqiad.wmnet with OS bookworm [15:18:13] (03CR) 10Reedy: [C:03+2] "Oh it is. Ignore that then." 
[extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy) [15:19:09] (03PS3) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) [15:20:24] (03Merged) 10jenkins-bot: logging: rotate files [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans) [15:20:40] (03CR) 10Volans: "replies inline" [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:21:16] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1166.eqiad.wmnet with reason: host reimage [15:21:19] FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-d5 (10.64.147.6) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [15:21:52] (03Merged) 10jenkins-bot: Remove catching of db exception [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy) [15:22:38] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133944|Remove catching of db exception (T390956)]] [15:22:40] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [15:23:19] 06SRE, 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 13Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10708700 (10Ahoelzl) 05Open→03Resolved [15:24:11] (03PS4) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 
10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) [15:24:15] (03CR) 10Filippo Giunchedi: [C:03+1] auth_metrics: add recording rules for grafana widgets [puppet] - 10https://gerrit.wikimedia.org/r/1133924 (https://phabricator.wikimedia.org/T390672) (owner: 10Tiziano Fogli) [15:24:19] (03CR) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:24:38] (03CR) 10CI reject: [V:04-1] utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:24:54] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1166.eqiad.wmnet with reason: host reimage [15:25:49] (03PS5) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) [15:25:53] (03PS1) 10Jforrester: wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133948 [15:26:19] FIRING: [3x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-d5 (10.64.147.6) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [15:27:15] (03Merged) 10jenkins-bot: .wmfconfig: add Debian bookworm build [software/cumin] - 10https://gerrit.wikimedia.org/r/1133884 (owner: 10Volans) [15:27:16] (03Merged) 10jenkins-bot: cli: fine-tune CLI logging [software/cumin] - 10https://gerrit.wikimedia.org/r/1133885 (owner: 10Volans) [15:27:22] (03CR) 10Ssingh: "I think I got all comments in. Thanks for the review and the suggestion on removing join. 
That was leftover from textwrap and I think I wi" [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:29:28] (03PS1) 10Gergő Tisza: Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) [15:30:33] (03CR) 10Ladsgroup: [C:03+1] Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [15:30:33] !log reedy@deploy1003 reedy: Backport for [[gerrit:1133944|Remove catching of db exception (T390956)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:30:36] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [15:30:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10708792 (10phaultfinder) [15:30:45] (03CR) 10Volans: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:30:46] (03CR) 10Kosta Harlan: [C:03+1] Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [15:30:52] I'll deploy a config bugfix, Reedy plz let me know when done [15:31:50] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1167.eqiad.wmnet with reason: host reimage [15:31:59] volans: thanks for the in-depth reviews as always <3 [15:32:09] anytime :) [15:32:30] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1168.eqiad.wmnet 
with reason: host reimage [15:33:10] !log reedy@deploy1003 reedy: Continuing with sync [15:33:21] (03PS3) 10Ssingh: hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) [15:34:00] 😍 wikimedia-ech.org [15:34:07] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5208/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:34:33] <_joe_> jouncebot: now [15:34:33] For the next 0 hour(s) and 25 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500) [15:34:42] <_joe_> jouncebot: nowandnext [15:34:42] For the next 0 hour(s) and 25 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500) [15:34:42] In 0 hour(s) and 25 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1600) [15:34:50] Amir1: ! [15:34:52] <_joe_> ok, I can merge this change safely [15:34:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1167.eqiad.wmnet with reason: host reimage [15:35:14] _joe_: we are deploying a couple of UBNs right now [15:35:51] I mean fixes to UBNs, not causing them [15:35:53] <_joe_> Amir1: yeah these will not affect scap [15:35:58] ah okay [15:36:04] <_joe_> Amir1: who says you're not creating new ones [15:36:13] one way to find out! 
[15:36:41] <_joe_> Amir1: in any case, lmk when you're done, I will still need a lock on helmfile on a couple namespaces [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:00] sure thanks! [15:37:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:38:27] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1168.eqiad.wmnet with reason: host reimage [15:38:31] (03PS1) 10Ilias Sarantopoulos: ml-services: fix edit-check blubber image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133952 [15:39:43] (03PS1) 10Ssingh: hiera: acme_chief: fix ordering of DC [puppet] - 10https://gerrit.wikimedia.org/r/1133953 [15:39:52] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10708946 (10Ladsgroup) Thanks. Let me know if I can help on anything! 
[15:40:06] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133944|Remove catching of db exception (T390956)]] (duration: 17m 28s) [15:40:08] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [15:41:13] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1166.eqiad.wmnet with OS bookworm [15:41:31] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708954 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker1166.eqiad.wmnet with OS bookworm completed: - wikik... [15:41:59] Reedy: shall we the mw config tgr_ and I go ahead or there are other patches for CN needs to be created and deployed? [15:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:55] Amir1: Your patch doesn't fix it, just shows a more useful error [15:44:08] You're GTG, but https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/1133954 needs reviewing :) [15:45:03] awesome, the config patch will be quick to deploy [15:45:25] (03CR) 10Ladsgroup: [C:03+2] Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [15:45:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [15:46:17] (03Merged) 
10jenkins-bot: Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [15:46:43] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1133949|Enable EmailAuth enforcement on group 2 for short test (#2) (T390662)]] [15:46:46] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [15:49:37] (03PS9) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [15:50:50] (03CR) 10Federico Ceratto: "Updated: switched from `--slow` pool-in to default speed (4 steps), also switched to use `wait_for_replication`" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [15:52:03] !log ladsgroup@deploy1003 tgr, ladsgroup: Backport for [[gerrit:1133949|Enable EmailAuth enforcement on group 2 for short test (#2) (T390662)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:52:04] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1167.eqiad.wmnet with OS bookworm [15:52:06] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [15:52:18] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker1167.eqiad.wmnet with OS bookworm completed: - wikik... 
[15:52:21] (03PS2) 10HMonroy: Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) [15:52:38] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on elastic2056.codfw.wmnet with reason: adding net-new role [15:53:13] (03CR) 10CI reject: [V:04-1] Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [15:53:48] !log ladsgroup@deploy1003 tgr, ladsgroup: Continuing with sync [15:54:35] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5210/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133953 (owner: 10Ssingh) [15:55:12] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1168.eqiad.wmnet with OS bookworm [15:55:29] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker1168.eqiad.wmnet with OS bookworm completed: - wikik... 
[15:55:33] (03CR) 10Kevin Bazira: [C:03+1] ml-services: fix edit-check blubber image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133952 (owner: 10Ilias Sarantopoulos) [15:56:49] (03CR) 10CI reject: [V:04-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [15:57:32] (03CR) 10Ssingh: hiera: acme_chief: fix ordering of DC [puppet] - 10https://gerrit.wikimedia.org/r/1133953 (owner: 10Ssingh) [15:58:03] !log running homer 'cr*eqiad*' commit for new wikikube workers [15:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1600). Please do the needful. [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:58] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133949|Enable EmailAuth enforcement on group 2 for short test (#2) (T390662)]] (duration: 14m 15s) [16:01:01] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [16:05:51] Reedy: that patch I have is now deployed, shall we backport to wmf_deploy? I'm actually not sure how CN code is backported. Just deploy on wmf_deploy branch? 
[16:06:07] !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1166-1168].eqiad.wmnet [16:06:09] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1166-1168].eqiad.wmnet [16:06:22] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709114 (10ops-monitoring-bot) pool host wikikube-worker[1166-1168].eqiad.wmnet by hnowlan@cumin1002 with reason: None [16:06:30] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709115 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by hnowlan@cumin1002 pool for host wikikube-worker[1166-1168].eqiad.wmnet completed: - wik... [16:06:41] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T390998 (10hnowlan) 03NEW [16:07:34] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709138 (10hnowlan) [16:09:33] (03CR) 10Superpes15: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [16:09:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709153 (10phaultfinder) [16:10:02] (03PS1) 10Reedy: Banner: Conditionally check for banner existence from primary db [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133959 (https://phabricator.wikimedia.org/T390956) [16:10:07] Amir1: Needs to go to .23 branch too [16:10:10] (03CR) 10Reedy: [C:03+2] Banner: Conditionally check for banner existence from primary db [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/1133959 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy) [16:10:20] ah, fun [16:11:19] we branch deployment branches from wmf_deploy etc [16:11:19] FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-d5 (10.64.147.4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [16:12:25] (03CR) 10Volans: upgrade.py: Depool, repool, update Phabricator (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [16:13:30] (03Merged) 10jenkins-bot: Banner: Conditionally check for banner existence from primary db [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133959 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy) [16:13:49] lets get that out [16:14:25] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133959|Banner: Conditionally check for banner existence from primary db (T390956)]] [16:14:28] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [16:15:11] thanks [16:15:42] (03PS1) 10Volans: spicerack: add Spicerack interactive shell [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) [16:16:42] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans) [16:17:44] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [16:17:47] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [16:17:48] (03PS2) 10Volans: spicerack: add Spicerack interactive shell [puppet] - 
10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) [16:19:29] (03PS3) 10Volans: spicerack: add Spicerack interactive shell [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) [16:19:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10709229 (10phaultfinder) [16:21:40] !log reedy@deploy1003 reedy: Backport for [[gerrit:1133959|Banner: Conditionally check for banner existence from primary db (T390956)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:21:43] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [16:22:28] !log reedy@deploy1003 reedy: Continuing with sync [16:22:32] !log decommissioning all but 1 eqiad jobrunner node in confctl [16:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:43] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans) [16:24:51] (03CR) 10Vgutierrez: [C:03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/1133953 (owner: 10Ssingh) [16:27:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:27:12] FIRING: [10x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:28:12] (03PS1) 10Alexandros Kosiaris: wikifunctions: Add an extra rule for internal Ingress endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133966 [16:29:35] (03PS1) 10Bking: WIP: more fine-grained shard status checks 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1133967 (https://phabricator.wikimedia.org/T383811) [16:29:39] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133959|Banner: Conditionally check for banner existence from primary db (T390956)]] (duration: 15m 13s) [16:29:42] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [16:30:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:30:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709280 (10phaultfinder) [16:30:54] (03CR) 10Alexandros Kosiaris: [C:03+2] wikifunctions: Add an extra rule for internal Ingress endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133966 (owner: 10Alexandros Kosiaris) [16:31:19] FIRING: CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-e4-eqiad:9804&var-bgp_group=prod_ebgp4&var-bgp_neighbor=cloudsw1-c8 - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPD [16:32:25] (03Merged) 10jenkins-bot: wikifunctions: Add an extra rule for internal Ingress endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133966 (owner: 10Alexandros Kosiaris) [16:32:56] (03PS1) 10Reedy: Banner: While 
saving, do exists() against primary [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133969 (https://phabricator.wikimedia.org/T390956) [16:33:00] (03CR) 10Reedy: [C:03+2] Banner: While saving, do exists() against primary [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133969 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy) [16:34:01] FIRING: [10x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:36:19] FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [16:36:23] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:36:27] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:36:33] (03Merged) 10jenkins-bot: Banner: While saving, do exists() against primary [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133969 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy) [16:36:41] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [16:36:52] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [16:37:03] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [16:37:07] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [16:37:10] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133969|Banner: While saving, do exists() against primary (T390956)]] [16:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki 
CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [16:37:13] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [16:40:55] (03CR) 10CI reject: [V:04-1] WIP: more fine-grained shard status checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133967 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [16:44:25] !log reedy@deploy1003 reedy: Backport for [[gerrit:1133969|Banner: While saving, do exists() against primary (T390956)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:44:28] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [16:51:19] FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [16:51:30] !log reedy@deploy1003 reedy: Continuing with sync [16:51:48] (03PS2) 10Esanders: Hide "Insert graph" tool in VE when graphs are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501) [16:52:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501) (owner: 10Esanders) [16:54:45] !log akosiaris@deploy1003 
helmfile [eqiad] START helmfile.d/admin 'apply'. [16:54:54] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:55:25] (03CR) 10Ssingh: [C:03+2] hiera: acme_chief: fix ordering of DC [puppet] - 10https://gerrit.wikimedia.org/r/1133953 (owner: 10Ssingh) [16:55:44] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [16:56:18] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [16:57:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:58:42] (03PS1) 10Esanders: Enable DiscussionTools visual enhancements on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264) [16:58:44] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133969|Banner: While saving, do exists() against primary (T390956)]] (duration: 21m 33s) [16:58:46] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [16:59:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders) [16:59:50] (03CR) 10Ssingh: [C:03+2] utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:00:05] bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1700). [17:00:05] swfrench-wmf: MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1700). Please do the needful. [17:00:16] o/ [17:00:17] !log sukhe@dns1004 START - running authdns-update [17:00:46] Reedy: I see you had some backports going just recently. are you done for now? [17:00:57] swfrench-wmf: it's turtles [17:01:06] lol [17:01:08] I can take a break for a bit if you've stuff you need to do :) [17:01:18] I've got another to go out after it's master merged and backported [17:01:19] FIRING: [5x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [17:02:01] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [17:02:08] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [17:02:17] Reedy: got it, thanks! yeah, I'll try to get through my change now - should take about 25-30m based on prior experience ... as long as the registry doesn't explode, that is. [17:02:38] !log sukhe@dns1004 END - running authdns-update [17:02:45] (03CR) 10Scott French: "Thanks for the review!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1133229 (owner: 10Scott French) [17:02:48] (03PS2) 10Esanders: Enable DiscussionTools visual enhancements on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264) [17:03:37] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-04-03-122108-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133973 [17:04:09] (03CR) 10Scott French: [V:03+2] "Built and verified locally." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1133229 (owner: 10Scott French) [17:04:11] (03PS1) 10Ssingh: [DO NOT MERGE] set MX records for dyna [dns] - 10https://gerrit.wikimedia.org/r/1133974 [17:04:12] (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: Rebuild to update Debian packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1133229 (owner: 10Scott French) [17:06:19] RESOLVED: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [17:07:18] (03PS1) 10Esanders: Enable DiscussionTools visual enhancements everywhere except enwiki & ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264) [17:07:45] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-04-03-122108-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133973 (owner: 10BryanDavis) [17:08:37] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133976 [17:09:17] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-04-03-122108-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133973 (owner: 10BryanDavis) [17:09:29] * swfrench-wmf offers words of encouragement to docker-registry 
[17:10:04] !log swfrench@deploy1003 Started scap sync-world: Deployment to pick up new PHP 8.1 production images [17:10:49] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:11:06] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:11:14] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:11:33] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:11:42] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:12:01] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:14:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709474 (10phaultfinder) [17:15:37] (03PS1) 10Reedy: Banner: More reading from primary... [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133984 (https://phabricator.wikimedia.org/T390956) [17:20:42] (03CR) 10Reedy: [C:03+2] Banner: More reading from primary... [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133984 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy) [17:22:17] (03PS2) 10Gergő Tisza: End EmailAuth enforcement group 2 test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) [17:23:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10709517 (10VRiley-WMF) Due to power restraints, we will need to relocte an-worker1181 to an-worker1186 in racks E8 and F8. [17:23:24] (03Merged) 10jenkins-bot: Banner: More reading from primary... 
[extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133984 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy) [17:23:28] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133948 (owner: 10Jforrester) [17:27:29] (03CR) 10Ssingh: [C:04-2] "Here is why I believe this will not work for what we are trying to do with HIBP:" [dns] - 10https://gerrit.wikimedia.org/r/1133974 (owner: 10Ssingh) [17:28:35] (03PS2) 10Superpes15: Create wikipedia-pl-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1133988 (https://phabricator.wikimedia.org/T391009) [17:29:14] (03CR) 10Dzahn: [C:03+1] Create wikipedia-pl-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1133988 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [17:29:40] (03CR) 10Dzahn: [C:03+1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1133988 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [17:30:10] (03CR) 10Dzahn: [C:03+2] Create wikipedia-pl-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1133988 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [17:30:20] !log dzahn@dns1004 START - running authdns-update [17:32:37] (03CR) 10Ladsgroup: [C:03+2] CommonSettings-labs: Update BounceHandler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133156 (owner: 10Reedy) [17:32:38] swfrench-wmf: I'm definitely interesting in seeing the time impact of serializing the pushes. 
[17:32:49] *interested [17:32:49] !log dzahn@dns1004 END - running authdns-update [17:33:49] (03Merged) 10jenkins-bot: CommonSettings-labs: Update BounceHandler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133156 (owner: 10Reedy) [17:34:07] dancy: alas, this one probably won't be super informative, as it only affects the 8.1 image "tree" (so there's really only one large blob to upload) [17:34:21] Gotcha [17:35:56] that said, the fact that this deployment appears to be working, albeit slowly (expected for a full rebuild), isn't incompatible with the idea that it's large concurrent uploads that are the trigger ... so, yay? [17:37:02] why would you want to do anything fast [17:37:15] Go fast and break registries [17:37:24] (03CR) 10Ssingh: [C:04-2] "^ Context on the above is that we are trying to add MX records for the subdomains, like enwiki, m editions, and all." [dns] - 10https://gerrit.wikimedia.org/r/1133974 (owner: 10Ssingh) [17:37:38] lol [17:37:48] "stop ddos-ing your own registries" [17:38:31] to be fair, we are throwing around some wildly large images [17:38:34] !log swfrench@deploy1003 Finished scap sync-world: Deployment to pick up new PHP 8.1 production images (duration: 28m 57s) [17:38:39] \o/ [17:38:48] I'm frequently impressed that it works at all ... [17:38:51] Reedy: all yours [17:38:51] 30 mins for a big image change... isn't bad [17:38:53] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[56] - https://phabricator.wikimedia.org/T387142#10709610 (10Jclark-ctr) [17:38:55] swfrench-wmf: you must be new here ;) [17:38:58] ta [17:39:34] I object to characterizing a few gigabytes as wildly large. It's just files. [17:39:46] !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133984|Banner: More reading from primary... 
(T390956)]], [[gerrit:1133156|CommonSettings-labs: Update BounceHandler config]] [17:39:49] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [17:40:03] We should be able to do gigabyte files in the 2020s. [17:40:06] Amir1: I'll just deploy that patch then, shall I [17:40:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T390998#10709614 (10VRiley-WMF) [17:40:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709615 (10phaultfinder) [17:40:48] I rebased the beta cluster one and so it doesn't need deployment [17:40:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T390998#10709617 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This has been completed [17:40:56] but the backport? 
I'd be grateful [17:41:36] dancy: true, to be fair(er) it's more the "read side" that surprises me (i.e., distributing GiB of image to hundreds of worker nodes fairly quickly) :) [17:41:48] reading is easy, writing is hard [17:41:51] or something [17:43:32] (03PS4) 10Ssingh: hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) [17:43:47] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10709638 (10VRiley-WMF) [17:43:59] (03PS1) 10Dzahn: hiera: cleanup gitlab-runner docker gc settings [puppet] - 10https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948) [17:44:11] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5211/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:44:19] (03PS1) 10Superpes15: Add arbcom_plwiki to private wikis on hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1133993 (https://phabricator.wikimedia.org/T391009) [17:45:16] (03CR) 10Dzahn: "before this change:" [puppet] - 10https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [17:47:48] !log reedy@deploy1003 reedy: Backport for [[gerrit:1133984|Banner: More reading from primary... 
(T390956)]], [[gerrit:1133156|CommonSettings-labs: Update BounceHandler config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:47:52] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [17:48:12] !log reedy@deploy1003 reedy: Continuing with sync [17:51:09] 🎉 [17:56:11] (03PS1) 10Superpes15: Apache config for arbcom_plwiki [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) [17:57:30] !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133984|Banner: More reading from primary... (T390956)]], [[gerrit:1133156|CommonSettings-labs: Update BounceHandler config]] (duration: 17m 43s) [17:57:32] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [18:00:05] dancy and andre: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1800). 
[18:00:10] o/ [18:02:54] (03PS1) 10Dzahn: hiera: cleanup some gerrit and etherpad hiera values [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948) [18:03:00] !log dancy@deploy1003 Installing scap version "4.149.0" for 2 host(s) [18:03:19] (03CR) 10CI reject: [V:04-1] hiera: cleanup some gerrit and etherpad hiera values [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [18:03:45] dancy: I think I'm clear now [18:04:03] T390956 should've been tagged a train blocker, but it is fixed nwo [18:04:04] T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956 [18:04:13] Reedy: thx [18:04:17] actually, let me do that for tracking purposes [18:04:19] (and then close it) [18:04:48] !log dancy@deploy1003 Installation of scap version "4.149.0" completed for 2 hosts [18:05:10] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133997 (https://phabricator.wikimedia.org/T386218) [18:05:12] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133997 (https://phabricator.wikimedia.org/T386218) (owner: 10TrainBranchBot) [18:05:24] (03CR) 10Dzahn: "There are 2 Change-Id footers here and I'm not sure which is the right one." 
[puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [18:06:00] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610 [18:06:00] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133997 (https://phabricator.wikimedia.org/T386218) (owner: 10TrainBranchBot) [18:06:02] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [18:06:22] (03PS2) 10Dzahn: hiera: cleanup some gerrit and etherpad hiera values [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948) [18:08:49] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610 [18:11:23] (03CR) 10Dzahn: [C:03+1] hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:11:50] (03PS7) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [18:12:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610 [18:12:13] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [18:16:25] (03CR) 10Dzahn: [C:03+1] P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:19:11] (03CR) 10CI reject: [V:04-1] elasticsearch rolling-operation: add 
arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:20:02] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.23 refs T386218 [18:20:05] T386218: 1.44.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T386218 [18:20:42] (03PS1) 10Bking: cirrussearch: add second canary for OpenSearch migration [puppet] - 10https://gerrit.wikimedia.org/r/1133999 (https://phabricator.wikimedia.org/T388610) [18:20:45] (03PS1) 10Dzahn: lists: send email to meta admin when steward list members are synced [puppet] - 10https://gerrit.wikimedia.org/r/1134000 (https://phabricator.wikimedia.org/T351202) [18:21:46] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610 [18:21:47] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: add second canary for OpenSearch migration [puppet] - 10https://gerrit.wikimedia.org/r/1133999 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:21:48] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [18:22:51] (03CR) 10Bking: [C:03+2] cirrussearch: add second canary for OpenSearch migration [puppet] - 10https://gerrit.wikimedia.org/r/1133999 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:24:48] (03PS8) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [18:25:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:25:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709910 (10phaultfinder) [18:29:03] (03PS9) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [18:30:15] (03PS1) 10Dzahn: mailman3: fix quoting in mail_cmd for sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1134001 (https://phabricator.wikimedia.org/T351202) [18:31:12] (03CR) 10Dzahn: [C:03+2] mailman3: fix quoting in mail_cmd for sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1134001 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:35:07] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1134000/5213/lists1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1134000 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:35:16] (03CR) 10Jforrester: "It looks like this might have broken the back-end of Wikifunctions: T391022 (though the reported timing of issues doesn't quite line up." [puppet] - 10https://gerrit.wikimedia.org/r/1133932 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [18:35:24] (03CR) 10CI reject: [V:04-1] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:37:57] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10709960 (10Neobeta61) can you try updating storcli to 007.3305.0000.0000 please DCSG01809266 (Port Of Defect DCSG01804765) Differing responses for set personality with diff... 
[18:45:02] (03PS10) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [18:45:49] (03PS11) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [18:46:04] (03PS12) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [18:53:22] (03PS1) 10Dzahn: mailman3: remove superfluous double quotes in sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1134013 (https://phabricator.wikimedia.org/T351202) [18:53:45] (03CR) 10Dzahn: [C:03+2] mailman3: remove superfluous double quotes in sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1134013 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn) [18:55:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710158 (10phaultfinder) [18:58:47] (03PS1) 10Alexandros Kosiaris: mw-wikifunctions: Add a missing SAN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134017 (https://phabricator.wikimedia.org/T384944) [19:01:16] (03PS2) 10Alexandros Kosiaris: service: Cleanup of wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) [19:01:16] (03PS1) 10Alexandros Kosiaris: mesh: Use sets_sni for mw-wikifuctions [puppet] - 10https://gerrit.wikimedia.org/r/1134020 (https://phabricator.wikimedia.org/T384944) [19:02:22] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610 [19:02:25] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban 
(exit_code=0) Banning hosts: cirrussearch* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610 [19:02:25] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [19:02:55] (03Abandoned) 10Alexandros Kosiaris: wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1132691 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [19:04:18] (03CR) 10Alexandros Kosiaris: [C:03+2] mw-wikifunctions: Add a missing SAN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134017 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [19:05:00] (03CR) 10Jforrester: service: Cleanup of wikifunctions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [19:06:01] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch2055*,cirrussearch2056* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610 [19:06:03] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch2055*,cirrussearch2056* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610 [19:10:05] (03Merged) 10jenkins-bot: mw-wikifunctions: Add a missing SAN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134017 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [19:11:43] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [19:12:00] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[19:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:12:37] !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [19:13:06] !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [19:13:17] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [19:13:19] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2056.codfw.wmnet with OS bullseye [19:13:31] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2056 [19:13:36] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [19:13:43] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:13:48] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [19:13:53] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[19:14:37] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [19:14:45] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [19:14:51] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [19:15:09] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [19:16:03] (03PS1) 10Dzahn: Revert "lists: send email to meta admin when steward list members are synced" [puppet] - 10https://gerrit.wikimedia.org/r/1134028 [19:17:20] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [19:17:23] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [19:19:11] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2056 - bking@cumin2002" [19:19:17] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2056 - bking@cumin2002" [19:19:17] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:19:18] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2056.codfw.wmnet 181.0.192.10.in-addr.arpa 1.8.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:19:21] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2056.codfw.wmnet 181.0.192.10.in-addr.arpa 1.8.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:19:22] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2056 [19:19:44] (03CR) 10Alexandros Kosiaris: [C:03+2] mesh: Use sets_sni for mw-wikifuctions [puppet] - 10https://gerrit.wikimedia.org/r/1134020 
(https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [19:19:58] (03PS2) 10Alexandros Kosiaris: mesh: Use sets_sni for mw-wikifuctions [puppet] - 10https://gerrit.wikimedia.org/r/1134020 (https://phabricator.wikimedia.org/T384944) [19:20:10] (03CR) 10Dzahn: [C:03+2] Revert "lists: send email to meta admin when steward list members are synced" [puppet] - 10https://gerrit.wikimedia.org/r/1134028 (owner: 10Dzahn) [19:20:24] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2056 [19:20:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2056 [19:20:51] (03PS3) 10Alexandros Kosiaris: service: Cleanup of mw-wikifunctions old LVS leftovers [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) [19:20:58] (03CR) 10Alexandros Kosiaris: service: Cleanup of mw-wikifunctions old LVS leftovers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [19:24:43] (03Abandoned) 10Jforrester: [tests] Ensure each config has at most one value per wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947365 (owner: 10Urbanecm) [19:29:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710308 (10phaultfinder) [19:30:56] 06SRE: NDA request coverage for KFrancis's PTO - https://phabricator.wikimedia.org/T391032#10710310 (10taavi) [19:32:01] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [19:32:54] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [19:33:04] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:33:16] 06SRE: NDA request coverage for KFrancis's PTO - 
https://phabricator.wikimedia.org/T391032#10710312 (10Dzahn) a:05Dzahn→03None fyi: @ayounsi (week of April 7th), @jijiki (week of April 12th) [19:33:18] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:34:04] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [19:34:50] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [19:35:47] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10710320 (10phaultfinder) [19:36:19] FIRING: [3x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-c8-eqiad and cloudsw1-e4 (10.64.146.254) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [19:39:47] akosiaris: Now I get `Unexpected token '<', \" So HTML rather an 'upstream connect failed' message. [19:40:20] one could call this an improvement! [19:40:43] Depends which HTML it is. :-) E.g. is that coming from an MW instance but the wrong one, or a non-MW. [19:41:19] RESOLVED: [3x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-c8-eqiad and cloudsw1-e4 (10.64.146.254) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [19:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:07] (03CR) 10Dzahn: [C:03+1] wikimedia-ech: add ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:45:08] James_F: it's the unconfigured domain html [19:45:12] with a 404 [19:45:14] Aha. 
[19:45:25] heh, mediawiki is actually sending a proper http header here [19:45:26] So… are we not passing the header correctly? [19:46:01] Or is it getting munged somehow, I guess. [19:46:04] can't be. You are passing it previously [19:46:19] Yeah, I meant "we" including ingress or whatever. [19:56:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10710364 (10VRiley-WMF) Hey @Vgutierrez we have recieved the NIC. Is there a specific time for us to install it? [19:57:18] (03PS1) 10Bking: cirrussearch: add puppet 7 hieradata to DC-specific config [puppet] - 10https://gerrit.wikimedia.org/r/1134043 (https://phabricator.wikimedia.org/T388610) [19:57:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10710373 (10VRiley-WMF) a:03VRiley-WMF [19:58:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134043 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T2000). nyaa~ [20:00:05] tgr, edsanders, and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] akosiaris: Any thoughts for the next step? 
[20:00:27] James_F: I am looking at logstash logs [20:00:36] (03PS3) 10HMonroy: Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) [20:01:10] (03PS4) 10HMonroy: Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) [20:01:28] (03CR) 10HMonroy: Enable Codex and Multiblocks in German and Italian wiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [20:01:43] I can deploy the backport window things whilst I'm here. [20:01:47] edsanders: You OK to go? [20:02:53] tgr_: OK for me to push out the EmailAuth group2 disablement? [20:05:41] ok, got it: https://lounge.uname.gr/uploads/d87fdc1baa696f18/image.png [20:05:50] Well, I'll do mine alone to be getting on with. [20:05:51] envoy is apparently overriding the domain [20:05:54] James_F: appreciated, thanks! [20:05:57] tgr_: Cool. [20:06:02] (03CR) 10Superpes15: [C:03+1] Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [20:06:02] I ended up in an unexpected meeting [20:06:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [20:06:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133948 (owner: 10Jforrester) [20:06:35] akosiaris: Is that 'easily' fixed? [20:06:55] looking [20:07:30] (03CR) 10Bking: [C:03+2] "self-merging so we can finish our reimage." 
[puppet] - 10https://gerrit.wikimedia.org/r/1134043 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:09:20] Eurgh, CI is backed up so much it's pending waiting for config patches. [20:09:36] We have test-prio but not gate-and-submit-prio, because this is not meant to happen. [20:09:51] James_F: yes [20:10:00] edsanders: Excellent, will do you second. [20:10:19] (03PS4) 10Alexandros Kosiaris: service: Cleanup of wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) [20:10:19] (03PS1) 10Alexandros Kosiaris: mesh: Use http_host as well for mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1134050 (https://phabricator.wikimedia.org/T384944) [20:10:30] James_F: ok point taken, I 'll self merge https://gerrit.wikimedia.org/r/1134050 [20:10:51] jouncebot: nowandnext [20:10:51] For the next 0 hour(s) and 49 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T2000) [20:10:51] In 0 hour(s) and 49 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T2100) [20:10:58] akosiaris: :-( Won't that break requests to wikidata.org? [20:11:04] Reedy: Patience, padawan. [20:11:10] pfft [20:11:22] I won't review your patch then [20:11:27] (03PS4) 10Reedy: search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester) [20:11:29] Sorry. 
[20:11:31] (03PS5) 10Reedy: search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester) [20:11:37] oh, you got that too, talking via the same endpoint to wikidata.org [20:11:40] (03CR) 10Reedy: [C:03+1] "GTG in a backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester) [20:11:46] akosiaris: Yes. [20:11:48] damn [20:12:13] (03Merged) 10jenkins-bot: End EmailAuth enforcement group 2 test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza) [20:12:17] (03Merged) 10jenkins-bot: wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133948 (owner: 10Jforrester) [20:12:21] Finally. [20:12:24] tgr_: can you wait on syncing the EmailAuth one please [20:12:35] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1133937|End EmailAuth enforcement group 2 test (T390662)]], [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]] [20:12:35] cc Reedy [20:12:37] T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662 [20:12:38] kostajh: Wait as in I should stop? [20:12:49] kostajh: Or wait as in pause once it hits debug? [20:13:05] James_F: wait as in, some people are talking about not deploying this. I'm chatting with them now. Sorry. [20:13:08] !log jforrester@deploy1003 sync-world aborted: Backport for [[gerrit:1133937|End EmailAuth enforcement group 2 test (T390662)]], [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]] (duration: 00m 33s) [20:13:12] Ack, aborting. 
[20:13:32] Should we hold the whole window, or should I revert the disablement and proceed with the rest? [20:13:59] (03CR) 10Jforrester: [C:04-1] "This will work for calls to wikifunctions.org but not for those to wikidata.org." [puppet] - 10https://gerrit.wikimedia.org/r/1134050 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [20:14:30] James_F: you can proceed with the rest. Just don't sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133937, please. [20:15:19] (03PS1) 10Jforrester: Revert "End EmailAuth enforcement group 2 test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134051 [20:15:23] (03CR) 10Jforrester: [C:03+2] Revert "End EmailAuth enforcement group 2 test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134051 (owner: 10Jforrester) [20:15:32] (03CR) 10Kosta Harlan: [C:03+1] "thanks, and sorry for the confusion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134051 (owner: 10Jforrester) [20:15:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710462 (10phaultfinder) [20:16:03] kostajh: No worries! 
[20:16:34] (03Merged) 10jenkins-bot: Revert "End EmailAuth enforcement group 2 test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134051 (owner: 10Jforrester) [20:17:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501) (owner: 10Esanders) [20:17:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders) [20:17:56] (03Merged) 10jenkins-bot: Hide "Insert graph" tool in VE when graphs are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501) (owner: 10Esanders) [20:17:59] (03Merged) 10jenkins-bot: Enable DiscussionTools visual enhancements on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders) [20:18:02] 06SRE, 06Traffic, 13Patch-For-Review: Create provisioning and post-provisioning checks for Traffic hosts to confirm validity of varying hardware configurations - https://phabricator.wikimedia.org/T378724#10710479 (10CDobbins) On 4/2, we discussed the merits and pitfalls of the proposed implementation with @V... 
[20:18:15] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]], [[gerrit:1123620|Hide "Insert graph" tool in VE when graphs are disabled (T387501)]], [[gerrit:1133972|Enable DiscussionTools visual enhancements on zhwiki (T379264)]], [[gerrit:1134051|Revert "End EmailAuth enforcement group 2 test"]] [20:18:19] T387501: Remove "Insert Graph" from VE for now - https://phabricator.wikimedia.org/T387501 [20:18:19] T379264: Offer Usability Improvements as default-on feature at English Wikipedia and remaining wikis - https://phabricator.wikimedia.org/T379264 [20:18:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester) [20:18:54] (03PS1) 10Esanders: Mobile insert menu: Exclude media and signature tools [extensions/VisualEditor] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134053 (https://phabricator.wikimedia.org/T385851) [20:19:05] edsanders: Do you need that backport happening too? [20:20:15] Yeah [20:22:51] along with a config change... [20:22:52] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128374 [20:22:52] Ack. Please add to the window once it's ready. I'll do these 5(!) together first though. 
[20:22:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/VisualEditor] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134053 (https://phabricator.wikimedia.org/T385851) (owner: 10Esanders) [20:22:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388604) (owner: 10Esanders) [20:22:52] !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.10-1wm1_amd64.changes: T379797 [20:22:52] SAL is down hmm [20:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:59] T379797: Package and deploy ATS 9.2.6 - https://phabricator.wikimedia.org/T379797 [20:41:59] !log jforrester@deploy1003 esanders, jforrester: Backport for [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]], [[gerrit:1123620|Hide "Insert graph" tool in VE when graphs are disabled (T387501)]], [[gerrit:1133972|Enable DiscussionTools visual enhancements on zhwiki (T379264)]], [[gerrit:1134051|Revert "End EmailAuth enforcement group 2 test"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:41:59] edsanders: Please check on mw-debug if things look OK. [20:41:59] James_F: looking [20:41:59] zhwiki looks good (1/4) [20:41:59] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10710501 (10phaultfinder) [20:41:59] edsanders: How's the rest of the checking going? Can I help? [20:41:59] James_F: hotpatched [20:41:59] it now works [20:41:59] I'm not seeing the mobile insert menu... [20:41:59] edsanders: That's not in this deploy? 
It's waiting on finishing this one first. [20:41:59] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:41:59] but ... please don't deploy wikifunctions until I've submitted the gerrit change to fix this [20:41:59] akosiaris: The WF deploy is just a config change, unrelated. [20:41:59] James_F: ah good - so this is just the zhwiki and the graph patch? [20:41:59] akosiaris: Nice! [20:41:59] edsanders: Yeah. OK to go. [20:41:59] edsanders: ? [20:41:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:41:59] James_F: yep - graph tool is hidden [20:41:59] akosiaris: Or does your hotpatch touch the MW-land code? [20:41:59] James_F: nope, not at all. it's only an envoy config change specifically at the orchestrator [20:41:59] akosiaris: OK, so can I proceed with the MW-config deploy? [20:41:59] yup, go ahead [20:41:59] Ack. [20:41:59] !log jforrester@deploy1003 esanders, jforrester: Continuing with sync [20:41:59] I am exhausted, it's 13 hours I am in front of a computer, need a break [20:41:59] akosiaris: <3 [20:41:59] akosiaris: Confirm we'll absolutely not be pushing anything to deployment-charts on a Thursday night / Friday. Get some reset. [20:41:59] Also rest. Freudian. 
[20:41:59] ❤️ [20:41:59] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [20:41:59] just to point out CI is unhappy currently due to cloud stuff [20:41:59] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:41:59] It never rains but it pours. [20:41:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2056.codfw.wmnet with OS bullseye [20:41:59] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]], [[gerrit:1123620|Hide "Insert graph" tool in VE when graphs are disabled (T387501)]], [[gerrit:1133972|Enable DiscussionTools visual enhancements on zhwiki (T379264)]], [[gerrit:1134051|Revert "End EmailAuth enforcement group 2 test"]] (duration: 21m 39s) [20:41:59] edsanders: OK, next set is your two mobile menu ones. [20:41:59] James_F: looking [20:41:59] edsanders: No no, still merging. [20:41:59] ok [20:41:59] might be a while [20:42:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [20:42:52] Probably 12 mins. [20:42:58] !log Upload Varnish 7.1.1-1.1~bpo11+wmf2 to bullseye-wikimedia T389605 [20:44:36] scap is having connection issues to CI ("connection broken by 'RemoteDisconnected('Remote end closed connection without response')'"). Joy. 
[20:45:57] Apparently things are recovering [20:46:06] np [20:46:55] signs of life on zuul [20:53:21] * James_F twiddles more thumbs. [20:54:20] (03Merged) 10jenkins-bot: VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388604) (owner: 10Esanders) [20:54:24] Finally. [20:54:29] (03CR) 10Ebernhardson: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [20:54:33] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710628 (10phaultfinder) [20:54:40] That happened a while ago, but finally the WMCS network seems fixed. [20:55:09] (03PS1) 10Bking: cirrussearch: Add row/rack hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610) [20:55:36] (03CR) 10CI reject: [V:04-1] cirrussearch: Add row/rack hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:56:39] (03Abandoned) 10Bking: WIP: more fine-grained shard status checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133967 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [20:57:56] Oy, both CI jobs are in the PostBuildScript stage and stuck. [20:59:59] (03Merged) 10jenkins-bot: Mobile insert menu: Exclude media and signature tools [extensions/VisualEditor] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134053 (https://phabricator.wikimedia.org/T385851) (owner: 10Esanders) [21:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T2100) [21:00:29] And we're off. 
[21:00:34] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1134053|Mobile insert menu: Exclude media and signature tools (T385851)]], [[gerrit:1128374|VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias (T388604)]] [21:00:38] Sorry, Web team. [21:00:38] (03PS2) 10Bking: cirrussearch: Add row/rack hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610) [21:00:45] T385851: Introduce additional tools within the mobile visual editor's "+" menu - https://phabricator.wikimedia.org/T385851 [21:00:45] T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604 [21:02:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:04:28] (03CR) 10Bking: [C:03+2] cirrussearch: Add row/rack hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:04:44] (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:05:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:05:39] edsanders: Please test on mw-debug. 
[21:06:20] !log jforrester@deploy1003 esanders, jforrester: Backport for [[gerrit:1134053|Mobile insert menu: Exclude media and signature tools (T385851)]], [[gerrit:1128374|VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias (T388604)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:24] T385851: Introduce additional tools within the mobile visual editor's "+" menu - https://phabricator.wikimedia.org/T385851 [21:06:24] T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604 [21:06:53] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2056.codfw.wmnet with OS bullseye [21:06:57] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2056 [21:06:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2056 [21:07:17] James_F: we need to hold those two patches :/ [21:07:29] edsanders: Boo. Hold as in revert? [21:07:41] edsanders: Or just revert the config change? [21:08:04] (03PS1) 10Ebernhardson: Remove unused config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429) [21:09:18] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2056.codfw.wmnet with reason: host reimage [21:09:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710675 (10phaultfinder) [21:12:25] James_F: slack debate is occurring, just a minute [21:12:29] Kemayo: Ack. [21:12:49] Thankfully Web don't seem to have turned up to deploy. [21:13:12] I think their windows normally go unused in my recent memory, which is convenient. 
[21:13:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2056.codfw.wmnet with reason: host reimage [21:13:36] Kemayo: Yes, but given that to my count at least 6 things have gone wrong today, I'll take the success. [21:18:53] James_F: revert the config change [21:19:04] James_F: both if it's easier (the other one is a no-op with the config) [21:19:13] edsanders: It's easier to only revert the config change. [21:19:22] !log jforrester@deploy1003 Sync cancelled. [21:19:45] (03PS1) 10Jforrester: Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134067 [21:20:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134067 (owner: 10Jforrester) [21:21:11] (03Merged) 10jenkins-bot: Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134067 (owner: 10Jforrester) [21:21:24] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1134067|Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias"]] [21:24:12] edsanders: I guess there's no need to test the no-op config state? [21:24:40] James_F: hopefully not [21:24:45] Ack. [21:25:04] I see no changes at the moment [21:25:16] Yeah, it's still syncing to mw-debug. [21:27:34] James_F: Ah, yes I'm seeing the change [21:27:47] edsanders: It's currently on 10 of 12 servers. [21:27:51] So you'll get it sometimes. [21:28:43] Reedy: I'm not minded to sling out 1128050 at this point, sorry. [21:28:54] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1134067|Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:29:05] edsanders: All OK? 
[21:29:21] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610 [21:29:22] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610 [21:29:23] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [21:29:35] James_F: looks good - ca.wiki is back to normal [21:29:39] !log jforrester@deploy1003 jforrester: Continuing with sync [21:29:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [21:29:44] Deployment function-orchestrator-main-orchestrator in wikifunctions at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=wikifunctions&var-deployment=function-orchestrator-main-orchestrator - ... [21:29:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [21:30:08] Oh dear. [21:30:24] That was a.kosiaris's hotpatch target. 
[21:30:32] 😬 [21:31:11] (03CR) 10Cwhite: [C:03+1] prometheus: cleanup k8s instances from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [21:31:32] (03CR) 10Cwhite: [C:03+1] prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [21:31:38] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:35:13] (03PS1) 10Andrew Bogott: trove: pin 1.x version of sqlalchemy [puppet] - 10https://gerrit.wikimedia.org/r/1134072 [21:36:17] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134072 (owner: 10Andrew Bogott) [21:36:52] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134067|Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias"]] (duration: 15m 28s) [21:37:16] !log Backport deploy done. 
[21:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134057 (owner: 10Jforrester) [21:38:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester) [21:38:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2056.codfw.wmnet with OS bullseye [21:40:12] (03PS2) 10Andrew Bogott: trove: pin 1.x version of sqlalchemy [puppet] - 10https://gerrit.wikimedia.org/r/1134072 [21:40:14] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134072 (owner: 10Andrew Bogott) [21:42:38] (03CR) 10Andrew Bogott: [C:03+2] trove: pin 1.x version of sqlalchemy [puppet] - 10https://gerrit.wikimedia.org/r/1134072 (owner: 10Andrew Bogott) [21:43:41] (03PS1) 10Bking: cirrussearch: Add cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1134078 (https://phabricator.wikimedia.org/T388610) [21:45:03] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1134078 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:49:14] I've filed T391047 for the KubernetesDeploymentUnavailableReplicas for us and silenced it (I hope). [21:49:17] T391047: function-orchestrator-main-orchestrator pods down in codfw due to issue in envoy config(?) 
- https://phabricator.wikimedia.org/T391047 [21:52:50] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049 (10ops-monitoring-bot) 03NEW [22:17:58] (03PS13) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) [22:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710917 (10phaultfinder) [22:25:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:29:21] (03PS1) 10Andrew Bogott: Revert "trove: pin 1.x version of sqlalchemy" [puppet] - 10https://gerrit.wikimedia.org/r/1134087 [22:30:04] (03CR) 10Andrew Bogott: [C:03+2] Revert "trove: pin 1.x version of sqlalchemy" [puppet] - 10https://gerrit.wikimedia.org/r/1134087 (owner: 10Andrew Bogott) [22:36:21] 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10710939 (10toni.stoev) >>! In T214998#10676078, @bd808 wrote: > @toni.stoev Please read https://www.medi... [22:47:00] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10710968 (10Jdlrobson-WMF) @Ladsgroup let me know if and how I can help with this, but untagging web team. 
[23:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:13:10] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049#10711091 (10Ladsgroup) It's a random s5 replica. I don't think we depool hosts with degraded RAID so I leave it as is until dc-ops handle it. Hot swap should be enough. [23:15:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:25:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [23:29:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10711171 (10phaultfinder) [23:29:49] (03Merged) 10jenkins-bot: Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [23:30:03] !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1133592|Enable Codex and Multiblocks in German and Italian wiki (T377121)]] [23:30:06] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [23:32:30] FIRING: Traffic bill over quota: 
Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has improved - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [23:35:37] !log tstarling@deploy1003 hmonroy, tstarling: Backport for [[gerrit:1133592|Enable Codex and Multiblocks in German and Italian wiki (T377121)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:35:40] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [23:38:37] !log tstarling@deploy1003 hmonroy, tstarling: Continuing with sync [23:40:22] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134093 [23:40:22] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134093 (owner: 10TrainBranchBot) [23:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:45:28] !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133592|Enable Codex and Multiblocks in German and Italian wiki (T377121)]] (duration: 15m 25s) [23:45:31] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [23:52:30] RESOLVED: Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has improved - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [23:52:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134093 (owner: 10TrainBranchBot)