[00:00:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1133575 (https://phabricator.wikimedia.org/T389734) (owner: Tim Starling)
[00:00:58] <wikibugs>	 (Merged) jenkins-bot: Temporarily disable Lua profiler [mediawiki-config] - https://gerrit.wikimedia.org/r/1133575 (https://phabricator.wikimedia.org/T389734) (owner: Tim Starling)
[00:01:36] <logmsgbot>	 !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1133575|Temporarily disable Lua profiler (T389734)]]
[00:01:39] <stashbot>	 T389734: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors - https://phabricator.wikimedia.org/T389734
[00:01:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2060:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2060 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:08:28] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Backport for [[gerrit:1133575|Temporarily disable Lua profiler (T389734)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:08:31] <stashbot>	 T389734: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors - https://phabricator.wikimedia.org/T389734
[00:09:36] <logmsgbot>	 !log tstarling@deploy1003 tstarling: Continuing with sync
[00:12:43] <wikibugs>	 (CR) C. Scott Ananian: [C:+1] Enable Parsoid Read Views to incubator and dagwiki mobile frontend [mediawiki-config] - https://gerrit.wikimedia.org/r/1133141 (https://phabricator.wikimedia.org/T380768) (owner: Isabelle Hurbain-Palatin)
[00:15:36] <zabe>	 !log zabe@mwmaint1002:~$ cat group2.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/AbuseFilter/maintenance/MigrateESRefToAflTable.php {} --deletedump /home/zabe/afl_text_table_deletedump/{} --dump /home/zabe/afl_text_table_dump/{} --sleep 0.4" # T381599
[00:15:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:39] <stashbot>	 T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599
[00:16:41] <logmsgbot>	 !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133575|Temporarily disable Lua profiler (T389734)]] (duration: 15m 04s)
[00:16:43] <stashbot>	 T389734: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors - https://phabricator.wikimedia.org/T389734
[00:32:12] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:37:12] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[00:39:34] <urandom>	 !log starting `nodetool garbagecollect` on Cassandra/sessionstore2006
[00:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:31] <wikibugs>	 (CR) Subramanya Sastry: [C:+1] Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property [core] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1133581 (https://phabricator.wikimedia.org/T390420) (owner: C. Scott Ananian)
[01:00:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2026:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2026 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:56:35] <wikibugs>	 (PS1) Andrew Bogott: backy2: pin 1.x version of sqlalchemy [puppet] - https://gerrit.wikimedia.org/r/1133588
[01:56:43] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1133588 (owner: Andrew Bogott)
[01:56:59] <wikibugs>	 (CR) CI reject: [V:-1] backy2: pin 1.x version of sqlalchemy [puppet] - https://gerrit.wikimedia.org/r/1133588 (owner: Andrew Bogott)
[01:58:25] <wikibugs>	 (PS2) Andrew Bogott: backy2: pin 1.x version of sqlalchemy [puppet] - https://gerrit.wikimedia.org/r/1133588
[01:59:47] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1133588 (owner: Andrew Bogott)
[02:02:29] <wikibugs>	 (CR) Andrew Bogott: [C:+2] backy2: pin 1.x version of sqlalchemy [puppet] - https://gerrit.wikimedia.org/r/1133588 (owner: Andrew Bogott)
[02:04:28] <andrewbogott>	 jhathaway: trying a different patch, I get a bunch of password prompts when trying to puppet merge
[02:19:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706528 (phaultfinder)
[02:19:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10706529 (phaultfinder)
[02:24:33] <wikibugs>	 ops-eqiad, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922 (phaultfinder) NEW
[02:39:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10706540 (phaultfinder)
[02:42:12] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:59:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706542 (phaultfinder)
[03:10:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2026:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2026 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:12:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:28:56] <wikibugs>	 (PS1) JHathaway: test [puppet] - https://gerrit.wikimedia.org/r/1133591
[03:29:19] <jhathaway>	 andrewbogott: strange, running a test merge now
[03:29:38] <wikibugs>	 (CR) JHathaway: [C:+2] test [puppet] - https://gerrit.wikimedia.org/r/1133591 (owner: JHathaway)
[03:29:44] <wikibugs>	 SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10706572 (Krinkle)
[03:34:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706576 (phaultfinder)
[03:36:47] <wikibugs>	 (PS1) HMonroy: Enable Codex and Multiblocks in German wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121)
[03:39:30] <andrewbogott>	 jhathaway: I'm about to go to bed, but, do you see it?
[03:40:01] <jhathaway>	 andrewbogott: thanks for catching it, I see what the issue is, should be fairly easy to fix thanks
[03:40:21] <andrewbogott>	 great!  Will my phantom patches get merged as a side-effect?
[03:42:03] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[03:42:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:44:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10706579 (phaultfinder)
[03:46:17] <jhathaway>	 good question, I didn't see them when I tried to merge my patch
[03:49:43] <wikibugs>	 (PS1) JHathaway: puppetserver: fix sudo user for deploy [puppet] - https://gerrit.wikimedia.org/r/1133593 (https://phabricator.wikimedia.org/T385995)
[03:49:49] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: HMonroy)
[03:49:57] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1133593 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[03:52:57] <wikibugs>	 (CR) JHathaway: [C:+2] puppetserver: fix sudo user for deploy [puppet] - https://gerrit.wikimedia.org/r/1133593 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[03:59:22] <wikibugs>	 (PS1) JHathaway: Revert "test" [puppet] - https://gerrit.wikimedia.org/r/1133594
[04:01:22] <wikibugs>	 (CR) JHathaway: [C:+2] Revert "test" [puppet] - https://gerrit.wikimedia.org/r/1133594 (owner: JHathaway)
[04:02:03] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[04:24:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706601 (phaultfinder)
[04:34:00] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:37:12] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[04:57:12] <jinxer-wm>	 FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:19:00] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:19:39] <jinxer-wm>	 FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:27:12] <jinxer-wm>	 FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:29:39] <jinxer-wm>	 RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[05:56:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0600)
[06:07:48] <wikibugs>	 (PS1) Kevin Bazira: EventStreamConfig: Add RRLA prediction_change stream [mediawiki-config] - https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179)
[06:20:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:29:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706634 (phaultfinder)
[06:40:39] <jinxer-wm>	 RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr4-ulsfo and Hurricane Electric (2001:504:0:1::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[06:45:08] <wikibugs>	 (CR) Slyngshede: [C:+2] Permission: Prevent request of unconfigured permission [software/bitu] - https://gerrit.wikimedia.org/r/1133365 (https://phabricator.wikimedia.org/T390837) (owner: Slyngshede)
[06:45:56] <wikibugs>	 (PS1) Elukey: role::deployment_server::kubernetes: limit Docker concurrent uploads [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251)
[06:46:20] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti3007 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133741
[06:47:45] <wikibugs>	 (Merged) jenkins-bot: Permission: Prevent request of unconfigured permission [software/bitu] - https://gerrit.wikimedia.org/r/1133365 (https://phabricator.wikimedia.org/T390837) (owner: Slyngshede)
[06:48:57] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5202/co" [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[06:49:49] <wikibugs>	 (PS2) Elukey: role::deployment_server::kubernetes: limit Docker concurrent uploads [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251)
[06:52:28] <wikibugs>	 (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5203/co" [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[06:52:41] <wikibugs>	 (CR) Elukey: role::deployment_server::kubernetes: limit Docker concurrent uploads [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[06:53:47] <wikibugs>	 (CR) Elukey: "I think it is a reasonable test to do, we can easily revert in case it is too slow or not suitable for scap use cases." [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[06:54:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3007.esams.wmnet
[06:55:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:57:05] <wikibugs>	 (PS4) Slyngshede: Release version 0.1.9 [software/bitu] - https://gerrit.wikimedia.org/r/1133357
[06:57:18] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti3007 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133741 (owner: Muehlenhoff)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:00:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3007.esams.wmnet
[07:01:14] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179)
[07:02:08] <wikibugs>	 (CR) Slyngshede: [C:+2] Release version 0.1.9 [software/bitu] - https://gerrit.wikimedia.org/r/1133357 (owner: Slyngshede)
[07:02:49] <wikibugs>	 (PS2) Elukey: services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1132039 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:02:49] <wikibugs>	 (PS3) Elukey: services: update codfw changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:02:49] <wikibugs>	 (PS3) Elukey: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:04:58] <wikibugs>	 (Merged) jenkins-bot: Release version 0.1.9 [software/bitu] - https://gerrit.wikimedia.org/r/1133357 (owner: Slyngshede)
[07:05:15] <wikibugs>	 (CR) Elukey: [C:+2] services: update eqiad changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1132039 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:05:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:07:10] <elukey>	 jouncebot: next
[07:07:10] <jouncebot>	 In 0 hour(s) and 52 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0800)
[07:07:19] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[07:07:55] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[07:08:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3007.esams.wmnet
[07:08:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3007.esams.wmnet
[07:08:49] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] "LGTM, worthy of a test" [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[07:09:00] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ganeti3007:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:09:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706674 (phaultfinder)
[07:10:07] <wikibugs>	 (CR) Vgutierrez: [C:+1] "it looks like you forgot to push the 9.2.10 tag:" [debs/trafficserver] - https://gerrit.wikimedia.org/r/1133553 (https://phabricator.wikimedia.org/T390912) (owner: Ssingh)
[07:10:37] <wikibugs>	 (PS1) Marostegui: installserver: Do not reimage db2241 [puppet] - https://gerrit.wikimedia.org/r/1133744
[07:12:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:12:25] <wikibugs>	 (CR) Elukey: [C:+2] role::deployment_server::kubernetes: limit Docker concurrent uploads [puppet] - https://gerrit.wikimedia.org/r/1133740 (https://phabricator.wikimedia.org/T390251) (owner: Elukey)
[07:12:56] <wikibugs>	 (CR) Marostegui: [C:+2] installserver: Do not reimage db2241 [puppet] - https://gerrit.wikimedia.org/r/1133744 (owner: Marostegui)
[07:13:39] <wikibugs>	 (PS1) Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854)
[07:14:58] <wikibugs>	 (CR) CI reject: [V:-1] admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: Alexandros Kosiaris)
[07:18:07] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti3008 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133749
[07:22:27] <elukey>	 !log restart docker on deploy1003 to pick up max-concurrent-uploads=1 - T390251
[07:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:30] <stashbot>	 T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251
[07:24:57] <wikibugs>	 (CR) DCausse: [C:+2] "thanks!" [alerts] - https://gerrit.wikimedia.org/r/1133556 (owner: Ryan Kemper)
[07:26:16] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: enable TLS on volatile storage in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1133405 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[07:26:30] <wikibugs>	 (Merged) jenkins-bot: ElevatedMaxLagWDQS: operate only on wdqs traffic [alerts] - https://gerrit.wikimedia.org/r/1133556 (owner: Ryan Kemper)
[07:27:17] <fabfur>	 !log disabling puppet on A:cp-ulsfo to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133405 (T384227)
[07:27:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:20] <stashbot>	 T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[07:27:24] <wikibugs>	 (PS2) Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854)
[07:27:48] <wikibugs>	 (CR) Joely Rooke WMDE: [C:+1] "Ready for BACON I think!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) (owner: Seanleong-wmde)
[07:28:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3008.esams.wmnet
[07:31:18] <fabfur>	 !log applying patch to use TLS on tmpfs on A:cp-ulsfo (T384227)
[07:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:30] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] admin-ng/mlserve: Remove ratelimit in istio sidecar [deployment-charts] - https://gerrit.wikimedia.org/r/1133381 (https://phabricator.wikimedia.org/T388817) (owner: Klausman)
[07:32:59] <wikibugs>	 (CR) CI reject: [V:-1] admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: Alexandros Kosiaris)
[07:33:08] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti3008 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133749 (owner: Muehlenhoff)
[07:36:34] <wikibugs>	 (PS2) Kevin Bazira: ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179)
[07:36:58] <moritzm>	 !log added spiderpig-access LDAP group T390338
[07:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:00] <stashbot>	 T390338: Create 'spiderpig-access' ldap group - https://phabricator.wikimedia.org/T390338
[07:37:33] <wikibugs>	 (PS2) Muehlenhoff: Add a canonical list of sensitive LDAP groups [puppet] - https://gerrit.wikimedia.org/r/1133325
[07:38:07] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet
[07:38:14] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet
[07:38:36] <wikibugs>	 (PS1) Volans: dnsdisc: make it compatible with bookworm [software/spicerack] - https://gerrit.wikimedia.org/r/1133808
[07:39:44] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[07:40:54] <wikibugs>	 (PS1) Muehlenhoff: Bitu: Add approval role for spiderpig-access LDAP group [puppet] - https://gerrit.wikimedia.org/r/1133810 (https://phabricator.wikimedia.org/T390338)
[07:41:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3008.esams.wmnet
[07:41:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add a canonical list of sensitive LDAP groups [puppet] - https://gerrit.wikimedia.org/r/1133325 (owner: Muehlenhoff)
[07:42:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:44:01] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet
[07:44:59] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet
[07:47:03] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[07:47:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3008.esams.wmnet
[07:47:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3008.esams.wmnet
[07:49:00] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ganeti3008:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:54:07] <wikibugs>	 (PS1) Alexandros Kosiaris: Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812
[07:54:27] <wikibugs>	 (PS2) Alexandros Kosiaris: Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812
[07:54:32] <wikibugs>	 (CR) CI reject: [V:-1] Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812 (owner: Alexandros Kosiaris)
[07:54:49] <moritzm>	 !log failover ganeti masters in esams to ganeti3007/3008
[07:54:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:26] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[07:57:34] <wikibugs>	 (PS3) Alexandros Kosiaris: Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812 (https://phabricator.wikimedia.org/T384944)
[07:57:53] <wikibugs>	 (Merged) jenkins-bot: ml-services: update RRLA prod config [deployment-charts] - https://gerrit.wikimedia.org/r/1133742 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[08:00:05] <jouncebot>	 dancy and andre: MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T0800). Please do the needful.
[08:00:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706791 (phaultfinder)
[08:02:16] <wikibugs>	 (CR) Slyngshede: [C:+1] "neat" [puppet] - https://gerrit.wikimedia.org/r/1133810 (https://phabricator.wikimedia.org/T390338) (owner: Muehlenhoff)
[08:04:19] <wikibugs>	 (CR) Vgutierrez: [C:+1] Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - https://gerrit.wikimedia.org/r/1133812 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[08:04:54] <wikibugs>	 (PS3) Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854)
[08:05:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3005.esams.wmnet
[08:05:15] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti3005 to nftables [puppet] - https://gerrit.wikimedia.org/r/1133814
[08:06:09] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:07:03] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[08:09:36] <wikibugs>	 (03PS1) 10Slyngshede: IDM: upgrade to Bitu version 0.1.9 [dns] - 10https://gerrit.wikimedia.org/r/1133815
[08:10:32] <wikibugs>	 (03CR) 10Elukey: admin_ng: Preserve Server header in ingressgateway (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: 10Alexandros Kosiaris)
[08:12:05] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] IDM: upgrade to Bitu version 0.1.9 [dns] - 10https://gerrit.wikimedia.org/r/1133815 (owner: 10Slyngshede)
[08:12:15] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[08:12:21] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:16:49] <wikibugs>	 06SRE: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10706825 (10MoritzMuehlenhoff) p:05Triage→03Medium
[08:18:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti3005 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133814 (owner: 10Muehlenhoff)
[08:18:19] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[08:19:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: 10Alexandros Kosiaris)
[08:19:31] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854)
[08:20:06] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: update codfw changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126216 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz)
[08:20:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] Revert^4 "cache::backend: Switch mw-wikifunctions to ingress" [puppet] - 10https://gerrit.wikimedia.org/r/1133812 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[08:20:39] <elukey>	 jouncebot: next
[08:20:39] <jouncebot>	 In 1 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1000)
[08:20:42] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[08:21:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3005.esams.wmnet
[08:21:49] <hashar>	 !log Upgrading CI Jenkins
[08:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:48] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[08:23:54] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: allocate all role prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/1133817
[08:24:07] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[08:24:48] <wikibugs>	 (03PS2) 10Volans: dnsdisc: make it compatible with bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380)
[08:24:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: allocate all role prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/1133817 (owner: 10Filippo Giunchedi)
[08:25:31] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: enable inference batching for requests in edit-check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133364 (https://phabricator.wikimedia.org/T386100) (owner: 10Ilias Sarantopoulos)
[08:26:56] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: enable inference batching for requests in edit-check [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133364 (https://phabricator.wikimedia.org/T386100) (owner: 10Ilias Sarantopoulos)
[08:26:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Bitu: Add approval role for spiderpig-access LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1133810 (https://phabricator.wikimedia.org/T390338) (owner: 10Muehlenhoff)
[08:28:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3005.esams.wmnet
[08:29:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3005.esams.wmnet
[08:31:52] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: wikifunctions: Move to lvs_setup, disabling paging [puppet] - 10https://gerrit.wikimedia.org/r/1133821 (https://phabricator.wikimedia.org/T384944)
[08:32:12] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ganeti3005:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:37:12] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[08:39:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10706882 (10phaultfinder)
[08:41:50] <wikibugs>	 (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: render a test config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133482 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis)
[08:42:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti3006.esams.wmnet
[08:43:10] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: render a test config file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133482 (https://phabricator.wikimedia.org/T388378) (owner: 10Btullis)
[08:44:48] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti3006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133830
[08:45:56] <logmsgbot>	 !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided)
[08:46:48] <logmsgbot>	 !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) (duration: 00m 54s)
[08:47:52] <logmsgbot>	 !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided)
[08:48:54] <logmsgbot>	 !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@c274545] (releasing): (no justification provided) (duration: 01m 03s)
[08:50:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10706910 (10phaultfinder)
[08:50:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10706911 (10phaultfinder)
[08:52:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti3006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133830 (owner: 10Muehlenhoff)
[08:53:19] <fabfur>	 !log secure deleting certificates in /etc/ssl/private from A:cp-magru (T384227)
[08:53:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:22] <stashbot>	 T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[08:56:56] <wikibugs>	 (03PS1) 10Joal: Update GobblinLastSuccessfulRunTooLongAgo [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177)
[08:58:30] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[08:58:51] <wikibugs>	 (03PS1) 10Elukey: profile::service_proxy::envoy: add data-gateway-staging [puppet] - 10https://gerrit.wikimedia.org/r/1133848
[08:59:29] <_joe_>	 here
[09:00:15] <_joe_>	 XioNoX / topranks are you doing anything with that switch?
[09:00:19] <fabfur>	 what's happening ?
[09:00:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3006.esams.wmnet
[09:00:38] <_joe_>	 fabfur: just excessive network traffic
[09:00:42] <_joe_>	 !incidents
[09:00:43] <sirenbot>	 5939 (ACKED)  Host pfw1-eqiad - PING  - Packet loss = 100%
[09:00:43] <sirenbot>	 5945 (UNACKED)  Primary outbound port utilisation over 80%  (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet)
[09:00:43] <sirenbot>	 5944 (RESOLVED)  [3x] ProbeDown sre (ip4 ncredir-https:443 probes/service http_ncredir-https_ip4)
[09:00:43] <sirenbot>	 5942 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet)
[09:00:43] <sirenbot>	 5943 (RESOLVED)  [2x] Primary inbound port utilisation over 80%  (paged) network noc ()
[09:00:56] <_joe_>	 !ack 5945
[09:00:57] <sirenbot>	 5945 (ACKED)  Primary outbound port utilisation over 80%  (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet)
[09:01:02] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: 10Alexandros Kosiaris)
[09:01:17] <wikibugs>	 (03CR) 10Brouberol: Update GobblinLastSuccessfulRunTooLongAgo (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal)
[09:01:38] <topranks>	 _joe_: no, and ar zel not working today
[09:01:43] * topranks looking
[09:01:46] <_joe_>	 oh ok
[09:01:48] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5204/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey)
[09:01:53] <_joe_>	 yeah I'm in librenms 
[09:02:11] <wikibugs>	 (03PS2) 10Joal: Update GobblinLastSuccessfulRunTooLongAgo [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177)
[09:02:30] <wikibugs>	 (03CR) 10Joal: Update GobblinLastSuccessfulRunTooLongAgo (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal)
[09:02:32] <elukey>	 _joe_ it may be a big analytics job running on hadoop, it happened in the past
[09:02:41] <topranks>	 pfw1-eqiad we had problems with yesterday 
[09:02:51] <_joe_>	 yeah I would think that's the case, heh
[09:03:24] <fabfur>	 !log secure deleting certificates in /etc/ssl/private from A:cp-ulsfo (T384227)
[09:03:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:26] <stashbot>	 T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[09:03:30] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[09:03:45] <_joe_>	 heh resolved while I was investigating
[09:04:31] <topranks>	 it's analytics traffic 
[09:04:32] <topranks>	 https://grafana.wikimedia.org/goto/DrnxVR0HR?orgId=1
[09:04:47] <elukey>	 I found https://yarn.wikimedia.org/proxy/application_1741864027385_464026/ that could be the culprit, not 100% sure though
[09:04:52] <elukey>	 the job is really huge
[09:05:44] <elukey>	 joal: o/
[09:06:00] <elukey>	 if you have a moment, we got a page for a switch link almost saturated (10G)
[09:06:15] <elukey>	 nothing broken atm, but I am wondering if there is a huge job that runs on hadoop
[09:06:29] <elukey>	 it may also be somebody fetching data from presto, from what Cathal found
[09:06:51] <elukey>	 I noticed https://yarn.wikimedia.org/proxy/application_1741864027385_464026/ that is big, but you know best :)
[09:07:18] <topranks>	 that's data being fetched from an-workers (running hadoop), going towards presto afaik 
[09:08:00] <topranks>	 one help is we have that profiled in qos now, so I can see it didn't squeeze out other data on the links where we have stats 
[09:08:01] <topranks>	 https://grafana.wikimedia.org/goto/7RH8VR0NR?orgId=1
[09:08:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3006.esams.wmnet
[09:08:10] <topranks>	 (no stats from the asw2 devices as they don't export this for us)
[09:08:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti3006.esams.wmnet
[09:09:00] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ganeti3006:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:09:15] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "Hey folks! Lemme know if it is something that could work, I am not 100% sure, it seems the first of its kind (but it could be useful in th" [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey)
[09:10:20] <elukey>	 topranks: ah we have qos for hadoop workers now? 
[09:10:40] <topranks>	 yeah we added an iptables rule last week to de-prioritise it 
[09:11:00] <topranks>	 that doesn't mean it can't push the usage on a link to maximum 
[09:11:10] <topranks>	 but it does mean when that happens the other traffic gets priority 
[09:11:30] <elukey>	 via iptables, interesting
[09:11:31] <topranks>	 so link maxed out, but hopefully traffic for other services unaffected, or at least impact mitigated significantly 
[09:11:50] <topranks>	 iptables just does the marking of the packets on the host, the network then treats them differently 
[09:12:56] <elukey>	 okok, totally ignorant about how it is implemented, I'll check it later
[09:15:16] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[09:15:25] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[09:18:18] <wikibugs>	 (03PS1) 10MVernon: install-server: also run configure_swift_disks for apus-* [puppet] - 10https://gerrit.wikimedia.org/r/1133849 (https://phabricator.wikimedia.org/T390578)
[09:20:20] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] install-server: also run configure_swift_disks for apus-* [puppet] - 10https://gerrit.wikimedia.org/r/1133849 (https://phabricator.wikimedia.org/T390578) (owner: 10MVernon)
[09:21:21] <wikibugs>	 (03PS1) 10Fabfur: hiera: enable TLS on volatile storage in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1133850 (https://phabricator.wikimedia.org/T384227)
[09:21:25] <wikibugs>	 (03CR) 10MVernon: [C:03+2] install-server: also run configure_swift_disks for apus-* [puppet] - 10https://gerrit.wikimedia.org/r/1133849 (https://phabricator.wikimedia.org/T390578) (owner: 10MVernon)
[09:22:23] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133850 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur)
[09:24:42] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Update GobblinLastSuccessfulRunTooLongAgo [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal)
[09:25:58] <wikibugs>	 (03Merged) 10jenkins-bot: Update GobblinLastSuccessfulRunTooLongAgo [alerts] - 10https://gerrit.wikimedia.org/r/1133847 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal)
[09:26:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Add spiderpig-access to list of sensitive LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1133852 (https://phabricator.wikimedia.org/T390338)
[09:27:12] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:37:12] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[09:37:12] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:39:14] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] wmflib,liberica: Add support for DNS healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1129326 (https://phabricator.wikimedia.org/T389211) (owner: 10Vgutierrez)
[09:42:12] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:43:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: 10Alexandros Kosiaris)
[09:45:25] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] Add spiderpig-access to list of sensitive LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1133852 (https://phabricator.wikimedia.org/T390338) (owner: 10Muehlenhoff)
[09:45:36] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab_runner: increase job output_limit to 20MB [puppet] - 10https://gerrit.wikimedia.org/r/1133316 (https://phabricator.wikimedia.org/T390816) (owner: 10Jelto)
[09:48:39] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Preserve Server header in ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133745 (https://phabricator.wikimedia.org/T390854) (owner: 10Alexandros Kosiaris)
[09:49:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10707127 (10phaultfinder)
[09:51:10] <wikibugs>	 (03PS1) 10Stevemunene: hdfs: Remove disk space checks for hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/1133853 (https://phabricator.wikimedia.org/T390875)
[09:51:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:51:41] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[09:51:44] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[09:51:55] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:52:08] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:52:14] <godog>	 !log lvextend --resizefs --size +1TB vg0/srv on mwlog[12]002
[09:52:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:35] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:52:52] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:52:54] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: enable TLS on volatile storage in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1133850 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur)
[09:54:51] <akosiaris>	 !log deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1133745 in all k8s ingresses to stop ingressgateway from forcefully setting the HTTP server header in the responses to "istio-envoy"
[09:54:52] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[09:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:54] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[09:56:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:58:59] <fabfur>	 !log applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133850 to use TLS on tmpfs on A:cp-eqsin (T384227)
[09:59:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:02] <stashbot>	 T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[09:59:40] <fabfur>	 !log disable puppet on A:cp-eqsin
[09:59:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:49] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[09:59:54] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1000)
[10:00:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add spiderpig-access to list of sensitive LDAP groups [puppet] - 10https://gerrit.wikimedia.org/r/1133852 (https://phabricator.wikimedia.org/T390338) (owner: 10Muehlenhoff)
[10:02:14] <logmsgbot>	 !log fabfur@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on cp4047.ulsfo.wmnet with reason: HW errors
[10:02:19] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10707144 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=93385548-b505-4318-a69f-9b083dad822a) set by fabfur@cumin1002 for 15 days, 0:00:00 on 1 host(s) and their services with reason...
[10:02:50] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: admin_ng: Fix indentation of EnvoyFilter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133855
[10:04:12] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: enable TLS on volatile storage in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1133850 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur)
[10:05:34] <wikibugs>	 (03PS1) 10Muehlenhoff: Default the ganeti role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133856 (https://phabricator.wikimedia.org/T389178)
[10:08:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] admin_ng: Fix indentation of EnvoyFilter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133855 (owner: 10Alexandros Kosiaris)
[10:10:14] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host apus-fe2003.codfw.wmnet with OS bookworm
[10:10:21] <wikibugs>	 (03CR) 10Superpes15: "Couldn't you add itwiki in the same patch to avoid double work? Being a very light change in the code there shouldn't be any issue imho" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[10:10:23] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10707179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host apus-fe2003.codfw.wmnet with OS bookworm
[10:13:40] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Fix indentation of EnvoyFilter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133855 (owner: 10Alexandros Kosiaris)
[10:14:34] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[10:14:51] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[10:16:23] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[10:16:37] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[10:17:07] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[10:17:16] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:17:42] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[10:17:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Default the ganeti role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133856 (https://phabricator.wikimedia.org/T389178) (owner: 10Muehlenhoff)
[10:17:48] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[10:18:42] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:18:48] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:20:27] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: fix prometheus instances_override [puppet] - 10https://gerrit.wikimedia.org/r/1133859
[10:20:50] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[10:20:55] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[10:21:17] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[10:21:20] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[10:22:07] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:22:10] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:22:25] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[10:22:30] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[10:22:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: fix prometheus instances_override [puppet] - 10https://gerrit.wikimedia.org/r/1133859 (owner: 10Filippo Giunchedi)
[10:25:01] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10707237 (10cmooney)
[10:25:55] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: add profile::gitlab::runner::output_limit to wmcs projects [puppet] - 10https://gerrit.wikimedia.org/r/1133860 (https://phabricator.wikimedia.org/T390816)
[10:26:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: hdfs: Remove disk space checks for hadoop worker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133853 (https://phabricator.wikimedia.org/T390875) (owner: 10Stevemunene)
[10:27:00] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on apus-fe2003.codfw.wmnet with reason: host reimage
[10:27:55] <wikibugs>	 (03CR) 10Jelto: gitlab_runner: add profile::gitlab::runner::output_limit to wmcs projects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133860 (https://phabricator.wikimedia.org/T390816) (owner: 10Jelto)
[10:30:59] <wikibugs>	 (03Abandoned) 10Jelto: gitlab_runner: add profile::gitlab::runner::output_limit to wmcs projects [puppet] - 10https://gerrit.wikimedia.org/r/1133860 (https://phabricator.wikimedia.org/T390816) (owner: 10Jelto)
[10:32:06] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apus-fe2003.codfw.wmnet with reason: host reimage
[10:35:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10707267 (10cmooney) >>! In T374614#10147994, @ayounsi wrote: > Short term I think if you add `[4Gbps]` to the interface description, LibreNMS will [[ https://docs...
[10:38:51] <wikibugs>	 (03PS1) 10Ladsgroup: Bump thumbnail steps to 65% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133862 (https://phabricator.wikimedia.org/T360589)
[10:40:10] <Amir1>	 jouncebot: nowandnext
[10:40:10] <jouncebot>	 For the next 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1000)
[10:40:10] <jouncebot>	 In 1 hour(s) and 19 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1200)
[10:44:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133862 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:45:32] <wikibugs>	 (03Merged) 10jenkins-bot: Bump thumbnail steps to 65% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133862 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:45:50] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10Observability-Alerting: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#10707299 (10cmooney)
[10:46:10] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1133862|Bump thumbnail steps to 65% (T360589)]]
[10:46:13] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:48:41] <moritzm>	 !log remove nodejs from aqs* hosts, no longer used/needed and spares us needless security rollouts T350143
[10:48:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:43] <stashbot>	 T350143: Write AQS 1 deprecation announcement - https://phabricator.wikimedia.org/T350143
[10:50:00] <topranks>	 !log drain transport circuits to eqord (Chicago network pop) to prep for Junos upgrade cr2-eqord T364092
[10:50:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:03] <stashbot>	 T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092
[10:51:20] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002"
[10:51:55] <wikibugs>	 (03PS12) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389
[10:53:18] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1133862|Bump thumbnail steps to 65% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:53:21] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:54:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002"
[10:54:51] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host apus-fe2003.codfw.wmnet with OS bookworm
[10:54:58] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10707330 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host apus-fe2003.codfw.wmnet with OS bookworm completed: - apus-fe2003 (**PA...
[10:55:03] <wikibugs>	 (03PS5) 10Tiziano Fogli: ripe atlas anchors: icmp to http check [puppet] - 10https://gerrit.wikimedia.org/r/1127552 (https://phabricator.wikimedia.org/T388419)
[10:55:07] <joal>	 sorry elukey, I was AFK when you pinged
[10:55:19] <joal>	 I read the backlog and it was presto again IIUC
[10:55:52] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[10:56:13] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10707332 (10MatthewVernon) 05Open→03Resolved OK, this is fixed, sorry about that (I'd done most of the necessary preseed changes, but had missed one).
[10:57:49] <wikibugs>	 (03PS6) 10Tiziano Fogli: ripe atlas anchors: icmp to http check [puppet] - 10https://gerrit.wikimedia.org/r/1127552 (https://phabricator.wikimedia.org/T388419)
[10:58:06] <elukey>	 joal: np! I was wondering if any big job was ongoing, or if somebody was querying data..
[10:58:58] <joal>	 It's not the first time we've had issues with presto. It's mostly due to people querying datasets that are too big. We (DPE) need to be better at not making those datasets available...
[11:00:56] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10707339 (10elukey) @Papaul @Jhancock.wm is it worth to perform another swap test like in T388684 to see if the controller does its job...
[11:02:26] <elukey>	 joal: is there any way to track if a query is being executed?
[11:02:30] <elukey>	 on presto I mean
[11:02:45] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133862|Bump thumbnail steps to 65% (T360589)]] (duration: 16m 34s)
[11:02:47] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[11:03:26] <wikibugs>	 (03PS3) 10Hnowlan: jobrnuner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791)
[11:03:58] <joal>	 elukey: https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1
[11:04:51] <joal>	 elukey: I also posted a message for the DE team to discuss possible solutions soon (rather than late or never :)
[11:05:09] <wikibugs>	 (03PS1) 10Clément Goubert: mw::periodic_jobs: Pass command through untouched [puppet] - 10https://gerrit.wikimedia.org/r/1133864 (https://phabricator.wikimedia.org/T341555)
[11:05:29] <joal>	 When there are spikes on the presto graph, you need to tunnel into the presto coord to get access to the UI to monitor what's running (https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Presto/Administration#View_the_Presto_UI)
[11:05:31] <wikibugs>	 (03PS4) 10Hnowlan: jobrunner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791)
[11:05:32] <wikibugs>	 (03PS1) 10Clément Goubert: mediawiki: Fix mwcron command invocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133865 (https://phabricator.wikimedia.org/T341555)
[11:05:44] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133864 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[11:06:55] <topranks>	 !log prepend AS paths announced to codfw/eqiad from eqord to prep for JunOS upgrade T364092
[11:06:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:58] <stashbot>	 T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092
[11:07:32] <moritzm>	 !log installing nodejs security updates
[11:07:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:02] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Temporary debugging code for T389728 [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868
[11:09:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868 (owner: 10Bartosz Dziewoński)
[11:12:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:14:25] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707360 (10Jelto)
[11:14:37] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707361 (10Jelto) 05Open→03In progress p:05Triage...
[11:16:54] <wikibugs>	 (03PS1) 10Clément Goubert: mwcron: Import all periodic_jobs resources [puppet] - 10https://gerrit.wikimedia.org/r/1133872 (https://phabricator.wikimedia.org/T341555)
[11:17:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Temporary debugging code for T389728 [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868 (owner: 10Bartosz Dziewoński)
[11:23:42] <wikibugs>	 (03PS2) 10Muehlenhoff: Create insetup role for WMCS with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825)
[11:27:25] <wikibugs>	 (03PS2) 10Clément Goubert: mwcron: Import all periodic_jobs resources [puppet] - 10https://gerrit.wikimedia.org/r/1133872 (https://phabricator.wikimedia.org/T341555)
[11:28:01] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] jobrunner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[11:29:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10707422 (10phaultfinder)
[11:29:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10707423 (10phaultfinder)
[11:30:34] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr2-codfw,cr2-eqiad,cr2-eqord,cr2-eqord IPv6,cr3-ulsfo with reason: Upgrade cr2-eqord JunOS
[11:31:32] <topranks>	 !log disable EBGP sessions to internet peers on cr2-eqord to prep for JunOS upgrade T364092
[11:31:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:35] <stashbot>	 T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092
[11:33:17] <topranks>	 !log reboot cr2-eqord to complete JunOS upgrade T364092
[11:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:01] <moritzm>	 !log installing Python 3.9 security updates
[11:37:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:03] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: wikifunctions: Disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1133821 (https://phabricator.wikimedia.org/T384944)
[11:43:39] <jinxer-wm>	 FIRING: CoreBGPDown: Core BGP session down between cr4-ulsfo and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr4-ulsfo:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[11:44:00] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mw-wikifunctions: Switch DNS to use ingress [dns] - 10https://gerrit.wikimedia.org/r/1133878 (https://phabricator.wikimedia.org/T384944)
[11:46:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: jobrunner: reimage the three remaining eqiad in-warranty jobrunners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[11:46:48] <moritzm>	 !log installing Django security updates on Bullseye
[11:46:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:39] <jinxer-wm>	 FIRING: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[11:48:41] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707471 (10Jelto) a:03Jelto This need approval from:...
[11:48:56] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707473 (10Jelto)
[11:50:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1001.eqiad.wmnet
[11:52:03] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[11:52:44] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Temporary debugging code for T389728 [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868
[11:53:39] <jinxer-wm>	 RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[11:56:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1001.eqiad.wmnet
[11:58:05] <moritzm>	 !log installing Intel microcode security updates
[11:58:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1200)
[12:04:12] <wikibugs>	 (03PS3) 10Majavah: dynamicproxy: Add dependency on acme-chief cert [puppet] - 10https://gerrit.wikimedia.org/r/1133448
[12:05:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10707558 (10phaultfinder)
[12:05:41] <wikibugs>	 (03CR) 10Majavah: [C:03+2] dynamicproxy: Add dependency on acme-chief cert [puppet] - 10https://gerrit.wikimedia.org/r/1133448 (owner: 10Majavah)
[12:06:54] <wikibugs>	 (03PS7) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[12:07:08] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mediawiki: Fix mwcron command invocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133865 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[12:08:29] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mw::periodic_jobs: Pass command through untouched [puppet] - 10https://gerrit.wikimedia.org/r/1133864 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[12:11:07] <wikibugs>	 (03CR) 10Klausman: [C:03+2] admin-ng/mlserve: Remove ratelimit in istio sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133381 (https://phabricator.wikimedia.org/T388817) (owner: 10Klausman)
[12:12:03] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[12:13:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[12:13:30] <wikibugs>	 (03PS1) 10Volans: .wmfconfig: add Debian bookworm build [software/cumin] - 10https://gerrit.wikimedia.org/r/1133884
[12:13:31] <wikibugs>	 (03PS1) 10Volans: cli: fine-tune CLI logging [software/cumin] - 10https://gerrit.wikimedia.org/r/1133885
[12:14:04] <wikibugs>	 (03PS1) 10Volans: logging: rotate files [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886
[12:16:04] <wikibugs>	 (03Merged) 10jenkins-bot: admin-ng/mlserve: Remove ratelimit in istio sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133381 (https://phabricator.wikimedia.org/T388817) (owner: 10Klausman)
[12:16:47] <moritzm>	 !log installing libxslt security updates
[12:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:51] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707587 (10isarantopoulos) I approve
[12:19:19] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mediawiki: Fix mwcron command invocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133865 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[12:21:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-test
[12:22:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-test
[12:23:23] <wikibugs>	 (03PS1) 10Majavah: openstack: wikireplica_dns: Alias upcoming x3 cluster to s8 [puppet] - 10https://gerrit.wikimedia.org/r/1133892 (https://phabricator.wikimedia.org/T390954)
[12:24:22] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[12:25:27] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[12:28:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all
[12:30:29] <wikibugs>	 (03CR) 10Federico Ceratto: "This should be ready for final review." [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[12:33:36] <wikibugs>	 (03CR) 10Marostegui: [C:04-1] "Please see my previous comment about downtime" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[12:34:37] <wikibugs>	 (03PS1) 10Fabfur: hiera: enable TLS on volatile storage in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1133897 (https://phabricator.wikimedia.org/T384227)
[12:35:21] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133897 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur)
[12:37:12] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[12:42:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all
[12:43:03] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707701 (10Jelto)
[12:43:33] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10707702 (10Jhancock.wm) All good! thank you for your help!
[12:45:31] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10707705 (10cmooney)
[12:45:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10707706 (10phaultfinder)
[12:46:04] <wikibugs>	 (03PS8) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[12:47:49] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[12:48:54] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[12:48:59] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] openstack: wikireplica_dns: Alias upcoming x3 cluster to s8 [puppet] - 10https://gerrit.wikimedia.org/r/1133892 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah)
[12:49:14] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: mw-wikifunctions: Switch DNS to use ingress [dns] - 10https://gerrit.wikimedia.org/r/1133878 (https://phabricator.wikimedia.org/T384944)
[12:49:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10707716 (10phaultfinder)
[12:50:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] wikifunctions: Disable paging [puppet] - 10https://gerrit.wikimedia.org/r/1133821 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[12:52:40] <wikibugs>	 (03PS1) 10Jelto: admin: add ozge shell user and groups [puppet] - 10https://gerrit.wikimedia.org/r/1133900 (https://phabricator.wikimedia.org/T390855)
[12:52:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] testreduce: Select the custom nginx provider with no additional modules [puppet] - 10https://gerrit.wikimedia.org/r/1129878 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff)
[12:53:26] <godog>	 jouncebot: now and next
[12:53:26] <jouncebot>	 For the next 0 hour(s) and 6 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1200)
[12:53:38] <elukey>	 joal: ack thanks!
[12:53:51] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[12:53:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public
[12:54:25] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[12:55:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public
[12:55:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: move k8s prometheus1006 -> 1008 [puppet] - 10https://gerrit.wikimedia.org/r/1131302 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[12:55:58] <godog>	 !log move k8s instances from prometheus1006 to prometheus1008 - T383232
[12:56:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:01] <stashbot>	 T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232
[12:56:40] <wikibugs>	 (03CR) 10Hnowlan: jobrunner: reimage the three remaining eqiad in-warranty jobrunners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[12:56:50] <moritzm>	 !log prune now obsolete nginx packages from testreduce1002 T329529
[12:56:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:53] <stashbot>	 T329529: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529
[12:57:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529#10707755 (10MoritzMuehlenhoff)
[12:58:04] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team, 13Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10707756 (10Jelto) I reached out...
[12:58:32] <wikibugs>	 (03CR) 10Jelto: [C:04-1] "approval from @tcipriani is still needed" [puppet] - 10https://gerrit.wikimedia.org/r/1133900 (https://phabricator.wikimedia.org/T390855) (owner: 10Jelto)
[12:59:19] <wikibugs>	 (03CR) 10Ozge: "looks great! thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1133900 (https://phabricator.wikimedia.org/T390855) (owner: 10Jelto)
[12:59:51] <wikibugs>	 (03CR) 10Ozge: [C:03+1] admin: add ozge shell user and groups [puppet] - 10https://gerrit.wikimedia.org/r/1133900 (https://phabricator.wikimedia.org/T390855) (owner: 10Jelto)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1300)
[13:00:05] <jouncebot>	 ihurbain, cscott, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:12] <ihurbain>	 o/
[13:00:16] <Lucas_WMDE>	 o/
[13:00:19] <cscott>	 o/
[13:20:57] <wikibugs>	 (03CR) 10Majavah: [C:03+2] openstack: wikireplica_dns: Alias upcoming x3 cluster to s8 [puppet] - 10https://gerrit.wikimedia.org/r/1133892 (https://phabricator.wikimedia.org/T390954) (owner: 10Majavah)
[13:23:19] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add mediawiki-common to mw-cron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133902
[13:23:22] <papaul>	 win 5
[13:25:13] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: cleanup k8s instances from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232)
[13:25:14] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232)
[13:25:33] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mw-wikifunctions: Switch DNS to use ingress [dns] - 10https://gerrit.wikimedia.org/r/1133878 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[13:25:46] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[13:25:48] <logmsgbot>	 !log akosiaris@dns1004 START - running authdns-update
[13:25:49] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[13:27:13] <logmsgbot>	 !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133113|Enable Parsoid Read Views on 13 wiktionaries (T390680)]], [[gerrit:1133141|Enable Parsoid Read Views to incubator and dagwiki mobile frontend (T380768 T381002)]] (duration: 19m 40s)
[13:27:17] <stashbot>	 T390680: Wiktionary deploy from April ~3rd 2025 - https://phabricator.wikimedia.org/T390680
[13:27:18] <stashbot>	 T380768: Deploy Parsoid Read Views to incubator  (week of ????-??-??) - https://phabricator.wikimedia.org/T380768
[13:27:18] <stashbot>	 T381002: Turn on Parsoid Read Views for Mobile Front End on dagwiki - https://phabricator.wikimedia.org/T381002
[13:27:39] <taavi>	 finally
[13:27:50] <ihurbain>	 woo!
[13:27:52] <ihurbain>	 thanks taavi :)
[13:28:10] <logmsgbot>	 !log akosiaris@dns1004 END - running authdns-update
[13:28:22] <logmsgbot>	 !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1133581|Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property (T390420)]]
[13:28:25] <stashbot>	 T390420: "indicator" tag not parsed properly - https://phabricator.wikimedia.org/T390420
[13:28:54] <cscott>	 i'm up, whee
[13:29:00] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:30:12] <logmsgbot>	 !log taavi@deploy1003 scap failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.44.0-wmf.22,1.44.0-wmf.23 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug --multiversion-cli-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.148.0 --label vnd.wikimedia.mediawiki.versions=1.44.0-wmf.22,1.44.0-wmf.23 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.148.0) (duration: 01m 49s)
[13:30:24] <taavi>	 huh, scap backport crashed
[13:30:38] <taavi>	     latest_mw_image = mw_images_by_flavour["publish"]["image"]
[13:30:38] <taavi>	 KeyError: 'publish'
[13:31:06] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Add mediawiki-common to mw-cron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133902 (owner: 10Giuseppe Lavagetto)
[13:31:12] <cscott>	 this is why i leave backporting to the professionals
[13:31:18] <taavi>	 claime: _joe_: ^ does that ring any bells?
[13:31:33] <claime>	 hmmm
[13:32:03] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[13:32:15] <taavi>	 ah, scrolling up reveals the true error
[13:32:15] <taavi>	 13:30:12 [mediawiki-publish-81] Err:1 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/component/php81 amd64 php8.1-tideways amd64 5.0.4-16+wmf11u1
[13:32:15] <taavi>	 13:30:12 [mediawiki-publish-81]   Could not connect to webproxy:8080 (208.80.154.74), connection timed out
[13:32:18] <taavi>	 let me retry
[13:32:56] <logmsgbot>	 !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1133581|Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property (T390420)]]
[13:34:00] <_joe_>	 maybe someone was restarting squid :)
[13:34:42] <taavi>	 already blaming someone else :-)
[13:34:43] <logmsgbot>	 !log taavi@deploy1003 scap failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.44.0-wmf.22,1.44.0-wmf.23 --multiversion-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --multiversion-debug-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug --multiversion-cli-image-name docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.148.0 --label vnd.wikimedia.mediawiki.versions=1.44.0-wmf.22,1.44.0-wmf.23 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080' returned non-zero exit status 1. (scap version: 4.148.0) (duration: 01m 46s)
[13:34:51] <taavi>	 it did it again
[13:35:56] <taavi>	 the active webproxy is install1004 which is maxing its cpu
[13:36:44] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: Add group{0,1,2} and pretrain releases in mw-api-int staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1115889 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[13:38:27] <taavi>	 !log install1004: kill a dead `/usr/bin/apt-mark showmanual` process holding up puppet runs
[13:38:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:19] <logmsgbot>	 !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1133581|Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property (T390420)]]
[13:39:21] <stashbot>	 T390420: "indicator" tag not parsed properly - https://phabricator.wikimedia.org/T390420
[13:39:31] <taavi>	 third time's the charm
[13:40:34] <wikibugs>	 (03CR) 10Ssingh: "Thanks, pushed the tag and updated the commit to fix the typo." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1133553 (https://phabricator.wikimedia.org/T390912) (owner: 10Ssingh)
[13:42:11] <wikibugs>	 (03CR) 10Clément Goubert: "One optional nit, otherwise I think that should work." [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey)
[13:42:12] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:16] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] profile::service_proxy::envoy: add data-gateway-staging [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey)
[13:44:31] <cscott>	 <fingers crossed>
[13:45:01] <moritzm>	 !log imported imposm3 0.14.1-1 to apt.wikimedia.org for bookworm-wikimedia T389780 T381565
[13:45:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:05] <stashbot>	 T389780: Build and import imposm 0.14.1 plus latest bugfix - https://phabricator.wikimedia.org/T389780
[13:45:05] <stashbot>	 T381565: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565
[13:45:16] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] jobrunner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[13:46:10] <taavi>	 cscott: please test
[13:46:21] <cscott>	 ok, thanks!
[13:46:46] <logmsgbot>	 !log taavi@deploy1003 cscott, taavi: Backport for [[gerrit:1133581|Parsoid Fragment Support v3: make mStripExtTags a persistent Parser property (T390420)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:46:49] <stashbot>	 T390420: "indicator" tag not parsed properly - https://phabricator.wikimedia.org/T390420
[13:46:51] <wikibugs>	 07SRE-Unowned, 10Maps: Build and import imposm 0.14.1 plus latest bugfix - https://phabricator.wikimedia.org/T389780#10708012 (10MoritzMuehlenhoff) 05Open→03Resolved The latest imposm release plus a cherrypick of @Jgiannelos' patch has been built as 0.14.1-1 and imported to apt.wikimedia.org
[13:47:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Reapply maps_bookworm role to maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1133915 (https://phabricator.wikimedia.org/T381565)
[13:49:01] <MatmaRex>	 taavi: do you have time for my patch, or should i reschedule?
[13:49:05] <taavi>	 jouncebot: next
[13:49:05] <jouncebot>	 In 1 hour(s) and 10 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500)
[13:49:07] <wikibugs>	 (03PS1) 10Jelto: Ceph: add types for S3 credential and account [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922)
[13:49:12] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 06serviceops: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10708019 (10ABran-WMF) a:03ABran-WMF
[13:49:24] <taavi>	 MatmaRex: there's nothing after the window so we should be fine
[13:49:33] <MatmaRex>	 ok. thanks
[13:50:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: "To be merged next week" [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[13:50:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: "To be merged next week" [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[13:50:28] <wikibugs>	 (03CR) 10Jelto: "sounds good to me! See Id8979165b96d737addc676f3abf3f088a48eda48." [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[13:50:34] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet
[13:50:41] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1044.eqiad.wmnet
[13:51:31] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1420 to wikikube-worker1166
[13:51:51] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[13:51:58] <wikibugs>	 (03PS1) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378)
[13:52:17] <cscott>	 taavi: still testing, thanks!
[13:54:10] <cscott>	 taavi: ok, looks good, ok to proceed
[13:54:28] <taavi>	 thanks
[13:54:29] <logmsgbot>	 !log taavi@deploy1003 cscott, taavi: Continuing with sync
[13:55:39] <logmsgbot>	 !log taavi@deploy1003 scap failed: <CalledProcessError> Command '['helmfile', '-e', 'eqiad', '--selector', 'name=main', 'write-values', '--output-file-template', '/tmp/tmp1ws3xaaw']' returned non-zero exit status 1. (scap version: 4.148.0) (duration: 16m 20s)
[13:55:57] <wikibugs>	 (03CR) 10FNegri: Create insetup role for WMCS with nftables and rename existing one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff)
[13:56:02] <taavi>	 claime: now it's failing with a helm values yaml parsing issue
[13:56:13] <claime>	 on mwcron?
[13:56:18] <taavi>	 https://phabricator.wikimedia.org/P74591
[13:56:19] <taavi>	 yeah
[13:56:22] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1044.eqiad.wmnet
[13:56:24] <claime>	 ffs
[13:56:38] <claime>	 lemme fix that
[13:57:22] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet
[13:58:19] <wikibugs>	 (03CR) 10Muehlenhoff: Create insetup role for WMCS with nftables and rename existing one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff)
[13:58:28] <wikibugs>	 (03PS1) 10Clément Goubert: mw::periodic_jobs: Fix serviceops test job [puppet] - 10https://gerrit.wikimedia.org/r/1133919
[13:58:44] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+2 C:03+2] mw::periodic_jobs: Fix serviceops test job [puppet] - 10https://gerrit.wikimedia.org/r/1133919 (owner: 10Clément Goubert)
[13:59:34] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] "Merging because no code change since last +1: fixed typo in commit message." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1133553 (https://phabricator.wikimedia.org/T390912) (owner: 10Ssingh)
[14:00:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10708076 (10phaultfinder)
[14:00:58] <wikibugs>	 (03PS1) 10Tiziano Fogli: jobrunner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1133922 (https://phabricator.wikimedia.org/T354791)
[14:01:24] <wikibugs>	 (03PS1) 10Majavah: P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923
[14:02:03] <claime>	 taavi: oh good catch
[14:02:03] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[14:02:05] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1420 to wikikube-worker1166 - hnowlan@cumin1002"
[14:02:09] <wikibugs>	 (03Abandoned) 10Tiziano Fogli: jobrunner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1133922 (https://phabricator.wikimedia.org/T354791) (owner: 10Tiziano Fogli)
[14:02:10] <wikibugs>	 (03CR) 10FNegri: [C:03+1] Create insetup role for WMCS with nftables and rename existing one (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff)
[14:02:18] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[14:02:28] <wikibugs>	 (03PS6) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610)
[14:02:56] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1420 to wikikube-worker1166 - hnowlan@cumin1002"
[14:02:56] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:02:57] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1166
[14:03:01] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[14:03:06] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[14:03:23] <claime>	 taavi: helmfile applies cleanly with my temp fix, you can proceed with scap
[14:03:44] <logmsgbot>	 !log taavi@deploy1003 Started scap sync-world: re-syncing 1133581
[14:03:52] <taavi>	 thanks! see also https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133923 to maybe avoid that in the future
[14:04:04] <wikibugs>	 (03PS2) 10Clément Goubert: P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah)
[14:04:04] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah)
[14:04:09] <taavi>	 already running a pcc
[14:04:40] <claime>	 ack
[14:04:58] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5205/console" [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah)
[14:05:03] <claime>	 taavi: yeah, that patch was the reason for my "good catch" earlier
[14:05:09] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1166
[14:05:17] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1420 to wikikube-worker1166
[14:05:36] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw1420 to wikikube-worker1166 completed: - mw1420 (**PASS**)   - ✔️ Down...
[14:06:13] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] "Thanks, good catch!" [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah)
[14:06:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah)
[14:06:24] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: add second canary for OpenSearch migration [puppet] - 10https://gerrit.wikimedia.org/r/1133551 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:06:36] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [puppet] - 10https://gerrit.wikimedia.org/r/1133551 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:06:38] <wikibugs>	 (03PS1) 10Tiziano Fogli: auth_metrics: add recording rules for grafana widgets [puppet] - 10https://gerrit.wikimedia.org/r/1133924 (https://phabricator.wikimedia.org/T390672)
[14:07:18] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[14:07:25] <taavi>	 huh why is that failing CI
[14:08:47] <wikibugs>	 (03PS3) 10Majavah: P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923
[14:09:17] <wikibugs>	 (03CR) 10Volans: upgrade.py: Depool, repool, update Phabricator (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[14:09:30] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Reapply maps_bookworm role to maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1133915 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:09:53] <claime>	 taavi: Host instead of Hosts, my fault
[14:10:00] <claime>	 The CI message is wrong though
[14:10:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[14:11:13] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:mediawiki: periodic_jobs: Fix string quoting for good [puppet] - 10https://gerrit.wikimedia.org/r/1133923 (owner: 10Majavah)
[14:11:19] <jinxer-wm>	 FIRING: CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-c8 (2620:0:861:fe0d::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-f4-eqiad:9804&var-bgp_group=prod_ebgp6&var-bgp_neighbor=cloudsw1-c8 - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[14:11:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Create insetup role for WMCS with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133422 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff)
[14:12:43] <logmsgbot>	 !log taavi@deploy1003 Finished scap sync-world: re-syncing 1133581 (duration: 08m 58s)
[14:12:45] <taavi>	 cscott: yours is finally live
[14:12:48] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Profile::Mediawiki_deployment: remove deprecated debug field [puppet] - 10https://gerrit.wikimedia.org/r/1131060 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French)
[14:12:48] <taavi>	 MatmaRex: still there?
[14:13:22] <MatmaRex>	 yeah
[14:13:43] <taavi>	 cool, sorry for the wait
[14:13:53] <taavi>	 your patch is live on mwdebug1001, lmk when you're done and i'll revert
[14:14:08] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] services: use the kafka svc endpoint for Tegola [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133142 (https://phabricator.wikimedia.org/T373115) (owner: 10Elukey)
[14:15:22] <wikibugs>	 (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[14:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:16:45] <MatmaRex>	 taavi: looking
[14:17:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610
[14:17:15] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[14:17:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Create insetup role for ServiceOps with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133927 (https://phabricator.wikimedia.org/T389825)
[14:18:19] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10708154 (10Jhancock.wm) i unfortunately cannot find a spare 8 TB drive. So we'd either need to try it with a 4 TB or source a disk.
[14:18:41] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test only - bking@cumin2002 - T388610
[14:18:45] <MatmaRex>	 taavi: are you sure it's live? i'm not seeing the expected logs
[14:19:05] <tgr_>	 MatmaRex: can you ping me when done?
[14:19:24] <MatmaRex>	 ok
[14:20:09] <taavi>	 the code is definitely there
[14:20:15] <taavi>	 let me try manually restarting php-fpm for good measure
[14:21:13] <jhathaway>	 !incidents
[14:21:14] <sirenbot>	 5945 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet)
[14:21:14] <sirenbot>	 5944 (RESOLVED)  [3x] ProbeDown sre (ip4 ncredir-https:443 probes/service http_ncredir-https_ip4)
[14:21:14] <sirenbot>	 5942 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet)
[14:21:14] <sirenbot>	 5943 (RESOLVED)  [2x] Primary inbound port utilisation over 80%  (paged) network noc ()
[14:21:33] <jhathaway>	 strange, just received a page for pfw1-eqiad
[14:22:44] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[14:22:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610
[14:22:50] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[14:23:37] <wikibugs>	 (03PS1) 10Gergő Tisza: Enable EmailAuth enforcement on group 2 for short test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390437)
[14:24:21] <_joe_>	 jhathaway: I think it was the ack expiring?
[14:24:30] <MatmaRex>	 hmm, maybe i was testing it wrong. let me try something else
[14:24:39] <jhathaway>	 _joe_: yes you are correct, now resolved
[14:24:57] <taavi>	 MatmaRex: i think you're just being bit by caching
[14:25:17] <taavi>	 i needed to manually invalidate cache for my user for loadFromDatabase to be called
[14:25:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:26:09] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1132691 (https://phabricator.wikimedia.org/T384944)
[14:26:09] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390437) (owner: 10Gergő Tisza)
[14:26:09] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1132691 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[14:26:40] <MatmaRex>	 taavi: yes. okay, i see it now
[14:26:49] <MatmaRex>	 one second
[14:27:31] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610
[14:27:44] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[14:28:04] <wikibugs>	 (03PS2) 10Gergő Tisza: Enable EmailAuth enforcement on group 2 for short test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390662)
[14:28:22] <wikibugs>	 (03PS2) 10Elukey: profile::service_proxy::envoy: add data-gateway-staging [puppet] - 10https://gerrit.wikimedia.org/r/1133848
[14:28:23] <hnowlan>	 jouncebot: nowandnext
[14:28:23] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 31 minute(s)
[14:28:23] <jouncebot>	 In 0 hour(s) and 31 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500)
[14:28:27] <taavi>	 hi
[14:29:00] <wikibugs>	 (03CR) 10Elukey: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey)
[14:29:00] <taavi>	 hnowlan: i think t.gr_ is in the queue first
[14:29:36] <wikibugs>	 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10708223 (10Ladsgroup) `ms-be1070` will probably alert this weekend, it's already at 93.7%. How do we depool a backend? I can't find an...
[14:29:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10708224 (10phaultfinder)
[14:29:39] <hnowlan>	 okay. I have some pending host renames that *shouldn't* impact scap (they're already depooled and decommissioned in confctl) but I'll hold
[14:30:31] <MatmaRex>	 taavi: i think i have everything i need, thank you
[14:30:46] <MatmaRex>	 please restore mwdebug to normal state :)
[14:30:48] <taavi>	 thanks, restoring then
[14:30:50] <taavi>	 and done
[14:30:52] <taavi>	 tgr_: your turn!
[14:30:54] <MatmaRex>	 tgr_: ^
[14:30:58] <tgr_>	 thx
[14:31:06] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5206/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey)
[14:31:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Reapply maps_bookworm role to maps-test2001 [puppet] - 10https://gerrit.wikimedia.org/r/1133915 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[14:31:46] <tgr_>	 hnowlan: I'm just changing a config flag so if you think it's fine to do in parallel with a scap backport, feel free
[14:32:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[14:32:23] <hnowlan>	 nah go ahead, there's no huge rush
[14:32:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[14:32:55] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[14:33:03] <wikibugs>	 (03Abandoned) 10Bartosz Dziewoński: Temporary debugging code for T389728 [extensions/CentralAuth] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133868 (owner: 10Bartosz Dziewoński)
[14:33:06] <wikibugs>	 (03Merged) 10jenkins-bot: Enable EmailAuth enforcement on group 2 for short test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133928 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[14:33:10] <wikibugs>	 (03PS1) 10Bking: Revert "cirrussearch: add second canary for OpenSearch migration" [puppet] - 10https://gerrit.wikimedia.org/r/1133929
[14:33:20] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] Revert "cirrussearch: add second canary for OpenSearch migration" [puppet] - 10https://gerrit.wikimedia.org/r/1133929 (owner: 10Bking)
[14:33:31] <logmsgbot>	 !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1133928|Enable EmailAuth enforcement on group 2 for short test (T390662)]]
[14:33:34] <stashbot>	 T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662
[14:36:56] <wikibugs>	 (03PS1) 10Hnowlan: wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791)
[14:37:30] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[14:38:28] <wikibugs>	 (03PS2) 10Hnowlan: wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791)
[14:39:02] <logmsgbot>	 !log tgr@deploy1003 tgr: Backport for [[gerrit:1133928|Enable EmailAuth enforcement on group 2 for short test (T390662)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:39:04] <stashbot>	 T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662
[14:41:42] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1133932 (https://phabricator.wikimedia.org/T384944)
[14:42:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2056 for ban node before reimaging - bking@cumin2002 - T388610
[14:42:13] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2056 for ban node before reimaging - bking@cumin2002 - T388610
[14:42:16] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[14:42:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2056* for ban node before reimaging - bking@cumin2002 - T388610
[14:42:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2056* for ban node before reimaging - bking@cumin2002 - T388610
[14:42:51] <logmsgbot>	 !log tgr@deploy1003 tgr: Continuing with sync
[14:42:52] <wikibugs>	 (03PS1) 10Hnowlan: service: remove videoscaler, jobrunner probes [puppet] - 10https://gerrit.wikimedia.org/r/1133934 (https://phabricator.wikimedia.org/T354791)
[14:44:15] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1131060 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French)
[14:44:20] <wikibugs>	 (03CR) 10Scott French: [C:03+2] Profile::Mediawiki_deployment: remove deprecated debug field [puppet] - 10https://gerrit.wikimedia.org/r/1131060 (https://phabricator.wikimedia.org/T389499) (owner: 10Scott French)
[14:45:11] <wikibugs>	 (03CR) 10Volans: "Disclaimer: I'm not familiar with the RFCs but Sukhbir told me I could check the script without studying it :)" [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:45:19] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1133932 (https://phabricator.wikimedia.org/T384944)
[14:46:23] <wikibugs>	 (03PS2) 10Hnowlan: service: remove videoscaler, jobrunner monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1133934 (https://phabricator.wikimedia.org/T354791)
[14:47:14] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: acme_chief: add wikimedia-ech.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:48:00] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "LGTM, is there a reason to switch to log rotation? Ease of grepping logs etc.?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans)
[14:48:22] <wikibugs>	 (03CR) 10Elukey: [C:03+1] .wmfconfig: add Debian bookworm build [software/cumin] - 10https://gerrit.wikimedia.org/r/1133884 (owner: 10Volans)
[14:49:37] <wikibugs>	 (03CR) 10Elukey: [C:03+1] cli: fine-tune CLI logging [software/cumin] - 10https://gerrit.wikimedia.org/r/1133885 (owner: 10Volans)
[14:49:49] <logmsgbot>	 !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133928|Enable EmailAuth enforcement on group 2 for short test (T390662)]] (duration: 16m 18s)
[14:49:52] <stashbot>	 T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662
[14:50:10] <tgr_>	 hnowlan: ^
[14:51:14] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10708353 (10elukey) >>! In T384003#10708154, @Jhancock.wm wrote: > i unfortunately cannot find a spare 8 TB drive. So we'd either need t...
[14:51:19] <jinxer-wm>	 FIRING: CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-c8 (172.31.255.2) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-f4-eqiad:9804&var-bgp_group=cloud_ebgp&var-bgp_neighbor=cloudsw1-c8 - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGP
[14:52:49] <wikibugs>	 (03PS1) 10Gergő Tisza: End EmailAuth enforcement group 2 test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662)
[14:53:41] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[14:54:27] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] End EmailAuth enforcement group 2 test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[14:56:22] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[14:57:12] <wikibugs>	 (03CR) 10Volans: "Ahem..." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans)
[14:57:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1133932 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[14:58:49] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] prometheus: cleanup k8s instances from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[15:00:04] <jouncebot>	 dancy and andre: Time to snap out of that daydream and deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500).
[15:01:44] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: service: Cleanup of wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944)
[15:02:38] <hnowlan>	 tgr_: thanks
[15:02:58] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "/me hides" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans)
[15:03:24] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1437 to wikikube-worker1167
[15:03:54] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[15:04:19] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.rename from mw1438 to wikikube-worker1168
[15:06:30] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1166.eqiad.wmnet with OS bookworm
[15:06:33] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1166
[15:06:33] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1166
[15:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:49] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708462 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker1166.eqiad.wmnet with OS bookworm
[15:08:07] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1437 to wikikube-worker1167 - hnowlan@cumin1002"
[15:08:50] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:09:12] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1437 to wikikube-worker1167 - hnowlan@cumin1002"
[15:09:12] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:09:13] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1167
[15:09:14] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox
[15:09:16] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:09:18] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10708490 (10RobH) Confirmed engineer visit for Monday, April 7th and opened ticket 01044010.
[15:09:39] <wikibugs>	 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10708491 (10MatthewVernon) We don't, there's no equivalent context in swift.  I can do a bulk-vacuum on that host, either tomorrow or M...
[15:09:43] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:10:35] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:10:36] <wikibugs>	 (03CR) 10Volans: [C:03+2] logging: rotate files [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans)
[15:10:40] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1167
[15:10:45] <wikibugs>	 (03CR) 10Volans: [C:03+2] .wmfconfig: add Debian bookworm build [software/cumin] - 10https://gerrit.wikimedia.org/r/1133884 (owner: 10Volans)
[15:10:48] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1437 to wikikube-worker1167
[15:10:53] <wikibugs>	 (03CR) 10Volans: [C:03+2] cli: fine-tune CLI logging [software/cumin] - 10https://gerrit.wikimedia.org/r/1133885 (owner: 10Volans)
[15:11:00] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708495 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw1437 to wikikube-worker1167 completed: - mw1437 (**PASS**)   - ✔️ Down...
[15:11:19] <jinxer-wm>	 FIRING: CloudCoreBGPDown: ...
[15:11:19] <jinxer-wm>	 Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-d5 (2620:0:861:fe0f::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-f4-eqiad:9804&var-bgp_group=prod_ebgp6&var-bgp_neighbor=cloudsw1-d5 - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[15:11:24] <wikibugs>	 (03PS2) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378)
[15:11:26] <wikibugs>	 (03CR) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH (037 comments) [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:12:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:13:26] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1438 to wikikube-worker1168 - hnowlan@cumin1002"
[15:13:32] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1438 to wikikube-worker1168 - hnowlan@cumin1002"
[15:13:32] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:13:33] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1168
[15:13:41] <Reedy>	 jouncebot: nowandnext
[15:13:41] <jouncebot>	 For the next 0 hour(s) and 46 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500)
[15:13:41] <jouncebot>	 In 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1600)
[15:14:32] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:14:39] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1168
[15:14:47] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1438 to wikikube-worker1168
[15:14:56] <tgr_>	 Seems like the scap for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133928 somehow didn't work. Is there such a thing as a scap log?
[15:15:01] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708601 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by hnowlan@cumin1002 from mw1438 to wikikube-worker1168 completed: - mw1438 (**PASS**)   - ✔️ Down...
[15:15:08] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:16:28] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1166.eqiad.wmnet wikikube-worker1167.eqiad.wmnet wikikube-worker1168.eqiad.wmnet on all recursors
[15:16:32] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1166.eqiad.wmnet wikikube-worker1167.eqiad.wmnet wikikube-worker1168.eqiad.wmnet on all recursors
[15:16:39] <wikibugs>	 (03PS1) 10Reedy: Remove catching of db exception [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956)
[15:16:52] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1167.eqiad.wmnet with OS bookworm
[15:16:53] <wikibugs>	 (03CR) 10Reedy: "not in wmf_deploy; will deal with later" [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[15:16:56] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1167
[15:16:56] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1167
[15:17:07] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708611 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker1167.eqiad.wmnet with OS bookworm
[15:17:15] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1168.eqiad.wmnet with OS bookworm
[15:17:18] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1168
[15:17:19] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1168
[15:17:23] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Remove catching of db exception [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[15:17:28] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708613 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host wikikube-worker1168.eqiad.wmnet with OS bookworm
[15:18:13] <wikibugs>	 (03CR) 10Reedy: [C:03+2] "Oh it is. Ignore that then." [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[15:19:09] <wikibugs>	 (03PS3) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378)
[15:20:24] <wikibugs>	 (03Merged) 10jenkins-bot: logging: rotate files [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133886 (owner: 10Volans)
[15:20:40] <wikibugs>	 (03CR) 10Volans: "replies inline" [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:21:16] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1166.eqiad.wmnet with reason: host reimage
[15:21:19] <jinxer-wm>	 FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-d5 (10.64.147.6) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[15:21:52] <wikibugs>	 (03Merged) 10jenkins-bot: Remove catching of db exception [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133944 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[15:22:38] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133944|Remove catching of db exception (T390956)]]
[15:22:40] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[15:23:19] <wikibugs>	 06SRE, 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 13Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10708700 (10Ahoelzl) 05Open→03Resolved
[15:24:11] <wikibugs>	 (03PS4) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378)
[15:24:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] auth_metrics: add recording rules for grafana widgets [puppet] - 10https://gerrit.wikimedia.org/r/1133924 (https://phabricator.wikimedia.org/T390672) (owner: 10Tiziano Fogli)
[15:24:19] <wikibugs>	 (03CR) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:24:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:24:54] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1166.eqiad.wmnet with reason: host reimage
[15:25:49] <wikibugs>	 (03PS5) 10Ssingh: utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378)
[15:25:53] <wikibugs>	 (03PS1) 10Jforrester: wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133948
[15:26:19] <jinxer-wm>	 FIRING: [3x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-f4-eqiad and cloudsw1-d5 (10.64.147.6) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[15:27:15] <wikibugs>	 (03Merged) 10jenkins-bot: .wmfconfig: add Debian bookworm build [software/cumin] - 10https://gerrit.wikimedia.org/r/1133884 (owner: 10Volans)
[15:27:16] <wikibugs>	 (03Merged) 10jenkins-bot: cli: fine-tune CLI logging [software/cumin] - 10https://gerrit.wikimedia.org/r/1133885 (owner: 10Volans)
[15:27:22] <wikibugs>	 (03CR) 10Ssingh: "I think I got all comments in. Thanks for the review and the suggestion on removing join. That was leftover from textwrap and I think I wi" [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:29:28] <wikibugs>	 (03PS1) 10Gergő Tisza: Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662)
[15:30:33] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[15:30:33] <logmsgbot>	 !log reedy@deploy1003 reedy: Backport for [[gerrit:1133944|Remove catching of db exception (T390956)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:30:36] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[15:30:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10708792 (10phaultfinder)
[15:30:45] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:30:46] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[15:30:52] <tgr_>	 I'll deploy a config bugfix, Reedy plz let me know when done
[15:31:50] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1167.eqiad.wmnet with reason: host reimage
[15:31:59] <sukhe>	 volans: thanks for the in-depth reviews as always <3
[15:32:09] <volans>	 anytime :)
[15:32:30] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1168.eqiad.wmnet with reason: host reimage
[15:33:10] <logmsgbot>	 !log reedy@deploy1003 reedy: Continuing with sync
[15:33:21] <wikibugs>	 (03PS3) 10Ssingh: hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378)
[15:34:00] <Amir1>	 😍 wikimedia-ech.org
[15:34:07] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5208/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:34:33] <_joe_>	 jouncebot: now 
[15:34:33] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500)
[15:34:42] <_joe_>	 jouncebot: nowandnext
[15:34:42] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1500)
[15:34:42] <jouncebot>	 In 0 hour(s) and 25 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1600)
[15:34:50] <sukhe>	 Amir1: !
[15:34:52] <_joe_>	 ok, I can merge this change safely
[15:34:57] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1167.eqiad.wmnet with reason: host reimage
[15:35:14] <Amir1>	 _joe_: we are deploying a couple of UBNs right now 
[15:35:51] <Amir1>	 I mean fixes to UBNs, not causing them
[15:35:53] <_joe_>	 Amir1: yeah these will not affect scap
[15:35:58] <Amir1>	 ah okay
[15:36:04] <_joe_>	 Amir1: who says you're not creating new ones
[15:36:13] <Amir1>	 one way to find out!
[15:36:41] <_joe_>	 Amir1: in any case, lmk when you're done, I will still need a lock on helmfile on a couple namespaces
[15:36:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:37:00] <Amir1>	 sure thanks!
[15:37:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:38:27] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1168.eqiad.wmnet with reason: host reimage
[15:38:31] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: fix edit-check blubber image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133952
[15:39:43] <wikibugs>	 (03PS1) 10Ssingh: hiera: acme_chief: fix ordering of DC [puppet] - 10https://gerrit.wikimedia.org/r/1133953
[15:39:52] <wikibugs>	 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10708946 (10Ladsgroup) Thanks. Let me know if I can help on anything!
[15:40:06] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133944|Remove catching of db exception (T390956)]] (duration: 17m 28s)
[15:40:08] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[15:41:13] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1166.eqiad.wmnet with OS bookworm
[15:41:31] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10708954 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker1166.eqiad.wmnet with OS bookworm completed: - wikik...
[15:41:59] <Amir1>	 Reedy: shall tgr_ and I go ahead with the mw config deploy, or are there other CN patches that need to be created and deployed?
[15:42:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:43:55] <Reedy>	 Amir1: Your patch doesn't fix it, just shows a more useful error
[15:44:08] <Reedy>	 You're GTG, but https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/1133954 needs reviewing :)
[15:45:03] <Amir1>	 awesome, the config patch will be quick to deploy
[15:45:25] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[15:45:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[15:46:17] <wikibugs>	 (03Merged) 10jenkins-bot: Enable EmailAuth enforcement on group 2 for short test (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133949 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[15:46:43] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1133949|Enable EmailAuth enforcement on group 2 for short test (#2) (T390662)]]
[15:46:46] <stashbot>	 T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662
[15:49:37] <wikibugs>	 (03PS9) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[15:50:50] <wikibugs>	 (03CR) 10Federico Ceratto: "Updated: switched from `--slow` pool-in to default speed (4 steps), also switched to use `wait_for_replication`" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[15:52:03] <logmsgbot>	 !log ladsgroup@deploy1003 tgr, ladsgroup: Backport for [[gerrit:1133949|Enable EmailAuth enforcement on group 2 for short test (#2) (T390662)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:52:04] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1167.eqiad.wmnet with OS bookworm
[15:52:06] <stashbot>	 T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662
[15:52:18] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker1167.eqiad.wmnet with OS bookworm completed: - wikik...
[15:52:21] <wikibugs>	 (03PS2) 10HMonroy: Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121)
[15:52:38] <logmsgbot>	 !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on elastic2056.codfw.wmnet with reason: adding net-new role
[15:53:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[15:53:48] <logmsgbot>	 !log ladsgroup@deploy1003 tgr, ladsgroup: Continuing with sync
[15:54:35] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5210/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133953 (owner: 10Ssingh)
[15:55:12] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1168.eqiad.wmnet with OS bookworm
[15:55:29] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host wikikube-worker1168.eqiad.wmnet with OS bookworm completed: - wikik...
[15:55:33] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+1] ml-services: fix edit-check blubber image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133952 (owner: 10Ilias Sarantopoulos)
[15:56:49] <wikibugs>	 (03CR) 10CI reject: [V:04-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[15:57:32] <wikibugs>	 (03CR) 10Ssingh: hiera: acme_chief: fix ordering of DC [puppet] - 10https://gerrit.wikimedia.org/r/1133953 (owner: 10Ssingh)
[15:58:03] <hnowlan>	 !log running homer 'cr*eqiad*' commit for new wikikube workers 
[15:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:05] <jouncebot>	 jhathaway and rzl: Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1600). Please do the needful.
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:58] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133949|Enable EmailAuth enforcement on group 2 for short test (#2) (T390662)]] (duration: 14m 15s)
[16:01:01] <stashbot>	 T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662
[16:05:51] <Amir1>	 Reedy: that patch I have is now deployed, shall we backport to wmf_deploy? I'm actually not sure how CN code is backported. Just deploy on wmf_deploy branch?
[16:06:07] <logmsgbot>	 !log hnowlan@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1166-1168].eqiad.wmnet
[16:06:09] <logmsgbot>	 !log hnowlan@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1166-1168].eqiad.wmnet
[16:06:22] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709114 (10ops-monitoring-bot) pool host wikikube-worker[1166-1168].eqiad.wmnet by hnowlan@cumin1002 with reason: None
[16:06:30] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709115 (10ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by hnowlan@cumin1002 pool for host wikikube-worker[1166-1168].eqiad.wmnet completed: - wik...
[16:06:41] <wikibugs>	 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T390998 (10hnowlan) 03NEW
[16:07:34] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10709138 (10hnowlan)
[16:09:33] <wikibugs>	 (03CR) 10Superpes15: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[16:09:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709153 (10phaultfinder)
[16:10:02] <wikibugs>	 (03PS1) 10Reedy: Banner: Conditionally check for banner existence from primary db [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133959 (https://phabricator.wikimedia.org/T390956)
[16:10:07] <Reedy>	 Amir1: Needs to go to .23 branch too
[16:10:10] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Banner: Conditionally check for banner existence from primary db [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133959 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[16:10:20] <Amir1>	 ah, fun
[16:11:19] <Reedy>	 we branch deployment branches from wmf_deploy etc
[16:11:19] <jinxer-wm>	 FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-d5 (10.64.147.4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[16:12:25] <wikibugs>	 (03CR) 10Volans: upgrade.py: Depool, repool, update Phabricator (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[16:13:30] <wikibugs>	 (03Merged) 10jenkins-bot: Banner: Conditionally check for banner existence from primary db [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133959 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[16:13:49] <Reedy>	 let's get that out
[16:14:25] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133959|Banner: Conditionally check for banner existence from primary db (T390956)]]
[16:14:28] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[16:15:11] <Amir1>	 thanks
[16:15:42] <wikibugs>	 (03PS1) 10Volans: spicerack: add Spicerack interactive shell [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329)
[16:16:42] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans)
[16:17:44] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync
[16:17:47] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync
[16:17:48] <wikibugs>	 (03PS2) 10Volans: spicerack: add Spicerack interactive shell [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329)
[16:19:29] <wikibugs>	 (03PS3) 10Volans: spicerack: add Spicerack interactive shell [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329)
[16:19:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10709229 (10phaultfinder)
[16:21:40] <logmsgbot>	 !log reedy@deploy1003 reedy: Backport for [[gerrit:1133959|Banner: Conditionally check for banner existence from primary db (T390956)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:21:43] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[16:22:28] <logmsgbot>	 !log reedy@deploy1003 reedy: Continuing with sync
[16:22:32] <hnowlan>	 !log decommissioning all but 1 eqiad jobrunner node in confctl 
[16:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:43] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans)
[16:24:51] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/1133953 (owner: 10Ssingh)
[16:27:03] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[16:27:12] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:28:12] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: wikifunctions: Add an extra rule for internal Ingress endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133966
[16:29:35] <wikibugs>	 (03PS1) 10Bking: WIP: more fine-grained shard status checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133967 (https://phabricator.wikimedia.org/T383811)
[16:29:39] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133959|Banner: Conditionally check for banner existence from primary db (T390956)]] (duration: 15m 13s)
[16:29:42] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[16:30:38] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:30:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709280 (10phaultfinder)
[16:30:54] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] wikifunctions: Add an extra rule for internal Ingress endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133966 (owner: 10Alexandros Kosiaris)
[16:31:19] <jinxer-wm>	 FIRING: CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cloudsw1-e4-eqiad:9804&var-bgp_group=prod_ebgp4&var-bgp_neighbor=cloudsw1-c8 - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPD
[16:32:25] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Add an extra rule for internal Ingress endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133966 (owner: 10Alexandros Kosiaris)
[16:32:56] <wikibugs>	 (03PS1) 10Reedy: Banner: While saving, do exists() against primary [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133969 (https://phabricator.wikimedia.org/T390956)
[16:33:00] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Banner: While saving, do exists() against primary [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133969 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[16:34:01] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:36:19] <jinxer-wm>	 FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[16:36:23] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:36:27] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:36:33] <wikibugs>	 (03Merged) 10jenkins-bot: Banner: While saving, do exists() against primary [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133969 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[16:36:41] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[16:36:52] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[16:37:03] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[16:37:07] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[16:37:10] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133969|Banner: While saving, do exists() against primary (T390956)]]
[16:37:12] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[16:37:13] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[16:40:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: more fine-grained shard status checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133967 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking)
[16:44:25] <logmsgbot>	 !log reedy@deploy1003 reedy: Backport for [[gerrit:1133969|Banner: While saving, do exists() against primary (T390956)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:44:28] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[16:51:19] <jinxer-wm>	 FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[16:51:30] <logmsgbot>	 !log reedy@deploy1003 reedy: Continuing with sync
[16:51:48] <wikibugs>	 (03PS2) 10Esanders: Hide "Insert graph" tool in VE when graphs are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501)
[16:52:49] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501) (owner: 10Esanders)
[16:54:45] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[16:54:54] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:55:25] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] hiera: acme_chief: fix ordering of DC [puppet] - 10https://gerrit.wikimedia.org/r/1133953 (owner: 10Ssingh)
[16:55:44] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[16:56:18] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[16:57:03] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[16:58:42] <wikibugs>	 (03PS1) 10Esanders: Enable DiscussionTools visual enhancements on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264)
[16:58:44] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133969|Banner: While saving, do exists() against primary (T390956)]] (duration: 21m 33s)
[16:58:46] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[16:59:08] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders)
[16:59:50] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] utils: add a script to generate HTTPS TYPE65 records for ECH [dns] - 10https://gerrit.wikimedia.org/r/1133917 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[17:00:05] <jouncebot>	 bd808: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1700).
[17:00:05] <jouncebot>	 swfrench-wmf: MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1700). Please do the needful.
[17:00:16] <swfrench-wmf>	 o/
[17:00:17] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[17:00:46] <swfrench-wmf>	 Reedy: I see you had some backports going just recently. are you done for now?
[17:00:57] <Reedy>	 swfrench-wmf: it's turtles
[17:01:06] <swfrench-wmf>	 lol
[17:01:08] <Reedy>	 I can take a break for a bit if you've stuff you need to do :)
[17:01:18] <Reedy>	 I've got another to go out after it's master merged and backported
[17:01:19] <jinxer-wm>	 FIRING: [5x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[17:02:01] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[17:02:08] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[17:02:17] <swfrench-wmf>	 Reedy: got it, thanks! yeah, I'll try to get through my change now - should take about 25-30m based on prior experience ... as long as the registry doesn't explode, that is.
[17:02:38] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[17:02:45] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1133229 (owner: 10Scott French)
[17:02:48] <wikibugs>	 (03PS2) 10Esanders: Enable DiscussionTools visual enhancements on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264)
[17:03:37] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-04-03-122108-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133973
[17:04:09] <wikibugs>	 (03CR) 10Scott French: [V:03+2] "Built and verified locally." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1133229 (owner: 10Scott French)
[17:04:11] <wikibugs>	 (03PS1) 10Ssingh: [DO NOT MERGE] set MX records for dyna [dns] - 10https://gerrit.wikimedia.org/r/1133974
[17:04:12] <wikibugs>	 (03CR) 10Scott French: [V:03+2 C:03+2] php8.1: Rebuild to update Debian packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1133229 (owner: 10Scott French)
[17:06:19] <jinxer-wm>	 RESOLVED: [4x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-e4-eqiad and cloudsw1-c8 (10.64.147.0) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[17:07:18] <wikibugs>	 (03PS1) 10Esanders: Enable DiscussionTools visual enhancements everywhere except enwiki & ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133975 (https://phabricator.wikimedia.org/T379264)
[17:07:45] <wikibugs>	 (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-04-03-122108-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133973 (owner: 10BryanDavis)
[17:08:37] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133976
[17:09:17] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-04-03-122108-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133973 (owner: 10BryanDavis)
[17:09:29] * swfrench-wmf offers words of encouragement to docker-registry
[17:10:04] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Deployment to pick up new PHP 8.1 production images
[17:10:49] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:11:06] <logmsgbot>	 !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:11:14] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:11:33] <logmsgbot>	 !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:11:42] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:12:01] <logmsgbot>	 !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:14:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709474 (10phaultfinder)
[17:15:37] <wikibugs>	 (03PS1) 10Reedy: Banner: More reading from primary... [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133984 (https://phabricator.wikimedia.org/T390956)
[17:20:42] <wikibugs>	 (03CR) 10Reedy: [C:03+2] Banner: More reading from primary... [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133984 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[17:22:17] <wikibugs>	 (03PS2) 10Gergő Tisza: End EmailAuth enforcement group 2 test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662)
[17:23:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10709517 (10VRiley-WMF) Due to power restraints, we will need to relocate an-worker1181 to an-worker1186 in racks E8 and F8.
[17:23:24] <wikibugs>	 (03Merged) 10jenkins-bot: Banner: More reading from primary... [extensions/CentralNotice] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133984 (https://phabricator.wikimedia.org/T390956) (owner: 10Reedy)
[17:23:28] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133948 (owner: 10Jforrester)
[17:27:29] <wikibugs>	 (03CR) 10Ssingh: [C:04-2] "Here is why I believe this will not work for what we are trying to do with HIBP:" [dns] - 10https://gerrit.wikimedia.org/r/1133974 (owner: 10Ssingh)
[17:28:35] <wikibugs>	 (03PS2) 10Superpes15: Create wikipedia-pl-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1133988 (https://phabricator.wikimedia.org/T391009)
[17:29:14] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] Create wikipedia-pl-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1133988 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15)
[17:29:40] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1133988 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15)
[17:30:10] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Create wikipedia-pl-arbcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1133988 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15)
[17:30:20] <logmsgbot>	 !log dzahn@dns1004 START - running authdns-update
[17:32:37] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] CommonSettings-labs: Update BounceHandler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133156 (owner: 10Reedy)
[17:32:38] <dancy>	 swfrench-wmf: I'm definitely interesting in seeing the time impact of serializing the pushes.
[17:32:49] <dancy>	 *interested
[17:32:49] <logmsgbot>	 !log dzahn@dns1004 END - running authdns-update
[17:33:49] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings-labs: Update BounceHandler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133156 (owner: 10Reedy)
[17:34:07] <swfrench-wmf>	 dancy: alas, this one probably won't be super informative, as it only affects the 8.1 image "tree" (so there's really only one large blob to upload)
[17:34:21] <dancy>	 Gotcha
[17:35:56] <swfrench-wmf>	 that said, the fact that this deployment appears to be working, albeit slowly (expected for a full rebuild), isn't incompatible with the idea that it's large concurrent uploads that are the trigger ... so, yay?
[17:37:02] <Reedy>	 why would you want to do anything fast
[17:37:15] <dancy>	 Go fast and break registries
[17:37:24] <wikibugs>	 (03CR) 10Ssingh: [C:04-2] "^ Context on the above is that we are trying to add MX records for the subdomains, like enwiki, m editions, and all." [dns] - 10https://gerrit.wikimedia.org/r/1133974 (owner: 10Ssingh)
[17:37:38] <swfrench-wmf>	 lol
[17:37:48] <Reedy>	 "stop ddos-ing your own registries"
[17:38:31] <swfrench-wmf>	 to be fair, we are throwing around some wildly large images
[17:38:34] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Deployment to pick up new PHP 8.1 production images (duration: 28m 57s)
[17:38:39] <swfrench-wmf>	 \o/
[17:38:48] <swfrench-wmf>	 I'm frequently impressed that it works at all ...
[17:38:51] <swfrench-wmf>	 Reedy: all yours
[17:38:51] <Reedy>	 30 mins for a big image change... isn't bad
[17:38:53] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[56] - https://phabricator.wikimedia.org/T387142#10709610 (10Jclark-ctr)
[17:38:55] <Reedy>	 swfrench-wmf: you must be new here ;)
[17:38:58] <Reedy>	 ta
[17:39:34] <dancy>	 I object to characterizing a few gigabytes as wildly large.  It's just files.
[17:39:46] <logmsgbot>	 !log reedy@deploy1003 Started scap sync-world: Backport for [[gerrit:1133984|Banner: More reading from primary... (T390956)]], [[gerrit:1133156|CommonSettings-labs: Update BounceHandler config]]
[17:39:49] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[17:40:03] <dancy>	 We should be able to do gigabyte files in the 2020s.
[17:40:06] <Reedy>	 Amir1: I'll just deploy that patch then, shall I
[17:40:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T390998#10709614 (10VRiley-WMF)
[17:40:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709615 (10phaultfinder)
[17:40:48] <Amir1>	 I rebased the beta cluster one and so it doesn't need deployment 
[17:40:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T390998#10709617 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF This has been completed
[17:40:56] <Amir1>	 but the backport? I'd be grateful 
[17:41:36] <swfrench-wmf>	 dancy: true, to be fair(er) it's more the "read side" that surprises me (i.e., distributing GiB of image to hundreds of worker nodes fairly quickly) :)
[17:41:48] <Reedy>	 reading is easy, writing is hard
[17:41:51] <Reedy>	 or something
[17:43:32] <wikibugs>	 (03PS4) 10Ssingh: hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378)
[17:43:47] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10709638 (10VRiley-WMF)
[17:43:59] <wikibugs>	 (03PS1) 10Dzahn: hiera: cleanup gitlab-runner docker gc settings [puppet] - 10https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948)
[17:44:11] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5211/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[17:44:19] <wikibugs>	 (03PS1) 10Superpes15: Add arbcom_plwiki to private wikis on hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1133993 (https://phabricator.wikimedia.org/T391009)
[17:45:16] <wikibugs>	 (03CR) 10Dzahn: "before this change:" [puppet] - 10https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn)
[17:47:48] <logmsgbot>	 !log reedy@deploy1003 reedy: Backport for [[gerrit:1133984|Banner: More reading from primary... (T390956)]], [[gerrit:1133156|CommonSettings-labs: Update BounceHandler config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:47:52] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[17:48:12] <logmsgbot>	 !log reedy@deploy1003 reedy: Continuing with sync
[17:51:09] <Amir1>	 🎉
[17:56:11] <wikibugs>	 (03PS1) 10Superpes15: Apache config for arbcom_plwiki [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009)
[17:57:30] <logmsgbot>	 !log reedy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133984|Banner: More reading from primary... (T390956)]], [[gerrit:1133156|CommonSettings-labs: Update BounceHandler config]] (duration: 17m 43s)
[17:57:32] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[18:00:05] <jouncebot>	 dancy and andre: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T1800).
[18:00:10] <dancy>	 o/
[18:02:54] <wikibugs>	 (03PS1) 10Dzahn: hiera: cleanup some gerrit and etherpad hiera values [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948)
[18:03:00] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.149.0" for 2 host(s)
[18:03:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: cleanup some gerrit and etherpad hiera values [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn)
[18:03:45] <Reedy>	 dancy: I think I'm clear now
[18:04:03] <Reedy>	 T390956 should've been tagged a train blocker, but it is fixed now
[18:04:04] <stashbot>	 T390956: Internal error when creating/cloning centralnotice banners (Wikimedia\Rdbms\DBUnexpectedError: Banner::save: Expected mass rollback of all peer transactions (DBO_TRX set)) - https://phabricator.wikimedia.org/T390956
[18:04:13] <dancy>	 Reedy: thx
[18:04:17] <Reedy>	 actually, let me do that for tracking purposes
[18:04:19] <Reedy>	 (and then close it)
[18:04:48] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.149.0" completed for 2 hosts
[18:05:10] <wikibugs>	 (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133997 (https://phabricator.wikimedia.org/T386218)
[18:05:12] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133997 (https://phabricator.wikimedia.org/T386218) (owner: 10TrainBranchBot)
[18:05:24] <wikibugs>	 (03CR) 10Dzahn: "There are 2 Change-Id footers here and I'm not sure which is the right one." [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn)
[18:06:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610
[18:06:00] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133997 (https://phabricator.wikimedia.org/T386218) (owner: 10TrainBranchBot)
[18:06:02] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[18:06:22] <wikibugs>	 (03PS2) 10Dzahn: hiera: cleanup some gerrit and etherpad hiera values [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948)
[18:08:49] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610
[18:11:23] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[18:11:50] <wikibugs>	 (03PS7) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610)
[18:12:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610
[18:12:13] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[18:16:25] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[18:19:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[18:20:02] <logmsgbot>	 !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.23  refs T386218
[18:20:05] <stashbot>	 T386218: 1.44.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T386218
[18:20:42] <wikibugs>	 (03PS1) 10Bking: cirrussearch: add second canary for OpenSearch migration [puppet] - 10https://gerrit.wikimedia.org/r/1133999 (https://phabricator.wikimedia.org/T388610)
[18:20:45] <wikibugs>	 (03PS1) 10Dzahn: lists: send email to meta admin when steward list members are synced [puppet] - 10https://gerrit.wikimedia.org/r/1134000 (https://phabricator.wikimedia.org/T351202)
[18:21:46] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: test one - bking@cumin2002 - T388610
[18:21:47] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: add second canary for OpenSearch migration [puppet] - 10https://gerrit.wikimedia.org/r/1133999 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[18:21:48] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[18:22:51] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: add second canary for OpenSearch migration [puppet] - 10https://gerrit.wikimedia.org/r/1133999 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[18:24:48] <wikibugs>	 (03PS8) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610)
[18:25:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:25:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10709910 (10phaultfinder)
[18:29:03] <wikibugs>	 (03PS9) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610)
[18:30:15] <wikibugs>	 (03PS1) 10Dzahn: mailman3: fix quoting in mail_cmd for sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1134001 (https://phabricator.wikimedia.org/T351202)
[18:31:12] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] mailman3: fix quoting in mail_cmd for sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1134001 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[18:35:07] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1134000/5213/lists1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1134000 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[18:35:16] <wikibugs>	 (03CR) 10Jforrester: "It looks like this might have broken the back-end of Wikifunctions: T391022 (though the reported timing of issues doesn't quite line up)." [puppet] - 10https://gerrit.wikimedia.org/r/1133932 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[18:35:24] <wikibugs>	 (03CR) 10CI reject: [V:04-1] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[18:37:57] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10709960 (10Neobeta61) can you try updating storcli to 007.3305.0000.0000 please  DCSG01809266 (Port Of Defect DCSG01804765) Differing responses for set personality with diff...
[18:45:02] <wikibugs>	 (03PS10) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610)
[18:45:49] <wikibugs>	 (03PS11) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610)
[18:46:04] <wikibugs>	 (03PS12) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610)
[18:53:22] <wikibugs>	 (03PS1) 10Dzahn: mailman3: remove superfluous double quotes in sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1134013 (https://phabricator.wikimedia.org/T351202)
[18:53:45] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] mailman3: remove superfluous double quotes in sync_list_members [puppet] - 10https://gerrit.wikimedia.org/r/1134013 (https://phabricator.wikimedia.org/T351202) (owner: 10Dzahn)
[18:55:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710158 (10phaultfinder)
[18:58:47] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mw-wikifunctions: Add a missing SAN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134017 (https://phabricator.wikimedia.org/T384944)
[19:01:16] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: service: Cleanup of wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944)
[19:01:16] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mesh: Use sets_sni for mw-wikifuctions [puppet] - 10https://gerrit.wikimedia.org/r/1134020 (https://phabricator.wikimedia.org/T384944)
[19:02:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610
[19:02:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610
[19:02:25] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[19:02:55] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1132691 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[19:04:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mw-wikifunctions: Add a missing SAN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134017 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[19:05:00] <wikibugs>	 (03CR) 10Jforrester: service: Cleanup of wikifunctions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[19:06:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch2055*,cirrussearch2056* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610
[19:06:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch2055*,cirrussearch2056* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610
[19:10:05] <wikibugs>	 (03Merged) 10jenkins-bot: mw-wikifunctions: Add a missing SAN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134017 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[19:11:43] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[19:12:00] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[19:12:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:12:37] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[19:13:06] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[19:13:17] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[19:13:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2056.codfw.wmnet with OS bullseye
[19:13:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2056
[19:13:36] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[19:13:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:13:48] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[19:13:53] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[19:14:37] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[19:14:45] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[19:14:51] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[19:15:09] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[19:16:03] <wikibugs>	 (03PS1) 10Dzahn: Revert "lists: send email to meta admin when steward list members are synced" [puppet] - 10https://gerrit.wikimedia.org/r/1134028
[19:17:20] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[19:17:23] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[19:19:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2056 - bking@cumin2002"
[19:19:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2056 - bking@cumin2002"
[19:19:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:19:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2056.codfw.wmnet 181.0.192.10.in-addr.arpa 1.8.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:19:21] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2056.codfw.wmnet 181.0.192.10.in-addr.arpa 1.8.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:19:22] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2056
[19:19:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mesh: Use sets_sni for mw-wikifuctions [puppet] - 10https://gerrit.wikimedia.org/r/1134020 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[19:19:58] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: mesh: Use sets_sni for mw-wikifuctions [puppet] - 10https://gerrit.wikimedia.org/r/1134020 (https://phabricator.wikimedia.org/T384944)
[19:20:10] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] Revert "lists: send email to meta admin when steward list members are synced" [puppet] - 10https://gerrit.wikimedia.org/r/1134028 (owner: 10Dzahn)
[19:20:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2056
[19:20:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2056
[19:20:51] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: service: Cleanup of mw-wikifunctions old LVS leftovers [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944)
[19:20:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: service: Cleanup of mw-wikifunctions old LVS leftovers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[19:24:43] <wikibugs>	 (03Abandoned) 10Jforrester: [tests] Ensure each config has at most one value per wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947365 (owner: 10Urbanecm)
[19:29:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710308 (10phaultfinder)
[19:30:56] <wikibugs>	 06SRE: NDA request coverage for KFrancis's PTO - https://phabricator.wikimedia.org/T391032#10710310 (10taavi)
[19:32:01] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[19:32:54] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[19:33:04] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[19:33:16] <wikibugs>	 06SRE: NDA request coverage for KFrancis's PTO - https://phabricator.wikimedia.org/T391032#10710312 (10Dzahn) a:05Dzahn→03None fyi: @ayounsi (week of April 7th), @jijiki (week of April 12th)
[19:33:18] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[19:34:04] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[19:34:50] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[19:35:47] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10710320 (10phaultfinder)
[19:36:19] <jinxer-wm>	 FIRING: [3x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-c8-eqiad and cloudsw1-e4 (10.64.146.254) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[19:39:47] <James_F>	 akosiaris: Now I get `Unexpected token '<', \"<!DOCTYPE \"... is not valid JSON.`.
[19:40:09] <James_F>	 So HTML rather than an 'upstream connect failed' message.
[19:40:20] <akosiaris>	 one could call this an improvement!
[19:40:43] <James_F>	 Depends which HTML it is. :-) E.g. is that coming from an MW instance but the wrong one, or a non-MW.
[19:41:19] <jinxer-wm>	 RESOLVED: [3x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-c8-eqiad and cloudsw1-e4 (10.64.146.254) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown
[19:42:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:44:07] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] wikimedia-ech: add ncredir-parking [dns] - 10https://gerrit.wikimedia.org/r/1122155 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[19:45:08] <akosiaris>	 James_F: it's the unconfigured domain html
[19:45:12] <akosiaris>	 with a 404
[19:45:14] <James_F>	 Aha.
[19:45:25] <akosiaris>	 heh, mediawiki is actually sending a proper http header here
[19:45:26] <James_F>	 So… are we not passing the header correctly?
[19:46:01] <James_F>	 Or is it getting munged somehow, I guess.
[19:46:04] <akosiaris>	 can't be. You are passing it previously
[19:46:19] <James_F>	 Yeah, I meant "we" including ingress or whatever.
[19:56:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10710364 (10VRiley-WMF) Hey @Vgutierrez we have received the NIC. Is there a specific time for us to install it?
[19:57:18] <wikibugs>	 (03PS1) 10Bking: cirrussearch: add puppet 7 hieradata to DC-specific config [puppet] - 10https://gerrit.wikimedia.org/r/1134043 (https://phabricator.wikimedia.org/T388610)
[19:57:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10710373 (10VRiley-WMF) a:03VRiley-WMF
[19:58:18] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134043 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T2000). nyaa~
[20:00:05] <jouncebot>	 tgr, edsanders, and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:07] <James_F>	 akosiaris: Any thoughts for the next step?
[20:00:27] <akosiaris>	 James_F: I am looking at logstash logs
[20:00:36] <wikibugs>	 (03PS3) 10HMonroy: Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121)
[20:01:10] <wikibugs>	 (03PS4) 10HMonroy: Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121)
[20:01:28] <wikibugs>	 (03CR) 10HMonroy: Enable Codex and Multiblocks in German and Italian wiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[20:01:43] <James_F>	 I can deploy the backport window things whilst I'm here.
[20:01:47] <James_F>	 edsanders: You OK to go?
[20:02:53] <James_F>	 tgr_: OK for me to push out the EmailAuth group2 disablement?
[20:05:41] <akosiaris>	 ok, got it:  https://lounge.uname.gr/uploads/d87fdc1baa696f18/image.png 
[20:05:50] <James_F>	 Well, I'll do mine alone to be getting on with.
[20:05:51] <akosiaris>	 envoy is apparently overriding the domain
[20:05:54] <tgr_>	 James_F: appreciated, thanks! 
[20:05:57] <James_F>	 tgr_: Cool.
[20:06:02] <wikibugs>	 (03CR) 10Superpes15: [C:03+1] Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[20:06:02] <tgr_>	 I ended up in an unexpected meeting
[20:06:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[20:06:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133948 (owner: 10Jforrester)
[20:06:35] <James_F>	 akosiaris: Is that 'easily' fixed?
[20:06:55] <akosiaris>	 looking
[20:07:30] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging so we can finish our reimage." [puppet] - 10https://gerrit.wikimedia.org/r/1134043 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[20:09:20] <James_F>	 Eurgh, CI is backed up so much it's pending waiting for config patches.
[20:09:36] <James_F>	 We have  test-prio but not gate-and-submit-prio, because this is not meant to happen.
[20:09:51] <edsanders>	 James_F: yes
[20:10:00] <James_F>	 edsanders: Excellent, will do you second.
[20:10:19] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: service: Cleanup of wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944)
[20:10:19] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mesh: Use http_host as well for mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1134050 (https://phabricator.wikimedia.org/T384944)
[20:10:30] <akosiaris>	 James_F: ok, point taken, I'll self-merge https://gerrit.wikimedia.org/r/1134050
[20:10:51] <Reedy>	 jouncebot: nowandnext
[20:10:51] <jouncebot>	 For the next 0 hour(s) and 49 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T2000)
[20:10:51] <jouncebot>	 In 0 hour(s) and 49 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T2100)
[20:10:58] <James_F>	 akosiaris: :-( Won't that break requests to wikidata.org?
[20:11:04] <James_F>	 Reedy: Patience, padawan.
[20:11:10] <Reedy>	 pfft
[20:11:22] <Reedy>	 I won't review your patch then
[20:11:27] <wikibugs>	 (03PS4) 10Reedy: search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester)
[20:11:29] <James_F>	 Sorry.
[20:11:31] <wikibugs>	 (03PS5) 10Reedy: search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester)
[20:11:37] <akosiaris>	 oh, you got that too, talking via the same endpoint to wikidata.org
[20:11:40] <wikibugs>	 (03CR) 10Reedy: [C:03+1] "GTG in a backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester)
[20:11:46] <James_F>	 akosiaris: Yes.
[20:11:48] <akosiaris>	 damn
[20:12:13] <wikibugs>	 (03Merged) 10jenkins-bot: End EmailAuth enforcement group 2 test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133937 (https://phabricator.wikimedia.org/T390662) (owner: 10Gergő Tisza)
[20:12:17] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133948 (owner: 10Jforrester)
[20:12:21] <James_F>	 Finally.
[20:12:24] <kostajh>	 tgr_: can you wait on syncing the EmailAuth one please 
[20:12:35] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1133937|End EmailAuth enforcement group 2 test (T390662)]], [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]]
[20:12:35] <kostajh>	 cc Reedy 
[20:12:37] <stashbot>	 T390662: EmailAuth: Enable "enforce" mode for logins from unknown IP/device when IP is known to IPoid - https://phabricator.wikimedia.org/T390662
[20:12:38] <James_F>	 kostajh: Wait as in I should stop?
[20:12:49] <James_F>	 kostajh: Or wait as in pause once it hits debug?
[20:13:05] <kostajh>	 James_F: wait as in, some people are talking about not deploying this. I'm chatting with them now. Sorry. 
[20:13:08] <logmsgbot>	 !log jforrester@deploy1003 sync-world aborted: Backport for [[gerrit:1133937|End EmailAuth enforcement group 2 test (T390662)]], [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]] (duration: 00m 33s)
[20:13:12] <James_F>	 Ack, aborting.
[20:13:32] <James_F>	 Should we hold the whole window, or should I revert the disablement and proceed with the rest?
[20:13:59] <wikibugs>	 (03CR) 10Jforrester: [C:04-1] "This will work for calls to wikifunctions.org but not for those to wikidata.org." [puppet] - 10https://gerrit.wikimedia.org/r/1134050 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris)
[20:14:30] <kostajh>	 James_F: you can proceed with the rest. Just don't sync https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133937, please. 
[20:15:19] <wikibugs>	 (03PS1) 10Jforrester: Revert "End EmailAuth enforcement group 2 test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134051
[20:15:23] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] Revert "End EmailAuth enforcement group 2 test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134051 (owner: 10Jforrester)
[20:15:32] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] "thanks, and sorry for the confusion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134051 (owner: 10Jforrester)
[20:15:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710462 (10phaultfinder)
[20:16:03] <James_F>	 kostajh: No worries!
[20:16:34] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "End EmailAuth enforcement group 2 test" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134051 (owner: 10Jforrester)
[20:17:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501) (owner: 10Esanders)
[20:17:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders)
[20:17:56] <wikibugs>	 (03Merged) 10jenkins-bot: Hide "Insert graph" tool in VE when graphs are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123620 (https://phabricator.wikimedia.org/T387501) (owner: 10Esanders)
[20:17:59] <wikibugs>	 (03Merged) 10jenkins-bot: Enable DiscussionTools visual enhancements on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133972 (https://phabricator.wikimedia.org/T379264) (owner: 10Esanders)
[20:18:02] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review: Create provisioning and post-provisioning checks for Traffic hosts to confirm validity of varying hardware configurations - https://phabricator.wikimedia.org/T378724#10710479 (10CDobbins) On 4/2, we discussed the merits and pitfalls of the proposed implementation with @V...
[20:18:15] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]], [[gerrit:1123620|Hide "Insert graph" tool in VE when graphs are disabled (T387501)]], [[gerrit:1133972|Enable DiscussionTools visual enhancements on zhwiki (T379264)]], [[gerrit:1134051|Revert "End EmailAuth enforcement group 2 test"]]
[20:18:19] <stashbot>	 T387501: Remove "Insert Graph" from VE for now - https://phabricator.wikimedia.org/T387501
[20:18:19] <stashbot>	 T379264: Offer Usability Improvements as default-on feature at English Wikipedia and remaining wikis - https://phabricator.wikimedia.org/T379264
[20:18:45] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester)
[20:18:54] <wikibugs>	 (03PS1) 10Esanders: Mobile insert menu: Exclude media and signature tools [extensions/VisualEditor] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134053 (https://phabricator.wikimedia.org/T385851)
[20:19:05] <James_F>	 edsanders: Do you need that backport happening too?
[20:20:15] <edsanders>	 Yeah
[20:22:51] <edsanders>	 along with a config change...
[20:22:52] <edsanders>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1128374
[20:22:52] <James_F>	 Ack. Please add to the window once it's ready. I'll do these 5(!) together first though.
[20:22:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/VisualEditor] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134053 (https://phabricator.wikimedia.org/T385851) (owner: 10Esanders)
[20:22:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388604) (owner: 10Esanders)
[20:22:52] <sukhe>	 !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.10-1wm1_amd64.changes: T379797
[20:22:52] <sukhe>	 SAL is down hmm
[20:41:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:59] <stashbot>	 T379797: Package and deploy ATS 9.2.6 - https://phabricator.wikimedia.org/T379797
[20:41:59] <logmsgbot>	 !log jforrester@deploy1003 esanders, jforrester: Backport for [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]], [[gerrit:1123620|Hide "Insert graph" tool in VE when graphs are disabled (T387501)]], [[gerrit:1133972|Enable DiscussionTools visual enhancements on zhwiki (T379264)]], [[gerrit:1134051|Revert "End EmailAuth enforcement group 2 test"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:41:59] <James_F>	 edsanders: Please check on mw-debug if things look OK.
[20:41:59] <edsanders>	 James_F: looking
[20:41:59] <edsanders>	 zhwiki looks good (1/4)
[20:41:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10710501 (10phaultfinder)
[20:41:59] <James_F>	 edsanders: How's the rest of the checking going? Can I help?
[20:41:59] <akosiaris>	 James_F: hotpatched 
[20:41:59] <akosiaris>	 it now works
[20:41:59] <edsanders>	 I'm not seeing the mobile insert menu...
[20:41:59] <James_F>	 edsanders: That's not in this deploy? It's waiting on finishing this one first.
[20:41:59] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:41:59] <akosiaris>	 but ... please don't deploy wikifunctions until I've submitted the gerrit change to fix this
[20:41:59] <James_F>	 akosiaris: The WF deploy is just a config change, unrelated.
[20:41:59] <edsanders>	 James_F: ah good - so this is just the zhwiki and the graph patch?
[20:41:59] <James_F>	 akosiaris: Nice!
[20:41:59] <James_F>	 edsanders: Yeah. OK to go.
[20:41:59] <James_F>	 edsanders: ?
[20:41:59] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:41:59] <edsanders>	 James_F: yep - graph tool is hidden
[20:41:59] <James_F>	 akosiaris: Or does your hotpatch touch the MW-land code?
[20:41:59] <akosiaris>	 James_F: nope, not at all. it's only an envoy config change specifically at the orchestrator
[20:41:59] <James_F>	 akosiaris: OK, so can I proceed with the MW-config deploy?
[20:41:59] <akosiaris>	 yup, go ahead
[20:41:59] <James_F>	 Ack.
[20:41:59] <logmsgbot>	 !log jforrester@deploy1003 esanders, jforrester: Continuing with sync
[20:41:59] <akosiaris>	 I am exhausted, I've been in front of a computer for 13 hours, need a break
[20:41:59] <James_F>	 akosiaris: <3
[20:41:59] <James_F>	 akosiaris: Confirm we'll absolutely not be pushing anything to deployment-charts on a Thursday night / Friday. Get some reset.
[20:41:59] <James_F>	 Also rest. Freudian.
[20:41:59] <akosiaris>	 ❤️
[20:41:59] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[20:41:59] <Reedy>	 just to point out CI is unhappy currently due to cloud stuff
[20:41:59] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:41:59] <James_F>	 It never rains but it pours.
[20:41:59] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2056.codfw.wmnet with OS bullseye
[20:41:59] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133948|wikifunctionswiki: Disable 'mathml' mode for Maths, requires RESTbase]], [[gerrit:1123620|Hide "Insert graph" tool in VE when graphs are disabled (T387501)]], [[gerrit:1133972|Enable DiscussionTools visual enhancements on zhwiki (T379264)]], [[gerrit:1134051|Revert "End EmailAuth enforcement group 2 test"]] (duration: 21m 39s)
[20:41:59] <James_F>	 edsanders: OK, next set is your two mobile menu ones.
[20:41:59] <edsanders>	 James_F: looking
[20:41:59] <James_F>	 edsanders: No no, still merging.
[20:41:59] <edsanders>	 ok
[20:41:59] <Reedy>	 might be a while
[20:42:03] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[20:42:52] <James_F>	 Probably 12 mins.
[20:42:58] <brett>	 !log Upload Varnish 7.1.1-1.1~bpo11+wmf2 to bullseye-wikimedia T389605
[20:44:36] <James_F>	 scap is having connection issues to CI ("connection broken by 'RemoteDisconnected('Remote end closed connection without response')'"). Joy.
[20:45:57] <Reedy>	 Apparently things are recovering
[20:46:06] <edsanders>	 np
[20:46:55] <Reedy>	 signs of life on zuul
[20:53:21] * James_F twiddles more thumbs.
[20:54:20] <wikibugs>	 (03Merged) 10jenkins-bot: VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388604) (owner: 10Esanders)
[20:54:24] <James_F>	 Finally.
[20:54:29] <wikibugs>	 (03CR) 10Ebernhardson: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson)
[20:54:33] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710628 (10phaultfinder)
[20:54:40] <James_F>	 That happened a while ago, but finally the WMCS network seems fixed.
[20:55:09] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Add row/rack hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610)
[20:55:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cirrussearch: Add row/rack hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[20:56:39] <wikibugs>	 (03Abandoned) 10Bking: WIP: more fine-grained shard status checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133967 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking)
[20:57:56] <James_F>	 Oy, both CI jobs are in the PostBuildScript stage and stuck.
[20:59:59] <wikibugs>	 (03Merged) 10jenkins-bot: Mobile insert menu: Exclude media and signature tools [extensions/VisualEditor] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134053 (https://phabricator.wikimedia.org/T385851) (owner: 10Esanders)
[21:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250403T2100)
[21:00:29] <James_F>	 And we're off.
[21:00:34] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1134053|Mobile insert menu: Exclude media and signature tools (T385851)]], [[gerrit:1128374|VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias (T388604)]]
[21:00:38] <James_F>	 Sorry, Web team.
[21:00:38] <wikibugs>	 (03PS2) 10Bking: cirrussearch: Add row/rack hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610)
[21:00:45] <stashbot>	 T385851: Introduce additional tools within the mobile visual editor's "+" menu - https://phabricator.wikimedia.org/T385851
[21:00:45] <stashbot>	 T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604
[21:02:03] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[21:04:28] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: Add row/rack hieradata [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[21:04:44] <wikibugs>	 (03CR) 10Bking: [C:03+2] "self-merging in the interest of time" [puppet] - 10https://gerrit.wikimedia.org/r/1134059 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[21:05:39] <jinxer-wm>	 RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:05:39] <James_F>	 edsanders: Please test on mw-debug.
[21:06:20] <logmsgbot>	 !log jforrester@deploy1003 esanders, jforrester: Backport for [[gerrit:1134053|Mobile insert menu: Exclude media and signature tools (T385851)]], [[gerrit:1128374|VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias (T388604)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:06:24] <stashbot>	 T385851: Introduce additional tools within the mobile visual editor's "+" menu - https://phabricator.wikimedia.org/T385851
[21:06:24] <stashbot>	 T388604: [Config] Deploy "+" menu (and new tools) to Phase 1 wikis - https://phabricator.wikimedia.org/T388604
[21:06:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2056.codfw.wmnet with OS bullseye
[21:06:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2056
[21:06:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2056
[21:07:17] <edsanders>	 James_F: we need to hold those two patches :/
[21:07:29] <James_F>	 edsanders: Boo. Hold as in revert?
[21:07:41] <James_F>	 edsanders: Or just revert the config change?
[21:08:04] <wikibugs>	 (03PS1) 10Ebernhardson: Remove unused config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429)
[21:09:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2056.codfw.wmnet with reason: host reimage
[21:09:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710675 (10phaultfinder)
[21:12:25] <Kemayo>	 James_F: slack debate is occurring, just a minute
[21:12:29] <James_F>	 Kemayo: Ack.
[21:12:49] <James_F>	 Thankfully Web don't seem to have turned up to deploy.
[21:13:12] <Kemayo>	 I think their windows normally go unused in my recent memory, which is convenient.
[21:13:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2056.codfw.wmnet with reason: host reimage
[21:13:36] <James_F>	 Kemayo: Yes, but given that to my count at least 6 things have gone wrong today, I'll take the success.
[21:18:53] <edsanders>	 James_F: revert the config change
[21:19:04] <edsanders>	 James_F: both if it's easier (the other one is a no-op with the config)
[21:19:13] <James_F>	 edsanders: It's easier to only revert the config change.
[21:19:22] <logmsgbot>	 !log jforrester@deploy1003 Sync cancelled.
[21:19:45] <wikibugs>	 (03PS1) 10Jforrester: Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134067
[21:20:01] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134067 (owner: 10Jforrester)
[21:21:11] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134067 (owner: 10Jforrester)
[21:21:24] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1134067|Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias"]]
[21:24:12] <James_F>	 edsanders: I guess there's no need to test the no-op config state?
[21:24:40] <edsanders>	 James_F: hopefully not
[21:24:45] <James_F>	 Ack.
[21:25:04] <edsanders>	 I see no changes at the moment
[21:25:16] <James_F>	 Yeah, it's still syncing to mw-debug.
[21:27:34] <edsanders>	 James_F: Ah, yes I'm seeing the change
[21:27:47] <James_F>	 edsanders: It's currently on 10 of 12 servers.
[21:27:51] <James_F>	 So you'll get it sometimes.
[21:28:43] <James_F>	 Reedy: I'm not minded to sling out 1128050 at this point, sorry.
[21:28:54] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1134067|Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:29:05] <James_F>	 edsanders: All OK?
[21:29:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610
[21:29:22] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch* for ban cirrus nodes to prevent replication problems - bking@cumin2002 - T388610
[21:29:23] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[21:29:35] <edsanders>	 James_F: looks good - ca.wiki is back to normal
[21:29:39] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[21:29:44] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[21:29:44] <jinxer-wm>	 Deployment function-orchestrator-main-orchestrator in wikifunctions at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=wikifunctions&var-deployment=function-orchestrator-main-orchestrator - ...
[21:29:44] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[21:30:08] <James_F>	 Oh dear.
[21:30:24] <James_F>	 That was a.kosiaris's hotpatch target.
[21:30:32] <Kemayo>	 😬
[21:31:11] <wikibugs>	 (03CR) 10Cwhite: [C:03+1] prometheus: cleanup k8s instances from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[21:31:32] <wikibugs>	 (03CR) 10Cwhite: [C:03+1] prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi)
[21:31:38] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:35:13] <wikibugs>	 (03PS1) 10Andrew Bogott: trove: pin 1.x version of sqlalchemy [puppet] - 10https://gerrit.wikimedia.org/r/1134072
[21:36:17] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134072 (owner: 10Andrew Bogott)
[21:36:52] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134067|Revert "VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias"]] (duration: 15m 28s)
[21:37:16] <James_F>	 !log Backport deploy done.
[21:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:17] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134057 (owner: 10Jforrester)
[21:38:36] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester)
[21:38:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2056.codfw.wmnet with OS bullseye
[21:40:12] <wikibugs>	 (03PS2) 10Andrew Bogott: trove: pin 1.x version of sqlalchemy [puppet] - 10https://gerrit.wikimedia.org/r/1134072
[21:40:14] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134072 (owner: 10Andrew Bogott)
[21:42:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] trove: pin 1.x version of sqlalchemy [puppet] - 10https://gerrit.wikimedia.org/r/1134072 (owner: 10Andrew Bogott)
[21:43:41] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Add cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1134078 (https://phabricator.wikimedia.org/T388610)
[21:45:03] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1134078 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[21:49:14] <James_F>	 I've filed T391047 for the KubernetesDeploymentUnavailableReplicas for us and silenced it (I hope).
[21:49:17] <stashbot>	 T391047: function-orchestrator-main-orchestrator pods down in codfw due to issue in envoy config(?) - https://phabricator.wikimedia.org/T391047
[21:52:50] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049 (10ops-monitoring-bot) 03NEW
[22:17:58] <wikibugs>	 (03PS13) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610)
[22:19:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10710917 (10phaultfinder)
[22:25:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:29:21] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "trove: pin 1.x version of sqlalchemy" [puppet] - 10https://gerrit.wikimedia.org/r/1134087
[22:30:04] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+2] Revert "trove: pin 1.x version of sqlalchemy" [puppet] - 10https://gerrit.wikimedia.org/r/1134087 (owner: 10Andrew Bogott)
[22:36:21] <wikibugs>	 06SRE, 06Data-Engineering, 06Traffic-Icebox, 10MobileFrontend (Tracking): RFC: Remove m-dot subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998#10710939 (10toni.stoev) >>! In T214998#10676078, @bd808 wrote: > @toni.stoev Please read https://www.medi...
[22:47:00] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10710968 (10Jdlrobson-WMF) @Ladsgroup let me know if and how I can help with this, but untagging web team.
[23:12:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service,opensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:13:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049#10711091 (10Ladsgroup) It's a random s5 replica. I don't think we depool hosts with degraded RAID so I leave it as is until dc-ops handle it. Hot swap should be enough.
[23:15:39] <jinxer-wm>	 FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:25:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tstarling@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[23:29:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10711171 (10phaultfinder)
[23:29:49] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Codex and Multiblocks in German and Italian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133592 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy)
[23:30:03] <logmsgbot>	 !log tstarling@deploy1003 Started scap sync-world: Backport for [[gerrit:1133592|Enable Codex and Multiblocks in German and Italian wiki (T377121)]]
[23:30:06] <stashbot>	 T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121
[23:32:30] <jinxer-wm>	 FIRING: Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has improved   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[23:35:37] <logmsgbot>	 !log tstarling@deploy1003 hmonroy, tstarling: Backport for [[gerrit:1133592|Enable Codex and Multiblocks in German and Italian wiki (T377121)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:35:40] <stashbot>	 T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121
[23:38:37] <logmsgbot>	 !log tstarling@deploy1003 hmonroy, tstarling: Continuing with sync
[23:40:22] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134093
[23:40:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134093 (owner: 10TrainBranchBot)
[23:42:12] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:45:28] <logmsgbot>	 !log tstarling@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133592|Enable Codex and Multiblocks in German and Italian wiki (T377121)]] (duration: 15m 25s)
[23:45:31] <stashbot>	 T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121
[23:52:30] <jinxer-wm>	 RESOLVED: Traffic bill over quota: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota Has improved   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[23:52:40] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134093 (owner: 10TrainBranchBot)