[00:08:04] <wikibugs>	 (PS2) Eevans: cassandra: setup 'dev' target for Cassandra 4.1.8 [puppet] - https://gerrit.wikimedia.org/r/1121102 (https://phabricator.wikimedia.org/T385819)
[00:08:36] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121102 (https://phabricator.wikimedia.org/T385819) (owner: Eevans)
[00:10:44] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:31:06] <wikibugs>	 (CR) Cwhite: [C:+1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/1120923 (owner: Filippo Giunchedi)
[00:38:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1121115
[00:38:40] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1121115 (owner: TrainBranchBot)
[00:48:38] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1121115 (owner: TrainBranchBot)
[00:49:56] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:50:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:50:22] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:04:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:07:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:08:35] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1121116
[01:08:35] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1121116 (owner: TrainBranchBot)
[01:09:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:11:18] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:13:18] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:15:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:17:12] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:21:12] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:26:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:30:05] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1121116 (owner: TrainBranchBot)
[01:32:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:38:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:40:18] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:40:54] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:41:18] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:43:54] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:44:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:44:56] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:45:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:45:22] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:46:28] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/416455e1d3b20ea1bf708c9423206d03c25f3f045cc4ad254c29b7c6955e1ea2/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:47:56] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:48:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:48:22] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:50:22] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:51:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:53:22] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:54:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:55:14] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:00:18] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:01:24] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:03:14] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:06:28] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:09:44] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:10:16] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:12:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:13:22] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:21:12] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:26:56] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:27:22] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:28:16] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:33:56] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:34:16] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:34:20] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:34:22] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10566388 (phaultfinder)
[03:32:12] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:57:21] <jinxer-wm>	 FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:06:22] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2025-02-20-032928-production [deployment-charts] - https://gerrit.wikimedia.org/r/1120709 (https://phabricator.wikimedia.org/T386677)
[04:08:38] <icinga-wm>	 PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:52:21] <jinxer-wm>	 RESOLVED: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn   - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:07:22] <kart_>	 Deploying cxserver..
[05:07:30] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-02-20-032928-production [deployment-charts] - https://gerrit.wikimedia.org/r/1120709 (https://phabricator.wikimedia.org/T386677) (owner: KartikMistry)
[05:08:37] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-02-20-032928-production [deployment-charts] - https://gerrit.wikimedia.org/r/1120709 (https://phabricator.wikimedia.org/T386677) (owner: KartikMistry)
[05:14:08] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:14:33] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:31:18] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:31:47] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:33:25] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:33:59] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:34:35] <kart_>	 !log Updated cxserver to 2025-02-20-032928-production (T386677, T386464)
[05:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:39] <stashbot>	 T386677: Automatic translation failed error when translating from de -> en using CX - https://phabricator.wikimedia.org/T386677
[05:34:40] <stashbot>	 T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464
[05:49:18] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 206582224 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:49:52] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1257/IPv6: Connect - Tele2, AS1257/IPv4: Connect - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:50:18] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 110760 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:08:22] <wikibugs>	 (PS1) Stevemunene: Port disk space check for hadoop worker to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080)
[06:09:57] <wikibugs>	 (CR) CI reject: [V:-1] Port disk space check for hadoop worker to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: Stevemunene)
[06:12:58] <wikibugs>	 (PS2) Stevemunene: Port disk space check for hadoop worker to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080)
[06:16:26] <icinga-wm>	 PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:19:32] <icinga-wm>	 RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:23:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:30:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:52:12] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0700).
[07:05:19] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1036.eqiad.wmnet with reason: remove from cluster for reimage
[07:05:31] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566767 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9ff89e50-cdd1-449a-a676-876c36729c2f) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[07:08:19] <wikibugs>	 (CR) Vgutierrez: [C:+1] haproxykafka: limit memory usage to 5% of total physical memory [puppet] - https://gerrit.wikimedia.org/r/1120922 (https://phabricator.wikimedia.org/T386747) (owner: Fabfur)
[07:08:31] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch ganeti1036 to nftables [puppet] - https://gerrit.wikimedia.org/r/1120934 (owner: Muehlenhoff)
[07:10:22] <wikibugs>	 (CR) Vgutierrez: [C:+2] aptrepo,haproxy: Allow installing HAProxy 1.3 on bullseye [puppet] - https://gerrit.wikimedia.org/r/1120926 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[07:12:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1036.eqiad.wmnet
[07:23:35] <wikibugs>	 (PS5) Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639
[07:23:35] <wikibugs>	 (PS4) Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1117225
[07:23:35] <wikibugs>	 (PS3) Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - https://gerrit.wikimedia.org/r/1117547
[07:23:36] <wikibugs>	 (PS4) Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548
[07:23:52] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 69, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:26:04] <wikibugs>	 (CR) CI reject: [V:-1] Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548 (owner: Giuseppe Lavagetto)
[07:27:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1025.eqiad.wmnet to cluster eqiad and group A
[07:29:45] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1025.eqiad.wmnet to cluster eqiad and group A
[07:42:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904 (Ben.buchenau) NEW
[07:48:51] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ganeti1036.eqiad.wmnet
[07:57:26] <wikibugs>	 (PS3) Isabelle Hurbain-Palatin: Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[07:58:11] <wikibugs>	 (CR) CI reject: [V:-1] Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[07:59:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1036.eqiad.wmnet with OS bookworm
[07:59:28] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1036.eqiad.wmnet with OS bookworm
[07:59:41] <wikibugs>	 (PS4) Isabelle Hurbain-Palatin: Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:07:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[08:17:48] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:18:34] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566806 (MoritzMuehlenhoff)
[08:20:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1036.eqiad.wmnet with reason: host reimage
[08:20:56] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1036.eqiad.wmnet with reason: host reimage
[08:21:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet
[08:22:37] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566807 (ops-monitoring-bot) Draining ganeti1026.eqiad.wmnet of running VMs
[08:23:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet
[08:25:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet
[08:26:09] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566810 (ops-monitoring-bot) Draining ganeti1026.eqiad.wmnet of running VMs
[08:28:25] <moritzm>	 !log installing ruby2.7 security updates
[08:28:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:37] <wikibugs>	 (PS1) Elukey: services: bump allocated memory for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1121309 (https://phabricator.wikimedia.org/T386648)
[08:33:35] <wikibugs>	 (CR) Fabfur: [C:+2] haproxykafka: limit memory usage to 5% of total physical memory [puppet] - https://gerrit.wikimedia.org/r/1120922 (https://phabricator.wikimedia.org/T386747) (owner: Fabfur)
[08:34:48] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:36:29] <wikibugs>	 (CR) Elukey: [C:+2] services: bump allocated memory for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1121309 (https://phabricator.wikimedia.org/T386648) (owner: Elukey)
[08:37:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[08:37:38] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[08:37:54] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync
[08:38:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync
[08:38:59] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[08:39:31] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[08:42:23] <wikibugs>	 (CR) Jgiannelos: Turn on Parsoid Read Views for 27 wiktionaries (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[08:42:32] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-worker1002*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[08:42:57] <vgutierrez>	 !log uploaded haproxy 3.1.3 to thirdparty/haproxy31 - T386796
[08:42:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:00] <stashbot>	 T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796
[08:44:14] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1002.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[08:46:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1036.eqiad.wmnet with OS bookworm
[08:46:40] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566861 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1036.eqiad.wmnet with OS bookworm completed: - ganeti103...
[08:52:13] <wikibugs>	 (PS1) Brouberol: opensearch:cirrus: add the opensearch- prefix to some plugins [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752)
[08:52:41] <wikibugs>	 (PS2) Brouberol: opensearch:cirrus: add the opensearch- prefix to some plugins [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752)
[08:53:35] <wikibugs>	 (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4959/co" [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752) (owner: Brouberol)
[09:00:05] <jouncebot>	 dancy and andre: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0900).
[09:01:32] <wikibugs>	 (PS1) Elukey: services: update cpu resources for kartotherian's mesh/statsd containers [deployment-charts] - https://gerrit.wikimedia.org/r/1121315 (https://phabricator.wikimedia.org/T386648)
[09:04:48] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:07:41] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10566894 (WMDE-leszek) I confirm @Ben.buchenau 's affiliation with WMDE, and approve the request on WMDE's end. While you're at it, mind adding Ben's account to the `wm...
[09:09:32] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: prometheus: node_kernel_messages: ensure /etc/prometheus exists [puppet] - https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850)
[09:12:19] <wikibugs>	 (PS1) DCausse: Fix typo in opensearch-analysis-stconvert [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1121317
[09:13:17] <wikibugs>	 (CR) DCausse: [C:+1] opensearch:cirrus: add the opensearch- prefix to some plugins [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752) (owner: Brouberol)
[09:14:08] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: Arturo Borrero Gonzalez)
[09:19:02] <wikibugs>	 (CR) Majavah: "most of the prometheus-*-exporter packages do provision this dir, I would maybe depend on `Package['prometheus-node-exporter']` instead as" [puppet] - https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: Arturo Borrero Gonzalez)
[09:20:59] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "I don't think that would be deterministic enough :-(" [puppet] - https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: Arturo Borrero Gonzalez)
[09:23:48] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:24:20] <wikibugs>	 (CR) Filippo Giunchedi: "recheck" [alerts] - https://gerrit.wikimedia.org/r/1120923 (owner: Filippo Giunchedi)
[09:26:43] <wikibugs>	 (PS1) Urbanecm: beta: Do not undeclare wmgGEActiveExperiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846)
[09:27:38] <wikibugs>	 (CR) Urbanecm: [V:+1] "Expected new variables show in https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/3571/console." [mediawiki-config] - https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: Urbanecm)
[09:27:41] <wikibugs>	 (CR) Brouberol: [V:+1 C:+2] opensearch:cirrus: add the opensearch- prefix to some plugins [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752) (owner: Brouberol)
[09:28:39] <wikibugs>	 ops-esams, ops-magru, SRE, DC-Ops, and 2 others: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10566946 (elukey) @BCornwall the easiest way is probably to use test-cookbook on a cumin host, using a depooled magru cp node as target. Once we are sure that the settin...
[09:29:45] <wikibugs>	 (CR) Elukey: "Hi! Added a comment to the task. I'd prefer that we tested this via test-cookbook on a single magru cp node, to verify the settings applie" [cookbooks] - https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: BCornwall)
[09:29:57] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] o11y: promote thanos compact alerts to critical [alerts] - https://gerrit.wikimedia.org/r/1120923 (owner: Filippo Giunchedi)
[09:30:03] <wikibugs>	 (CR) Brouberol: [C:+1] Fix typo in opensearch-analysis-stconvert [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1121317 (owner: DCausse)
[09:33:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet
[09:33:28] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] "Thinks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: 10Urbanecm)
[09:33:54] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] "*Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: 10Urbanecm)
[09:34:54] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10566953 (10fgiunchedi) >>! In T384731#10565308, @cmooney wrote: >>>! In T384731#10563685, @ayounsi wrote: >> Is it...
[09:36:00] <urbanecm>	 jouncebot: nowandnext
[09:36:00] <jouncebot>	 For the next 1 hour(s) and 23 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0900)
[09:36:00] <jouncebot>	 In 1 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1100)
[09:40:05] <wikibugs>	 (03PS4) 10Vgutierrez: hiera,swift: Enable IPIP on ms-fe@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564)
[09:40:05] <wikibugs>	 (03PS3) 10Vgutierrez: hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564)
[09:40:29] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[09:40:56] <wikibugs>	 (03PS4) 10Vgutierrez: hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564)
[09:41:09] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[09:43:16] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:43:24] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:43:54] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:44:46] <wikibugs>	 (03CR) 10MVernon: [C:03+1] hiera,swift: Enable IPIP on ms-fe@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[09:45:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10566958 (10cmooney) Thanks for the update @fgiunchedi >  >! In T384731#10566953, @fgiunchedi wrote: >> And what ha...
[09:47:42] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera,swift: Enable IPIP on ms-fe@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[09:48:22] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1036.eqiad.wmnet
[09:49:09] <wikibugs>	 (03PS1) 10Filippo Giunchedi: icinga: temp remove check for virt.cloudgw.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1121319
[09:50:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] icinga: temp remove check for virt.cloudgw.eqiad1.wikimediacloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1121319 (owner: 10Filippo Giunchedi)
[09:51:24] <vgutierrez>	 !log enabling IPIP encapsulation for swift-fe@codfw - T385564
[09:51:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:28] <stashbot>	 T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564
[09:55:40] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene)
[09:59:34] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.dns.netbox
[10:07:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch ganeti1026 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1121320
[10:08:20] <wikibugs>	 (03CR) 10Elukey: [C:03+2] "Given how easy this is I'll proceed, please tell me if anything doesn't look ok and I'll amend :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121315 (https://phabricator.wikimedia.org/T386648) (owner: 10Elukey)
[10:10:17] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[10:10:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[10:10:34] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync
[10:11:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Nice, thank you! since this is essentially the same as DiskSpace in team-sre/resources.yaml (modulo "runbook" link) what we could also do " [alerts] - 10https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene)
[10:11:11] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync
[10:11:42] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: 10Arturo Borrero Gonzalez)
[10:11:43] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[10:12:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[10:13:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:13:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus: node_kernel_messages: ensure /etc/prometheus exists [puppet] - 10https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: 10Arturo Borrero Gonzalez)
[10:13:39] <wikibugs>	 (03CR) 10Urbanecm: [V:03+1 C:03+2] "beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: 10Urbanecm)
[10:14:23] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Do not undeclare wmgGEActiveExperiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: 10Urbanecm)
[10:14:51] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002"
[10:14:56] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002"
[10:14:57] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:16:06] <vgutierrez>	 !log restarting pybal on lvs2014 - T385564
[10:16:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:10] <stashbot>	 T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564
[10:16:40] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1003.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[10:17:29] <wikibugs>	 (03PS2) 10Urbanecm: [Growth] enwiki: Release Add Link to 15% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120925 (https://phabricator.wikimedia.org/T386029)
[10:17:32] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [Growth] enwiki: Release Add Link to 15% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120925 (https://phabricator.wikimedia.org/T386029) (owner: 10Urbanecm)
[10:18:17] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] enwiki: Release Add Link to 15% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120925 (https://phabricator.wikimedia.org/T386029) (owner: 10Urbanecm)
[10:18:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet
[10:18:58] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]]
[10:19:02] <stashbot>	 T386029: Add a link (Structured task): Increase rollout on English Wikipedia to 15% - https://phabricator.wikimedia.org/T386029
[10:20:51] <wikibugs>	 (03CR) 10Isabelle Hurbain-Palatin: Turn on Parsoid Read Views for 27 wiktionaries (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: 10Arlolra)
[10:22:25] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:22:29] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[10:22:58] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.dns.wipe-cache virt.cloudgw.eqiad1.wikimediacloud.org on all recursors
[10:23:02] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) virt.cloudgw.eqiad1.wikimediacloud.org on all recursors
[10:23:24] <wikibugs>	 (03PS1) 10FNegri: prometheus::node_kernel_messages: ignore some false positives [puppet] - 10https://gerrit.wikimedia.org/r/1121321 (https://phabricator.wikimedia.org/T386850)
[10:24:11] <vgutierrez>	 !log restarting pybal on lvs2013, effectively enabling IPIP encapsulation for swift-fe@codfw - T385564
[10:24:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:15] <stashbot>	 T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564
[10:25:02] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:25:51] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "icinga: temp remove check for virt.cloudgw.eqiad1.wikimediacloud.org" [puppet] - 10https://gerrit.wikimedia.org/r/1121322
[10:26:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] Revert "icinga: temp remove check for virt.cloudgw.eqiad1.wikimediacloud.org" [puppet] - 10https://gerrit.wikimedia.org/r/1121322 (owner: 10Filippo Giunchedi)
[10:27:10] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow1002.eqiad.wmnet with reason: disabling gnmic in systemd
[10:28:02] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:32:41] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hiera: restore thanos retention settings [puppet] - 10https://gerrit.wikimedia.org/r/1121324 (https://phabricator.wikimedia.org/T357747)
[10:34:15] <logmsgbot>	 !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1026.eqiad.wmnet with reason: remove from cluster for reimage
[10:34:26] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10567066 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8efe0251-40ee-433b-a080-3bef582e4f79) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[10:34:33] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]]
[10:34:37] <stashbot>	 T386029: Add a link (Structured task): Increase rollout on English Wikipedia to 15% - https://phabricator.wikimedia.org/T386029
[10:35:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 7.143% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:36:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext/next (k8s) 1.523s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:36:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1026 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1121320 (owner: 10Muehlenhoff)
[10:37:42] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:37:47] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[10:38:01] <wikibugs>	 (03CR) 10MVernon: [C:03+1] hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[10:40:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:41:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext/next (k8s) 1.523s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:41:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1026.eqiad.wmnet
[10:43:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1036.eqiad.wmnet to cluster eqiad and group B
[10:44:11] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1036.eqiad.wmnet to cluster eqiad and group B
[10:44:24] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]] (duration: 09m 50s)
[10:44:28] <stashbot>	 T386029: Add a link (Structured task): Increase rollout on English Wikipedia to 15% - https://phabricator.wikimedia.org/T386029
[10:46:03] <wikibugs>	 (03PS5) 10Vgutierrez: hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564)
[10:48:37] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[10:54:40] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netflow1002.eqiad.wmnet with reason: keeping gnmic running in debug mode to observe performance change
[10:57:42] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10567098 (10MoritzMuehlenhoff)
[10:58:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet
[10:59:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1121321 (https://phabricator.wikimedia.org/T386850) (owner: 10FNegri)
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1100)
[11:00:18] <wikibugs>	 (03PS1) 10Gergő Tisza: Restore "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121328 (https://phabricator.wikimedia.org/T386836)
[11:00:37] <wikibugs>	 (03PS1) 10Gergő Tisza: Restore "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121329 (https://phabricator.wikimedia.org/T386836)
[11:01:22] <wikibugs>	 (03PS1) 10Gergő Tisza: SharedDomainUtils: Avoid early instantiation of NamespaceInfo [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121330 (https://phabricator.wikimedia.org/T386836)
[11:02:09] <wikibugs>	 (03PS1) 10Gergő Tisza: SharedDomainUtils: Avoid early instantiation of NamespaceInfo [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121332 (https://phabricator.wikimedia.org/T386836)
[11:02:30] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=wikikube-worker1003.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[11:02:36] <wikibugs>	 (03PS1) 10Gergő Tisza: Make sure isSul3Enabled() is a boolean [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121333 (https://phabricator.wikimedia.org/T384549)
[11:03:09] <wikibugs>	 (03PS1) 10Gergő Tisza: Make sure isSul3Enabled() is a boolean [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121334 (https://phabricator.wikimedia.org/T384549)
[11:03:39] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121328 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[11:03:46] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121329 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[11:04:01] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121330 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[11:04:14] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121332 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[11:05:00] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121333 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza)
[11:05:11] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121334 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza)
[11:07:35] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:07:46] <wikibugs>	 (03PS1) 10Stevemunene: Create dse-k8s control panel partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900)
[11:07:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:08:06] <vgutierrez>	 !log restarting pybal on lvs1020 - T385564
[11:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:10] <stashbot>	 T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564
[11:08:31] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:08:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet
[11:08:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1026.eqiad.wmnet
[11:09:59] <vgutierrez>	 !log restarting pybal on lvs1019, effectively enabling IPIP encapsulation for swift-fe@eqiad - T385564
[11:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:02] <Lucas_WMDE>	 tgr|away: for me backporting the MediaWikiServices change would be okay
[11:11:12] <Lucas_WMDE>	 assuming we can reach agreement to merge it on master
[11:14:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1124:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1124 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:15:18] <wikibugs>	 (03PS1) 10Vgutierrez: service: Switch swift and swift-https to maglev [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564)
[11:15:47] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[11:20:33] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[11:25:19] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "I don't claim to understand the significance of moving to maglev from wrr, but this change looks to do what it says it does." [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[11:27:04] <wikibugs>	 (03PS1) 10Sergio Gimeno: LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551)
[11:27:20] <wikibugs>	 (03PS1) 10Sergio Gimeno: LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121338 (https://phabricator.wikimedia.org/T369551)
[11:33:21] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:35:17] <wikibugs>	 (03CR) 10Hnowlan: "lgtm with a but - we currently override php.servergroup at helmfile level for every mw-* deployment. Will that break these behaviours?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto)
[11:35:19] <wikibugs>	 (03PS1) 10Vgutierrez: liberica: USE CAP_NET_RAW instead of CAP_NET_ADMIN for healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1121339
[11:36:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno)
[11:39:23] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] service: Switch swift and swift-https to maglev [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez)
[11:40:26] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: 10Arlolra)
[11:41:02] <vgutierrez>	 !log restarting pybal on lvs2014 - T385564
[11:41:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:06] <stashbot>	 T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564
[11:42:21] <jinxer-wm>	 FIRING: ProbeDown: Service shellbox-syntaxhighlight:4014 has failed probes (http_shellbox-syntaxhighlight_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-syntaxhighlight:4014 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:42:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:43:54] <vgutierrez>	 !log restarting pybal on lvs2013, effectively switching swift-fe@codfw to maglev - T385564
[11:43:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:33] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:46:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:46:49] <hnowlan>	 hrm, what just happened to shellbox-syntaxhighlight?
[11:47:21] <jinxer-wm>	 RESOLVED: ProbeDown: Service shellbox-syntaxhighlight:4014 has failed probes (http_shellbox-syntaxhighlight_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-syntaxhighlight:4014 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:47:49] <vgutierrez>	 !log restarting pybal on lvs1020 - T385564
[11:47:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:53] <stashbot>	 T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564
[11:48:05] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:48:43] <vgutierrez>	 uh?
[11:49:03] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:49:32] <vgutierrez>	 I'm assuming that's bad timing with the pybal restart on lvs1020.. BGP session is back
[11:51:12] <vgutierrez>	 !log restarting pybal on lvs1019, effectively switching swift-fe@eqiad to maglev - T385564
[11:51:14] <wikibugs>	 (03PS1) 10Andrew Bogott: rename validatelabsfqdn.py to validatecloudvpsfqdn.py [puppet] - 10https://gerrit.wikimedia.org/r/1121342
[11:51:14] <wikibugs>	 (03PS1) 10Andrew Bogott: realm.pp: remove use of $labsproject [puppet] - 10https://gerrit.wikimedia.org/r/1121343
[11:51:14] <wikibugs>	 (03PS1) 10Andrew Bogott: validatecloudvpsfqdn.py: Support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030)
[11:51:15] <wikibugs>	 (03PS1) 10Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030)
[11:51:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:17] <wikibugs>	 (03PS1) 10Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030)
[11:51:18] <wikibugs>	 (03PS1) 10Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1121347
[11:52:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott)
[11:54:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott)
[11:54:55] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:54:59] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:00:05] <jouncebot>	 urbanecm, sergi0, and Cyndywikime: Time to do the Community Configuration migration deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1200).
[12:00:11] <icinga-wm>	 RECOVERY - Host ms-be2075 is UP: PING WARNING - Packet loss = 77%, RTA = 33.31 ms
[12:00:45] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto)
[12:00:46] <sergi0>	 Hi
[12:03:36] <wikibugs>	 (03PS1) 10FNegri: prometheus::node_kernel_messages: add new line to ignore list [puppet] - 10https://gerrit.wikimedia.org/r/1121348 (https://phabricator.wikimedia.org/T386850)
[12:04:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1124:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1124 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:04:41] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+2] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno)
[12:04:51] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+2] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121338 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno)
[12:06:35] <icinga-wm>	 PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100%
[12:09:19] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: use testwiki PCS without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1121350 (https://phabricator.wikimedia.org/T385719)
[12:14:30] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] trafficserver: use testwiki PCS without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1121350 (https://phabricator.wikimedia.org/T385719) (owner: 10Hnowlan)
[12:15:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno)
[12:20:48] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918 (10mszabo) 03NEW
[12:21:04] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918#10567349 (10mszabo)
[12:21:40] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "Absolutely +1" [puppet] - 10https://gerrit.wikimedia.org/r/1121339 (owner: 10Vgutierrez)
[12:21:42] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+2] "..." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno)
[12:23:17] <wikibugs>	 (03Merged) 10jenkins-bot: LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121338 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno)
[12:30:56] <wikibugs>	 (03CR) 10CI reject: [V:04-1] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno)
[12:34:03] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10567389 (10cmooney) I ran gnmic in debug mode on netflow1002 but nothing is jumping out at me as a problem, at least on a basic review of the logs.  One thing I do notice, and...
[12:39:20] <wikibugs>	 (03Abandoned) 10Sergio Gimeno: LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno)
[12:39:27] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918#10567395 (10kostajh) Approving as @mszabo's interim manager.
[12:40:52] <wikibugs>	 (03PS1) 10Sergio Gimeno: Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121358
[12:40:59] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+2] Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121358 (owner: 10Sergio Gimeno)
[12:45:34] <wikibugs>	 (03CR) 10Hnowlan: [C:04-1] "nit: I realise this isn't a real functional chart per se, but are there minimal fixtures that could go here?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto)
[12:45:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1121339 (owner: 10Vgutierrez)
[12:45:54] <wikibugs>	 (03CR) 10Volans: "As agreed in the call, I did a pass to the CRs as they are now." [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 (owner: 10Federico Ceratto)
[12:46:00] <wikibugs>	 (03CR) 10Volans: "As agreed in the call, I did a pass to the CRs as they are now." [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (owner: 10Federico Ceratto)
[12:48:02] <wikibugs>	 (03CR) 10Isabelle Hurbain-Palatin: Turn on Parsoid Read Views for 27 wiktionaries (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: 10Arlolra)
[12:49:59] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121358 (owner: 10Sergio Gimeno)
[12:52:50] <logmsgbot>	 !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1121338|LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. (T369551)]], [[gerrit:1121358|Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds."]]
[12:52:54] <stashbot>	 T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551
[12:53:58] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Dashboards in Superset for harroyo-wmf - https://phabricator.wikimedia.org/T386922 (10hector.arroyo) 03NEW
[12:54:44] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Dashboards in Superset for harroyo-wmf - https://phabricator.wikimedia.org/T386922#10567459 (10kostajh) Approving as @hector.arroyo's interim manager
[12:55:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:55:56] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1121338|LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. (T369551)]], [[gerrit:1121358|Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds."]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:56:14] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset for harroyo-wmf - https://phabricator.wikimedia.org/T386922#10567463 (10hector.arroyo)
[12:56:52] <wikibugs>	 (03CR) 10Sergio Gimeno: [C:03+2] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. (031 comment) [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121338 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno)
[12:56:58] <logmsgbot>	 !log sgimeno@deploy2002 sgimeno: Continuing with sync
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1300)
[13:01:27] <sergi0>	 I'm still deploying the changes from the CC window, not much left
[13:03:33] <logmsgbot>	 !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121338|LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. (T369551)]], [[gerrit:1121358|Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds."]] (duration: 10m 43s)
[13:03:37] <stashbot>	 T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551
[13:03:51] <sergi0>	 Done
[13:05:28] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: increase replicas in ref quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121362
[13:08:02] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] ml-services: increase replicas in ref quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121362 (owner: 10Ilias Sarantopoulos)
[13:08:47] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: increase replicas in ref quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121362 (owner: 10Ilias Sarantopoulos)
[13:08:47] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10567530 (10cmooney) Also fwiw I grabbed the same stats for 24 hours from both prometheus servers, and compared the total stats.  In total there are 115 gaps in the data, 68 of...
[13:09:53] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: increase replicas in ref quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121362 (owner: 10Ilias Sarantopoulos)
[13:10:23] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[13:10:35] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[13:15:24] <wikibugs>	 (03CR) 10Stevemunene: "I think we should go as is in case we need to adjust the min values later on" [alerts] - 10https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene)
[13:20:28] <wikibugs>	 (03CR) 10Ladsgroup: Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson)
[13:20:46] <wikibugs>	 (03PS1) 10Elukey: services: double the capacity for Kartotherian in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121363 (https://phabricator.wikimedia.org/T386926)
[13:23:14] <wikibugs>	 (03PS2) 10Ladsgroup: Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson)
[13:23:38] <wikibugs>	 (03CR) 10Ladsgroup: Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson)
[13:23:49] <Amir1>	 jouncebot: nowandnext
[13:23:49] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1300)
[13:23:49] <jouncebot>	 In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1400)
[13:24:00] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson)
[13:24:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson)
[13:24:45] <wikibugs>	 (03Merged) 10jenkins-bot: Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson)
[13:25:12] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1121098|Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (T384619)]]
[13:25:16] <stashbot>	 T384619: Update skins to support different logos at different resolutions - https://phabricator.wikimedia.org/T384619
[13:28:11] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, jdlrobson: Backport for [[gerrit:1121098|Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (T384619)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:28:28] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Bump versions of Java 11/17 production images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120544 (owner: 10Muehlenhoff)
[13:30:32] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, jdlrobson: Continuing with sync
[13:33:08] <wikibugs>	 (03PS2) 10Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030)
[13:33:08] <wikibugs>	 (03PS2) 10Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030)
[13:33:08] <wikibugs>	 (03PS2) 10Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1121347
[13:33:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott)
[13:36:10] <wikibugs>	 (03PS1) 10David Caro: toolforge: add jobs-emailer stats gathering [puppet] - 10https://gerrit.wikimedia.org/r/1121364 (https://phabricator.wikimedia.org/T320284)
[13:37:02] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Create dse-k8s control panel partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene)
[13:37:07] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121098|Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (T384619)]] (duration: 11m 54s)
[13:37:10] <stashbot>	 T384619: Update skins to support different logos at different resolutions - https://phabricator.wikimedia.org/T384619
[13:40:46] <wikibugs>	 (03PS3) 10Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030)
[13:40:46] <wikibugs>	 (03PS3) 10Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030)
[13:40:46] <wikibugs>	 (03PS3) 10Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1121347
[13:44:26] <wikibugs>	 (03CR) 10David Caro: [V:03+1] "Manually tested in tools:" [puppet] - 10https://gerrit.wikimedia.org/r/1121364 (https://phabricator.wikimedia.org/T320284) (owner: 10David Caro)
[13:50:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:51:33] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] services: double the capacity for Kartotherian in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121363 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey)
[13:55:27] <wikibugs>	 (03PS1) 10Jforrester: Re-update function-schemata sub-module to HEAD (39b22ad) [extensions/WikiLambda] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121366
[13:55:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:56:27] <wikibugs>	 (03PS1) 10Brouberol: airflow-research: allow task pods to reach out to gitlab.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121367 (https://phabricator.wikimedia.org/T386933)
[13:57:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1161:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1161 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:59:21] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: double the capacity for Kartotherian in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121363 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey)
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1400).
[14:00:05] <jouncebot>	 Daimona, ihurbain, and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:16] <ihurbain>	 indeed!
[14:00:23] <Daimona>	 o/
[14:00:24] * TheresNoTime is not able to deploy today!
[14:00:27] <tgr|away>	 o/
[14:00:29] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync
[14:00:38] <Lucas_WMDE>	 I’m in a meeting, so I probably can’t deploy
[14:00:52] <ihurbain>	 so IDEALLY i'd like to deploy my own, BUT if i do that it would be my very first own deploy, so i'd need someone to hold my hand and tell me to breathe :D
[14:01:04] <ihurbain>	 (i *think* i have the proper rights for it, and i have read doc this morning)
[14:01:04] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync
[14:01:19] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[14:01:47] <Lucas_WMDE>	 it looks like you’re in the deployment group, yes ^^
[14:01:49] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[14:02:11] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] trafficserver: use testwiki PCS without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1121350 (https://phabricator.wikimedia.org/T385719) (owner: 10Hnowlan)
[14:02:28] <tgr|away>	 ihurbain: scap backport is very easy to use, but we can help in case of any trouble
[14:02:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1161:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1161 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:02:46] <ihurbain>	 do i feel bold enough.
[14:03:12] <apergos>	 well Daimona is before you in the list so you could build up some courage first :-D
[14:03:37] <wikibugs>	 (03CR) 10Ssingh: "For posterity, we made a typo in the commit message: it is 3.1 that we imported and not 1.3." [puppet] - 10https://gerrit.wikimedia.org/r/1120926 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez)
[14:03:53] <tgr|away>	 or you can deploy that patch as well
[14:04:00] <apergos>	 good point!
[14:04:12] <tgr|away>	 and by the time you get to your own patch, you'll already be an experienced deployer
[14:04:35] <ihurbain>	 okay. folks, hold my hand, i'm trying to do the deployment window.
[14:04:42] <apergos>	 sweet!
[14:04:45] <ihurbain>	 (aaaa!)
[14:05:07] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:05:07] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:05:16] <ihurbain>	 so: I can do the deploys today! (following documentation.)
[14:05:39] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:07:08] <ihurbain>	 Daimona: starting with yours.
[14:07:25] <Daimona>	 Yay! Good luck!
[14:07:31] <Lucas_WMDE>	 \o/
[14:07:36] * TheresNoTime is half-around & watching, please ping if there's anything!
[14:08:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121080 (https://phabricator.wikimedia.org/T383800) (owner: 10Daimona Eaytoy)
[14:09:02] <wikibugs>	 (03Merged) 10jenkins-bot: Enable $wgCampaignEventsEnableEventInvitation on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121080 (https://phabricator.wikimedia.org/T383800) (owner: 10Daimona Eaytoy)
[14:09:33] <logmsgbot>	 !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1121080|Enable $wgCampaignEventsEnableEventInvitation on most wikis (T383800)]]
[14:09:36] <stashbot>	 T383800: Enable invitation lists by default (except Meta, ZH Wikipedia, and ES Wikipedia) - https://phabricator.wikimedia.org/T383800
[14:09:43] <wikibugs>	 (03CR) 10Fabian Kaelin: [C:03+1] airflow-research: allow task pods to reach out to gitlab.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121367 (https://phabricator.wikimedia.org/T386933) (owner: 10Brouberol)
[14:12:07] <wikibugs>	 (03CR) 10Ssingh: "Sorry for not following up on this -- I missed this in the review stack." [cookbooks] - 10https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: 10BCornwall)
[14:12:32] <logmsgbot>	 !log ihurbain@deploy2002 daimona, ihurbain: Backport for [[gerrit:1121080|Enable $wgCampaignEventsEnableEventInvitation on most wikis (T383800)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:12:46] <ihurbain>	 Daimona: if you have stuff to check on mwdebug, now's the time
[14:12:59] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] liberica: USE CAP_NET_RAW instead of CAP_NET_ADMIN for healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1121339 (owner: 10Vgutierrez)
[14:13:11] <Daimona>	 Yup, doing
[14:14:39] <Daimona>	 Looking good
[14:14:52] <ihurbain>	 then let's gooo
[14:14:57] <logmsgbot>	 !log ihurbain@deploy2002 daimona, ihurbain: Continuing with sync
[14:15:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121366 (owner: 10Jforrester)
[14:19:39] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:20:07] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:20:07] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:20:22] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-research: allow task pods to reach out to gitlab.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121367 (https://phabricator.wikimedia.org/T386933) (owner: 10Brouberol)
[14:21:15] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply
[14:21:35] <logmsgbot>	 !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121080|Enable $wgCampaignEventsEnableEventInvitation on most wikis (T383800)]] (duration: 12m 02s)
[14:21:38] <stashbot>	 T383800: Enable invitation lists by default (except Meta, ZH Wikipedia, and ES Wikipedia) - https://phabricator.wikimedia.org/T383800
[14:21:41] <ihurbain>	 there
[14:21:51] <ihurbain>	 Daimona: all good (... in theory :D )
[14:21:53] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply
[14:22:10] <wikibugs>	 (03CR) 10Majavah: [C:03+1] toolforge: add jobs-emailer stats gathering [puppet] - 10https://gerrit.wikimedia.org/r/1121364 (https://phabricator.wikimedia.org/T320284) (owner: 10David Caro)
[14:22:24] <Daimona>	 Yay, congrats on your first deployment :D
[14:22:29] <ihurbain>	 thank you \o/
[14:22:38] <ihurbain>	 well, now i can move forward with miiiine
[14:22:46] <wikibugs>	 (03CR) 10MVernon: [C:03+1] cassandra: setup 'dev' target for Cassandra 4.1.8 [puppet] - 10https://gerrit.wikimedia.org/r/1121102 (https://phabricator.wikimedia.org/T385819) (owner: 10Eevans)
[14:22:48] <wikibugs>	 (03CR) 10Bking: [C:03+2] Fix typo in opensearch-analysis-stconvert [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1121317 (owner: 10DCausse)
[14:22:48] <wikibugs>	 06SRE, 10Maps, 06Traffic, 13Patch-For-Review: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10567788 (10ssingh) @MSantos: Any update on this? Thanks!
[14:22:52] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] Fix typo in opensearch-analysis-stconvert [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1121317 (owner: 10DCausse)
[14:23:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:23:37] <ihurbain>	 :looks suspicious:
[14:23:44] <wikibugs>	 (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: 10Arlolra)
[14:24:10] <wikibugs>	 (03CR) 10David Caro: [V:03+1 C:03+2] toolforge: add jobs-emailer stats gathering [puppet] - 10https://gerrit.wikimedia.org/r/1121364 (https://phabricator.wikimedia.org/T320284) (owner: 10David Caro)
[14:24:16] <logmsgbot>	 !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1120679|Turn on Parsoid Read Views for 27 wiktionaries (T386762)]]
[14:24:19] <stashbot>	 T386762: Parsoid Read Views to Wiktionary deploy ~2025-02-20 - https://phabricator.wikimedia.org/T386762
[14:27:11] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:27:19] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:27:22] <logmsgbot>	 !log ihurbain@deploy2002 arlolra, ihurbain: Backport for [[gerrit:1120679|Turn on Parsoid Read Views for 27 wiktionaries (T386762)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:28:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 21.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:29:21] <ihurbain>	 we are parsoided on canary, continuing.
[14:29:24] <logmsgbot>	 !log ihurbain@deploy2002 arlolra, ihurbain: Continuing with sync
[14:30:07] <TheresNoTime>	 (congrats on your first deploy by the way!)
[14:30:13] <ihurbain>	 thank you :D
[14:32:41] <wikibugs>	 (03PS3) 10Andrew Bogott: vendordata.txt: include rudimentary clouds.yaml in initial VM [puppet] - 10https://gerrit.wikimedia.org/r/1120683 (https://phabricator.wikimedia.org/T379030)
[14:32:41] <wikibugs>	 (03PS8) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030)
[14:32:42] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369
[14:32:51] <ihurbain>	 tgr|away: just to make sure i understand, you're backporting a set of 3 patches to .16 (currently group 2) and .17 (currently group 0 and 1), and all these (the 6 patches for both branches) can go in a single scap?
[14:33:09] <tgr|away>	 yeah, they all need to go together
[14:33:17] <ihurbain>	 ack :)
[14:33:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369 (owner: 10Andrew Bogott)
[14:34:34] <wikibugs>	 (03PS2) 10Andrew Bogott: cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369
[14:34:34] <wikibugs>	 (03PS4) 10Andrew Bogott: vendordata.txt: include rudimentary clouds.yaml in initial VM [puppet] - 10https://gerrit.wikimedia.org/r/1120683 (https://phabricator.wikimedia.org/T379030)
[14:34:34] <wikibugs>	 (03PS9) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030)
[14:34:51] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "It shouldn't. If it does, it means I made a mistake in this patch, and should be clear from diffs in this change's CI" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto)
[14:36:16] <logmsgbot>	 !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120679|Turn on Parsoid Read Views for 27 wiktionaries (T386762)]] (duration: 12m 00s)
[14:36:20] <stashbot>	 T386762: Parsoid Read Views to Wiktionary deploy ~2025-02-20 - https://phabricator.wikimedia.org/T386762
[14:36:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369 (owner: 10Andrew Bogott)
[14:37:18] <wikibugs>	 (03PS1) 10Ayounsi: Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370
[14:37:35] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene)
[14:37:40] <ihurbain>	 also, i do have logspam-watch running, i'm keeping an eye on it, and it doesn't look RIDICULOUS, but that's basically the only check i'm doing on that
[14:37:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:46] <ihurbain>	 (if anything else let me know)
[14:37:55] <ihurbain>	 tgr|away: starting your scap now
[14:37:57] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-worker1003.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[14:37:58] <wikibugs>	 (03PS2) 10Ayounsi: Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370
[14:38:07] <wikibugs>	 (03PS1) 10Jgreen: Add fundraising-analytics hostgroup and two new checks to nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121371 (https://phabricator.wikimedia.org/T386259)
[14:38:14] <wikibugs>	 (03PS3) 10Andrew Bogott: cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369 (https://phabricator.wikimedia.org/T379030)
[14:38:15] <wikibugs>	 (03PS5) 10Andrew Bogott: vendordata.txt: include rudimentary clouds.yaml in initial VM [puppet] - 10https://gerrit.wikimedia.org/r/1120683 (https://phabricator.wikimedia.org/T379030)
[14:38:15] <wikibugs>	 (03PS10) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030)
[14:38:22] <wikibugs>	 (03CR) 10Volans: "Actually a better fix is doing something like I77af7f4aab59572f2a93ffd82d78d7027b67a41f" [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi)
[14:38:46] <wikibugs>	 (03Merged) 10jenkins-bot: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene)
[14:38:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121328 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[14:38:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121329 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[14:38:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121330 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[14:38:53] <tgr|away>	 yeah, logspam-watch or the mediawiki-errors logstash dashboard is all you need
[14:38:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121332 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[14:38:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121333 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza)
[14:38:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121334 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza)
[14:39:42] <ihurbain>	 tgr|away: my remark was more about "i'm not sure i'd be able to catch anything that's not entirely ridiculous but also not normal"
[14:39:48] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10567848 (10Jclark-ctr)
[14:41:07] <apergos>	 it's also the patch owner's responsibility to keep an eye out for errors as they test.   
[14:41:18] <apergos>	 speaking of which, I oughta pull up the dashboard
[14:41:22] <ihurbain>	 :D
[14:41:29] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] trafficserver: use testwiki PCS without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1121350 (https://phabricator.wikimedia.org/T385719) (owner: 10Hnowlan)
[14:42:14] <tgr|away>	 there's an mwdebug-specific dashboard which is more useful for testing
[14:42:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] Add fundraising-analytics hostgroup and two new checks to nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121371 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen)
[14:42:57] <tgr|away>	 the error dashboard / logspam-watch probably won't tell you something is wrong until it hits production
[14:43:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi)
[14:43:18] <tgr|away>	 though it's very rare that that happens, scap has a bunch of canary checks built in
[14:43:21] <apergos>	 that dashboard is mentioned specifically in the backport deployer's docs. I have both that and the prod one pulled up
[14:44:02] <ihurbain>	 ah yes indeed
[14:44:46] <tgr|away>	 in the bad old days where deploying was more of a manual process, you could break things by e.g. syncing files in the wrong order and that caused huge error spikes
[14:45:18] <apergos>	 I deployed during those bad old days, and they were bad. 
[14:45:26] <wikibugs>	 (03Merged) 10jenkins-bot: Restore "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121328 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[14:45:44] <ihurbain>	 i'm happy to live in the good new days, then. (well, as far as deployments are concerned.)
[14:45:47] <tgr|away>	 these days if something is non-broken enough to pass the scap canary tests, any problems it causes probably won't be obvious
[14:46:32] <apergos>	 deployers these days.  with their single command deploys. back in my day we had to walk barefoot through the snow to deploy... uphill... in both directions :-P
[14:46:45] <tgr|away>	 good to watch the logspam just in case, but it has been years since it last helped me catch a bug
[14:46:56] <James_F>	 apergos: And we were grateful! ;-)
[14:46:59] <inflatador>	 !log bking@apt1002:~/pkg$  sudo -E reprepro -C component/opensearch13 include bullseye-wikimedia $HOME/pkg/wmf-opensearch-search-plugins_1.3.20-1_amd64.changes T380752
[14:47:00] <apergos>	 lolol
[14:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:02] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[14:47:08] <ihurbain>	 :D
[14:47:13] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225
[14:47:14] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547
[14:47:14] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548
[14:47:27] <wikibugs>	 (03Merged) 10jenkins-bot: SharedDomainUtils: Avoid early instantiation of NamespaceInfo [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121330 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[14:47:29] <wikibugs>	 (03Merged) 10jenkins-bot: Make sure isSul3Enabled() is a boolean [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121333 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza)
[14:47:30] <wikibugs>	 (03Merged) 10jenkins-bot: Restore "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121329 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[14:47:31] <wikibugs>	 (03Merged) 10jenkins-bot: SharedDomainUtils: Avoid early instantiation of NamespaceInfo [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121332 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza)
[14:47:51] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "I will add fixtures later, that should help catch issues like the one we had with dependencies." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto)
[14:47:58] <James_F>	 OTOH, back when I was deploying ~10 times a day I could slip out a minor config tweak to all production in < 45 seconds. No fancy commands, no safety systems, no atomic roll-outs, no canaries, and nothing to slow things down.
[14:48:05] <ihurbain>	 (one left to merge on the stack of 6)
[14:48:13] <James_F>	 Nowadays the very fastest deploys take ~6 mins.
[14:48:35] * James_F shakes his walking stick at the sky.
[14:48:43] <apergos>	 but it's been a long time since anyone's earned The Shirt for a deploy (at least, I think it's been a long time)
[14:48:54] <ihurbain>	 don't jinx meeee :D
[14:49:03] <apergos>	 taking it all back right now :-)
[14:49:06] <James_F>	 Indeed. Fingers crossed, etc.
[14:49:32] <wikibugs>	 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10567903 (10Jclark-ctr)
[14:50:02] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1003.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[14:50:22] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1004.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[14:51:33] <wikibugs>	 (03Merged) 10jenkins-bot: Make sure isSul3Enabled() is a boolean [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121334 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza)
[14:52:08] <logmsgbot>	 !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1121328|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121329|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121330|SharedDomainUtils: Avoid early instantiation of NamespaceInfo (T386836)]], [[gerrit:1121332|SharedDomainUtils: Avoid early instantiation of NamespaceInfo (T386836)]], [[gerrit:1121333|Make sure isSul3Enabled() is a boolean (T384549)]], [[gerrit:1121334|Make sure isSul3Enabled() is a boolean (T384549)]]
[14:52:15] <stashbot>	 T386836: Wikibase CI broken with several errors - https://phabricator.wikimedia.org/T386836
[14:52:15] <stashbot>	 T384549: Create a per-user flag for enabling SUL3 - https://phabricator.wikimedia.org/T384549
[14:53:45] <vgutierrez>	 !log upload liberica 0.8 to apt.wm.o (bookworm-wikimedia)
[14:53:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:20] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto)
[14:55:05] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto)
[14:55:07] <logmsgbot>	 !log ihurbain@deploy2002 tgr, ihurbain: Backport for [[gerrit:1121328|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121329|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121330|SharedDomainUtils: Avoid early instantiation of NamespaceInfo (T386836)]], [[gerrit:1121332|SharedDomainUtils: Avoid early instantiation of NamespaceInfo (T386836)]], [[gerrit:1121333|Make sure isSul3Enabled() is a boolean (T384549)]], [[gerrit:1121334|Make sure isSul3Enabled() is a boolean (T384549)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:55:16] <vgutierrez>	 !log testing liberica 0.8 in lvs1013
[14:55:17] <ihurbain>	 tgr|away: canary time!
[14:55:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:24] <tgr|away>	 ihurbain: looking, might take a bit
[14:56:44] <ihurbain>	 i actually have a few log lines popping on mwdebug logstash, i'm assuming this is "transient restart" stuff, but please confirm
[14:58:13] <tgr|away>	 usually it just means someone is using the WikimediaDebug extension
[14:58:38] <tgr|away>	 unlike the production errors dashboard, it's not filtered by severity
[14:58:39] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=inactive:weight=5; selector: name=wikikube-worker2001.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[14:58:47] <ihurbain>	 ack
[14:59:49] <tgr|away>	 the errors are all session cache failures, which is outside MediaWiki so I don't think it can be caused by an MW deploy
[15:00:26] <apergos>	 a few of these are 503s (failed to store some session) but those are all at 14:54 so
[15:00:30] <apergos>	 I think it's ok
[15:00:32] <ihurbain>	 let's say that if there's "deploy around centralauth" and "stuff that talks about sessions in the logs", it feels worth checking :D
[15:01:00] <tgr|away>	 though it's also an error that I don't see on production at all, so not sure what's up with that
[15:01:15] <tgr|away>	 but the patches aren't related to session handling
[15:01:42] <tgr|away>	 I guess let's see if it happens more or was a one-time fluke
[15:02:29] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker2001.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[15:02:29] <apergos>	 so far these are 14:54, 14:55, 6 of them total, that's it.
[15:02:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:03:06] <ihurbain>	 php was restarted at 14:54
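Editor's note: the check being done by eye here is whether every mwdebug error falls within a short window around the 14:54 php restart. A minimal sketch of that comparison follows, using only the minute values quoted in the channel. The function name `within_restart_window` and the 2-minute tolerance are assumptions made for illustration; they are not part of scap, logstash, or any Wikimedia tooling.

```python
from datetime import datetime, timedelta


def within_restart_window(error_times, restart_time, tolerance_minutes=2):
    """Return True if every error timestamp (HH:MM) lies within
    tolerance_minutes of the restart time, before or after it."""
    fmt = "%H:%M"
    restart = datetime.strptime(restart_time, fmt)
    tolerance = timedelta(minutes=tolerance_minutes)
    return all(
        abs(datetime.strptime(t, fmt) - restart) <= tolerance
        for t in error_times
    )


# Minutes at which the errors were seen, as reported in the channel.
error_minutes = ["14:54", "14:55"]
print(within_restart_window(error_minutes, "14:54"))
```

If later errors kept appearing well after the restart, the same call would return False, which is the "see if it happens more or was a one-time fluke" test described above.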
[15:05:11] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-02-19-134350 to 2025-02-20-140756 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121380 (https://phabricator.wikimedia.org/T383448)
[15:05:12] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-02-19-135838 to 2025-02-20-142923 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121381 (https://phabricator.wikimedia.org/T383448)
[15:06:10] <ihurbain>	 also, for the sake of communication: yes, the backport window is running over a bit
[15:06:45] <apergos>	 couple warnings, probably side effects of the tests
[15:07:10] <apergos>	 (token mismatch, couldn't find global id, one of each)
[15:08:01] <tgr|away>	 gah, WikimediaDebug automatically disabling itself is so annoying
[15:08:09] <apergos>	 oh does it? woops
[15:08:18] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet
[15:14:15] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10568007 (10elukey) I've updated the Broadcom 3908's firmware on ms-be2088 as indicated by Supermicro, since the changelog shows some JB...
[15:16:07] <apergos>	 Expectation (readQueryRows <= 10000) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: 12437) in trx #ab165eaf69: SELECT pi_property_id,pi_info FROM `wb_property_info`        
[15:16:25] <apergos>	 that's just now and the only thing possibly of interest imo
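Editor's note: the warning pasted above carries three useful values, namely the expectation name, its limit, and the actual count. The sketch below pulls them out of a line with that wording. The regular expression is an assumption based solely on the single line quoted in the channel, and `parse_expectation` is a hypothetical helper, not a MediaWiki or logstash API.

```python
import re

# Matches the "Expectation (<name> <= <limit>) by <caller> not met (actual: <n>)"
# wording seen in the pasted line; other message variants are not handled.
EXPECTATION_RE = re.compile(
    r"Expectation \((?P<name>\w+) <= (?P<limit>\d+)\) by (?P<caller>\S+) "
    r"not met \(actual: (?P<actual>\d+)\)"
)


def parse_expectation(line):
    """Return the expectation name, caller, limit, actual value and excess,
    or None if the line does not match the assumed wording."""
    match = EXPECTATION_RE.search(line)
    if match is None:
        return None
    limit = int(match["limit"])
    actual = int(match["actual"])
    return {
        "name": match["name"],
        "caller": match["caller"],
        "limit": limit,
        "actual": actual,
        "excess": actual - limit,
    }


line = (
    r"Expectation (readQueryRows <= 10000) by "
    r"MediaWiki\Actions\ActionEntryPoint::execute not met (actual: 12437) "
    r"in trx #ab165eaf69: SELECT pi_property_id,pi_info FROM `wb_property_info`"
)
print(parse_expectation(line))
```

For the quoted line this reports a limit of 10000 against an actual count of 12437, so the query read 2437 rows more than the expectation allows.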
[15:16:27] <tgr|away>	 well, it's not really doing what I'd expect but it's not breaking anything either
[15:16:45] <tgr|away>	 maybe I'm just misremembering how it needs to be configured
[15:17:27] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker2002.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[15:17:38] <tgr|away>	 anyhow I think that's good enough to deploy - unless apergos wants to do more checks
[15:17:59] <apergos>	 did you change the config locally on one of the mwdebug instances? or...?
[15:18:25] <Lucas_WMDE>	 apergos: that one is known, one sec
[15:18:26] <apergos>	 I mean, if it's 0/0 and always on the target wiki, you shouldn't see any behavioural change 
[15:18:34] <Lucas_WMDE>	 (the wb_property_info I mean)
[15:18:35] <tgr|away>	 "Couldn't find a global ID for user Tgr-test-c1121328"
[15:18:43] <tgr|away>	 I guess that would explain it
[15:19:12] <tgr|away>	 not related to these patches though, something seems to be broken in CentralAuthUser caching
[15:19:13] <Lucas_WMDE>	 T349511 is that one messae0
[15:19:13] <stashbot>	 T349511: [LIB] [TECH] Wikibase reads too many wb_property_info rows at once (expectation readQueryRows <= 10000 not met) - https://phabricator.wikimedia.org/T349511
[15:19:15] <Lucas_WMDE>	 *message
[15:19:25] <apergos>	 thanks Lucas_WMDE
[15:19:34] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2088.codfw.wmnet
[15:19:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:19:48] <tgr|away>	 Lucas_WMDE: I checked Wikidata and it seemed fine, not sure if you want to check anything more specific
[15:19:57] <Lucas_WMDE>	 I’ll take a quick look
[15:20:03] <Lucas_WMDE>	 but if it didn’t completely crash I’d guess it’s okay
[15:20:13] <apergos>	 just... call us paranoid :-))
[15:20:28] <inflatador>	 !log bking@apt1002:~/pkg$  sudo -E reprepro  -C component/opensearch13 remove bullseye-wikimedia wmf-opensearch-search-plugins T380752
[15:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:32] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[15:20:43] <inflatador>	 !log bking@apt1002:~/pkg$  sudo -E reprepro -C component/opensearch13 include bullseye-wikimedia $HOME/pkg/wmf-opensearch-search-plugins_1.3.20-1_amd64.changes (again)T380752
[15:20:44] <Lucas_WMDE>	 editing works https://www.wikidata.org/w/index.php?title=Q4115189&diff=prev&oldid=2314331748
[15:20:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:56] <apergos>	 mwdebug right? Lucas_WMDE
[15:20:59] <Lucas_WMDE>	 yeah
[15:21:09] <apergos>	 good enough for me then, let's keep going
[15:21:16] <ihurbain>	 continuing with sync?
[15:21:22] <Lucas_WMDE>	 server
[15:21:22] <Lucas_WMDE>	 	mw-debug.codfw.pinkunicorn-85b7df9765-n5sxp
[15:21:27] <Lucas_WMDE>	 yeah, go ahead imho :)
[15:21:31] <ihurbain>	 ship it!
[15:21:34] <logmsgbot>	 !log ihurbain@deploy2002 tgr, ihurbain: Continuing with sync
[15:21:42] <Lucas_WMDE>	 (that was a response header, firefox copied it on two separate lines 🤷)
[15:21:48] <apergos>	 lol
[15:23:48] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrus: drop cirrus_saneitize_jobs periodic job (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1113741 (owner: 10DCausse)
[15:24:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10568060 (10phaultfinder)
[15:25:29] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 248, active_shards: 497, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli
[15:25:29] <icinga-wm>	 h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[15:28:11] <logmsgbot>	 !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121328|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121329|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121330|SharedDomainUtils: Avoid early instantiation of NamespaceInfo (T386836)]], [[gerrit:1121332|SharedDomainUtils: Avoid early instantiation of NamespaceInfo (T386836)]], [[gerrit:1121333|Make sure isSul3Enabled() is a boolean (T384549)]], [[gerrit:1121334|Make sure isSul3Enabled() is a boolean (T384549)]] (duration: 36m 02s)
[15:28:15] <stashbot>	 T386836: Wikibase CI broken with several errors - https://phabricator.wikimedia.org/T386836
[15:28:15] <stashbot>	 T384549: Create a per-user flag for enabling SUL3 - https://phabricator.wikimedia.org/T384549
[15:28:33] <ihurbain>	 there!
[15:28:52] <tgr|away>	 thank you ihurbain!
[15:29:21] <apergos>	 congrats on your first deploy and running the entire window! hope you do more of it soon!
[15:29:41] <ihurbain>	 !log UTC afternoon deploys done
[15:29:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:01] <ihurbain>	 there! thank y'all for the support :)
[15:31:38] <Lucas_WMDE>	 ihurbain: congrats on your first deployment window \o/
[15:32:16] <ihurbain>	 \o/
[15:32:22] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943 (10MGA73) 03NEW
[15:33:55] <wikibugs>	 (03PS1) 10Jforrester: [wikifunctionswiki] Give wikilambda-bypass-cache to staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432)
[15:34:12] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432) (owner: 10Jforrester)
[15:34:15] <apergos>	 I'll be at train log triage in half an hour, so if anything weird pops up there, I'll report back
[15:37:14] <tgr|away>	 the one potential problem I can think of is generating too much DB load, since the global preference lookups now happen on all wikis
[15:37:37] <tgr|away>	 I'll see if there's a dashboard for checking that
[15:43:59] <Lucas_WMDE>	 does anyone mind if I deploy one more config change? it should be a no-op (config cleanup)
[15:44:05] <Lucas_WMDE>	 cc apergos, dancy, andre for the upcoming window
[15:44:15] <dancy>	 OK w/ me.
[15:44:28] <apergos>	 no objections here
[15:44:37] <Lucas_WMDE>	 thanks!
[15:44:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor)
[15:45:04] <Lucas_WMDE>	 did a quick check with rg to confirm that the old option name doesn’t appear anywhere except in wmf-config/Wikibase.php
[15:45:16] <Lucas_WMDE>	 (i.e. not in the existing branch directories on deploy1002)
[15:45:29] <wikibugs>	 (03Merged) 10jenkins-bot: Remove `tmpEnableMulLanguageCode` setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor)
[15:45:56] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1115016|Remove `tmpEnableMulLanguageCode` setting (T330217)]]
[15:45:59] <stashbot>	 T330217: MUL - Cleanup soft rollout flag - https://phabricator.wikimedia.org/T330217
[15:46:44] <arturo>	 !log update k9s in bookworm-wikimedia thirdparty/k9s to 0.40.5
[15:46:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 arthurtaylor, lucaswerkmeister-wmde: Backport for [[gerrit:1115016|Remove `tmpEnableMulLanguageCode` setting (T330217)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:49:01] <Lucas_WMDE>	 testing
[15:49:16] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943#10568176 (10MatthewVernon) The problem with the thumbnail is that the image is malformed. If I download it and open it in GIMP, it says: ` Error loading PNG file: IDAT: in...
[15:49:41] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 arthurtaylor, lucaswerkmeister-wmde: Continuing with sync
[15:49:42] <Lucas_WMDE>	 lgtm
[15:50:35] <wikibugs>	 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943#10568177 (10MatthewVernon) [I would expect the usual thing to do would be to upload the fixed image as a new version of the broken one, rather than trying to move somethin...
[15:51:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster  - bking@cumin2002 - T380752
[15:51:31] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster  - bking@cumin2002 - T380752
[15:51:34] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[15:51:37] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker2003.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[15:56:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115016|Remove `tmpEnableMulLanguageCode` setting (T330217)]] (duration: 10m 43s)
[15:56:43] <stashbot>	 T330217: MUL - Cleanup soft rollout flag - https://phabricator.wikimedia.org/T330217
[15:56:44] * Lucas_WMDE done deploying
[15:58:51] <apergos>	 and 5 minutes until train log triage, so that's perfect
[15:59:09] <wikibugs>	 (03PS2) 10Scott French: aptrepo: add component/pcre2 for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1120586 (https://phabricator.wikimedia.org/T386006)
[15:59:09] <wikibugs>	 (03PS2) 10Scott French: package_builder: add pbuilder hook for pcre2 component [puppet] - 10https://gerrit.wikimedia.org/r/1120587 (https://phabricator.wikimedia.org/T386006)
[15:59:09] <wikibugs>	 (03PS1) 10Scott French: aptrepo: update pcre2 backport from apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006)
[16:00:05] <jouncebot>	 dancy and andre: Time to snap out of that daydream and deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1600).
[16:03:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:03:41] <vgutierrez>	 !log upload liberica 0.9 to apt.wm.o (bookworm-wikimedia)
[16:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:15] <wikibugs>	 (03CR) 10Scott French: "I think this should be the last step once the stable-to-bullseye backports are available in apt-staging. Other than my actually going and " [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French)
[16:08:32] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10568237 (10Pppery) a:05Ben.buchenau→03None
[16:08:37] <wikibugs>	 (03CR) 10Herron: [C:03+1] hiera: restore thanos retention settings [puppet] - 10https://gerrit.wikimedia.org/r/1121324 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi)
[16:09:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10568239 (10phaultfinder)
[16:09:54] <wikibugs>	 (03PS1) 10Jgreen: Fix hostgroup and alpha order for analytics role passive checks in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259)
[16:10:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Fix hostgroup and alpha order for analytics role passive checks in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen)
[16:10:46] <vgutierrez>	 !log updating liberica to version 0.10 in ulsfo load balancers
[16:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:17] <wikibugs>	 (03PS1) 10Elukey: services: update Kartotherian's replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121391 (https://phabricator.wikimedia.org/T386926)
[16:12:45] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=inactive:weight=5; selector: name=wikikube-worker2003.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl
[16:13:00] <logmsgbot>	 !log elukey@puppetserver1001 conftool action : set/pooled=inactive:weight=5; selector: name=wikikube-worker1004.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[16:13:36] <tarrow>	 anyone around who could rotate some accidentally leaked phabricator bot credentials for me?
[16:16:52] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+2] Bust cache for recreated pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118890 (https://phabricator.wikimedia.org/T386244) (owner: 10Arlolra)
[16:17:18] <wikibugs>	 (03CR) 10SBassett: [C:03+1] [wikifunctionswiki] Give wikilambda-bypass-cache to staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432) (owner: 10Jforrester)
[16:17:36] <wikibugs>	 (03PS2) 10Jgreen: Fix hostgroup and alpha order for analytics passive checks in nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259)
[16:18:02] <wikibugs>	 (03Merged) 10jenkins-bot: Bust cache for recreated pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118890 (https://phabricator.wikimedia.org/T386244) (owner: 10Arlolra)
[16:18:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Fix hostgroup and alpha order for analytics passive checks in nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen)
[16:20:47] <wikibugs>	 (03PS3) 10Jgreen: Fix hostgroup and order for analytics checks in nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259)
[16:23:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:27:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:27:21] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:30:46] <tarrow>	 ^^ all done in T386949 for people following along
[16:31:03] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.137.0" for 204 host(s)
[16:32:23] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] Fix hostgroup and order for analytics checks in nsca_frack.cfg.erb. [puppet] - https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259) (owner: Jgreen)
[16:32:51] <wikibugs>	 ops-codfw, DC-Ops: Install test Mellanox nic into sretest2001 - https://phabricator.wikimedia.org/T386951 (RobH) NEW p:Triage→High
[16:33:12] <wikibugs>	 ops-codfw, DC-Ops: Install test Mellanox nic into sretest2001 - https://phabricator.wikimedia.org/T386951#10568366 (RobH)
[16:33:28] <wikibugs>	 (PS1) Vgutierrez: liberica: run cp checks periodically [puppet] - https://gerrit.wikimedia.org/r/1121394
[16:34:15] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121394 (owner: Vgutierrez)
[16:35:34] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.137.0" completed for 204 hosts
[16:40:42] <wikibugs>	 (PS1) ZhaoFJx: zhwiki: Create abusefilter editor group on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879)
[16:41:22] <wikibugs>	 (CR) Ssingh: [C:+1] liberica: run cp checks periodically (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1121394 (owner: Vgutierrez)
[16:43:18] <wikibugs>	 (CR) Vgutierrez: [C:+2] liberica: run cp checks periodically [puppet] - https://gerrit.wikimedia.org/r/1121394 (owner: Vgutierrez)
[16:45:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:48:47] <wikibugs>	 (CR) ZhaoFJx: zhwiki: Create abusefilter editor group on zhwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: ZhaoFJx)
[16:49:03] <logmsgbot>	 !log arlolra@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply
[16:49:51] <logmsgbot>	 !log arlolra@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[16:50:22] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[16:50:52] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: ZhaoFJx)
[16:53:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[16:55:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:55:35] <logmsgbot>	 !log arlolra@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[16:56:22] <mutante>	 !log phab1004 (phabricator) - systemctl stop phabricator_stats_job_mfa_check timer and service; systemctl (gerrit:1117489)
[16:56:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:40] <logmsgbot>	 !log arlolra@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[17:00:04] <jouncebot>	 jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1700).
[17:00:05] <jouncebot>	 Krinkle: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:21] <rzl>	 o/
[17:00:52] <Krinkle>	 o/
[17:02:06] <wikibugs>	 (CR) RLazarus: [C:+2] mediawiki: Add rewrite rule to fix serving of /.well-known static files [puppet] - https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: Krinkle)
[17:02:58] <rzl>	 I'll deploy to metal mwdebug2001 first just because it's easy, then scap stopping at kubernetes debug hosts, then everywhere
[17:03:14] <Krinkle>	 Ack
[17:03:20] <rzl>	 transient httpbb alerts are expected
[17:04:17] <Krinkle>	 https://auth.wikimedia.beta.wmflabs.org/.well-known/assetlinks.json
[17:04:30] <Krinkle>	 https://auth.wikimedia.org/.well-known/assetlinks.json
[17:04:43] <Krinkle>	 I'll be looking at that one on mwdebug in a minute
[17:04:55] <logmsgbot>	 !log arlolra@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[17:05:20] <rzl>	 live at mwdebug2001
[17:05:31] <logmsgbot>	 !log arlolra@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[17:05:46] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:06:43] <rzl>	 httpbb passes
[17:08:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:08:47] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[17:10:09] <Krinkle>	 rzl: LGTM
[17:10:14] <rzl>	 👍
[17:10:57] <logmsgbot>	 !log rzl@deploy2002 Started scap sync-world: T385520
[17:11:00] <stashbot>	 T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520
[17:12:26] <logmsgbot>	 !log rzl@deploy2002 rzl: T385520 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:12:52] <rzl>	 httpbb's still passing on k8s-mwdebug, go ahead and test
[17:13:14] <wikibugs>	 (PS1) Vgutierrez: liberica: Fix liberica cp check job [puppet] - https://gerrit.wikimedia.org/r/1121401
[17:13:31] <Krinkle>	 LGTM on k8s mwdebug
[17:13:36] <logmsgbot>	 !log rzl@deploy2002 rzl: Continuing with sync
[17:13:40] <wikibugs>	 (CR) Ssingh: [C:+1] liberica: Fix liberica cp check job [puppet] - https://gerrit.wikimedia.org/r/1121401 (owner: Vgutierrez)
[17:13:58] <wikibugs>	 (CR) Vgutierrez: [C:+2] liberica: Fix liberica cp check job [puppet] - https://gerrit.wikimedia.org/r/1121401 (owner: Vgutierrez)
[17:14:10] <icinga-wm>	 PROBLEM - Disk space on netflow1002 is CRITICAL: DISK CRITICAL - free space: / 0MiB (0% inode=93%): /tmp 0MiB (0% inode=93%): /var/tmp 0MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netflow1002&var-datasource=eqiad+prometheus/ops
[17:16:23] <wikibugs>	 (CR) Eevans: [C:+2] cassandra: setup 'dev' target for Cassandra 4.1.8 [puppet] - https://gerrit.wikimedia.org/r/1121102 (https://phabricator.wikimedia.org/T385819) (owner: Eevans)
[17:19:23] <logmsgbot>	 !log rzl@deploy2002 Finished scap sync-world: T385520 (duration: 09m 01s)
[17:19:27] <stashbot>	 T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520
[17:19:43] <Krinkle>	 worksforme in prod now
[17:19:54] <rzl>	 sweet
[17:20:07] <rzl>	 thanks for writing an httpbb test, that made everything easy
[17:20:14] <Krinkle>	 :)
[17:20:27] <rzl>	 puppet window complete! gavel gavel
[17:21:30] <dancy>	 Like the Law & Order sound?
[17:22:51] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev2001.codfw.wmnet: Upgrade to Cassandra 4.1.8 — T385819 - eevans@cumin1002
[17:23:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:24:24] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047 - https://phabricator.wikimedia.org/T386083#10568570 (Jhancock.wm) a:VRiley-WMF
[17:24:35] <rzl>	 dancy: remember the big freaky klingon gavel from Star Trek VI? closer to that
[17:24:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383213#10568572 (Jhancock.wm) a:VRiley-WMF
[17:24:59] <rzl>	 (also --k8s-confirm-diff is still in good shape, thanks for that work)
[17:25:16] <dancy>	 https://memory-alpha.fandom.com/wiki/Gavel?file=Klingon_Magistrate.jpg
[17:25:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, serviceops: decommission  mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10568578 (Jhancock.wm) a:VRiley-WMF
[17:25:53] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047 - https://phabricator.wikimedia.org/T386083#10568579 (Andrew) Resolved→Invalid I'm putting this host back in service and closing the...
[17:26:15] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1121404
[17:26:29] <rzl>	 oh my god of *course* there's a whole article on Gavel
[17:26:44] <rzl>	 we do love a star trek courtroom drama, I guess that makes sense
[17:29:48] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev2001.codfw.wmnet: Upgrade to Cassandra 4.1.8 — T385819 - eevans@cumin1002
[17:29:51] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[17:29:54] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[17:37:19] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev200[2-3].codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002
[17:38:24] <wikibugs>	 (PS1) ZhaoFJx: cowikimedia: Change the workmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872)
[17:38:44] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872) (owner: ZhaoFJx)
[17:39:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10568629 (phaultfinder)
[17:41:17] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959 (BCornwall) NEW
[17:42:08] <mutante>	 cccccbukvgbchuuebhklecbdrrbhvulvgeliecljvdvb
[17:42:10] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568650 (BCornwall)
[17:42:16] <wikibugs>	 ops-esams, ops-magru, SRE, DC-Ops, and 2 others: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10568651 (BCornwall)
[17:42:59] <mutante>	 yea, that button is way too close to the keyboard now with the nano
[17:43:05] <wikibugs>	 (CR) BCornwall: "Filed at https://phabricator.wikimedia.org/T386959" [cookbooks] - https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: BCornwall)
[17:47:33] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[17:47:37] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[17:50:22] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1003 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[17:51:06] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev200[2-3].codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002
[17:51:49] <wikibugs>	 (PS1) Vgutierrez: sre: Provide LibericaDiffFPCheck alert [alerts] - https://gerrit.wikimedia.org/r/1121409
[17:53:00] <wikibugs>	 (CR) CI reject: [V:-1] sre: Provide LibericaDiffFPCheck alert [alerts] - https://gerrit.wikimedia.org/r/1121409 (owner: Vgutierrez)
[17:53:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[17:55:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:55:51] <wikibugs>	 (PS1) BryanDavis: developer-portal: Bump container to 2025-02-17-122018-production [deployment-charts] - https://gerrit.wikimedia.org/r/1121411
[17:59:40] <wikibugs>	 (CR) BryanDavis: [C:+2] developer-portal: Bump container to 2025-02-17-122018-production [deployment-charts] - https://gerrit.wikimedia.org/r/1121411 (owner: BryanDavis)
[18:00:05] <jouncebot>	 bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1800).
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1800)
[18:00:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:00:54] <wikibugs>	 (Merged) jenkins-bot: developer-portal: Bump container to 2025-02-17-122018-production [deployment-charts] - https://gerrit.wikimedia.org/r/1121411 (owner: BryanDavis)
[18:09:52] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:10:04] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:10:08] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:10:13] <logmsgbot>	 !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:10:54] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:10:58] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.214 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:11:09] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[18:14:53] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloud-vps instance: populate /etc/openstack/project_id [puppet] - https://gerrit.wikimedia.org/r/1121369 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[18:17:57] <logmsgbot>	 !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[18:18:05] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:18:24] <logmsgbot>	 !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[18:21:41] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568826 (ssingh) Hi @wiki_willy: adding you based on our discussion so you can triage it accordingly, thanks!
[18:23:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[18:25:11] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341#10568841 (Jhancock.wm) a:Jhancock.wm
[18:28:01] <wikibugs>	 (PS2) Ssingh: sre: Provide LibericaDiffFPCheck alert [alerts] - https://gerrit.wikimedia.org/r/1121409 (owner: Vgutierrez)
[18:29:42] <wikibugs>	 (CR) CI reject: [V:-1] sre: Provide LibericaDiffFPCheck alert [alerts] - https://gerrit.wikimedia.org/r/1121409 (owner: Vgutierrez)
[18:29:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10568871 (phaultfinder)
[18:30:59] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "no effect on prod puppetmasters  https://puppet-compiler.wmflabs.org/output/1121079/4960/" [puppet] - https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: Dzahn)
[18:32:56] <wikibugs>	 (PS3) Ssingh: sre: Provide LibericaDiffFPCheck alert [alerts] - https://gerrit.wikimedia.org/r/1121409 (owner: Vgutierrez)
[18:33:30] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "Should have said "puppetservers" --> https://puppet-compiler.wmflabs.org/output/1121079/4961/" [puppet] - https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: Dzahn)
[18:37:32] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "prod: puppetserver1001, puppetmaster1003, puppetmaster2001 - noop confirmed" [puppet] - https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: Dzahn)
[18:41:31] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10568908 (Dzahn) Arthur clarified the new key is a yubikey key. I advised to first have this added in addition to the existing key and test things.  And that we sho...
[18:41:57] <mutante>	 cccccbukvgbcvdetuvfrvbuudkljcjrnljgbhjnedkee
[18:42:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster  - bking@cumin2002 - T380752:
[18:42:08] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster  - bking@cumin2002 - T380752:
[18:42:11] <stashbot>	 T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752
[18:44:33] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "cloud: puppetmaster-1003.devtools - noop confirmed" [puppet] - https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: Dzahn)
[18:44:59] <wikibugs>	 (CR) Ssingh: "Hi @dzahn@wikimedia.org: I am guessing there is another related commit for this to be followed up in setting cache::alternate_domains; wou" [puppet] - https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: Dzahn)
[18:45:30] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "This fixed the previous puppet error for any new puppetserver in cloud but it's just on to the next issue now. ""Unable to create director" [puppet] - https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: Dzahn)
[18:45:53] <wikibugs>	 (CR) Ssingh: [C:+1] "Looks good but I was wondering if we should just put in the dashboard links before merging this so that we don't forget." [alerts] - https://gerrit.wikimedia.org/r/1121409 (owner: Vgutierrez)
[18:46:23] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "seems the puppetserver module has never been tested on cloud before" [puppet] - https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: Dzahn)
[18:47:19] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568916 (wiki_willy) a:RobH Thanks for creating this task @ssingh.  @RobH - can open up a Dell Tech Support ticket to get one of the technicians out to magru and see if they can figure out what might...
[18:48:08] <wikibugs>	 (PS1) Stevemunene: Fix team name typo for hadoop worker [alerts] - https://gerrit.wikimedia.org/r/1121415 (https://phabricator.wikimedia.org/T386900)
[18:48:32] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568925 (ssingh) Thanks Willy! @BCornwall has been leading this from the Traffic team and will be the point of contact for this.
[18:49:15] <wikibugs>	 (CR) Dzahn: "@sukhe There is not, so far. This is only in response to https://phabricator.wikimedia.org/T274228#10529569 to create the mere possibility" [puppet] - https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: Dzahn)
[18:50:09] <wikibugs>	 (CR) Dzahn: "in this form this is only meant to be "allow a new option in varnish that did not exist before" and that's it." [puppet] - https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: Dzahn)
[18:50:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:51:20] <wikibugs>	 (CR) Ssingh: "Thanks" [puppet] - https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: Dzahn)
[18:51:44] <wikibugs>	 (CR) Ssingh: "(This looks good to me for what it's worth.)" [puppet] - https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: Dzahn)
[18:57:35] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568948 (RobH) It was my understanding this issue was resolved by the new temp profile settings on T373993, do we still need to open a case on this?
[19:00:04] <jouncebot>	 dancy and andre: That opportune time for a MediaWiki train - Utc-7+Utc-0 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1900).
[19:00:13] <andre>	 uh uh
[19:00:14] <dancy>	 o/
[19:00:33] * dancy presses the button.
[19:00:39] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.44.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1121417 (https://phabricator.wikimedia.org/T382368)
[19:00:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1121417 (https://phabricator.wikimedia.org/T382368) (owner: TrainBranchBot)
[19:01:02] <mutante>	 dancy: sounds like spiderpig :o
[19:01:24] <dancy>	 It's on my todo list.
[19:01:46] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.44.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1121417 (https://phabricator.wikimedia.org/T382368) (owner: TrainBranchBot)
[19:11:26] <logmsgbot>	 !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.17  refs T382368
[19:11:30] <stashbot>	 T382368: 1.44.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T382368
[19:12:38] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on netflow1002.eqiad.wmnet with reason: keeping gnmic running in debug mode to observe performance change
[19:14:10] <icinga-wm>	 RECOVERY - Disk space on netflow1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netflow1002&var-datasource=eqiad+prometheus/ops
[19:16:04] <wikibugs>	 (CR) Andrew Bogott: [C:+2] realm.pp: remove use of $labsproject [puppet] - https://gerrit.wikimedia.org/r/1121343 (owner: Andrew Bogott)
[19:16:13] <wikibugs>	 (CR) Andrew Bogott: [C:+2] rename validatelabsfqdn.py to validatecloudvpsfqdn.py [puppet] - https://gerrit.wikimedia.org/r/1121342 (owner: Andrew Bogott)
[19:18:42] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121342 (owner: Andrew Bogott)
[19:19:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:20:32] <wikibugs>	 SRE, Data-Engineering, DPE-Mediawiki-Content, Dumps-Generation, Epic: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10569062 (Ahoelzl)
[19:23:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:23:46] <wikibugs>	 ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10569088 (wiki_willy) Hey @RobH - Sukhbir and I were talking at the offsite after the fix was implemented.  While increasing the fan speed helped specifically in this scenario, the other sites are able to...
[19:24:14] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:24:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10569097 (phaultfinder)
[19:25:14] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:27:06] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance backend on cp4038 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:28:06] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance backend on cp4038 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:30:45] <wikibugs>	 (CR) Kamila Součková: [C:+1] services: update Kartotherian's replicas to 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1121391 (https://phabricator.wikimedia.org/T386926) (owner: Elukey)
[19:31:29] <wikibugs>	 (PS2) Andrew Bogott: validatecloudvpsfqdn.py: Support projects with project_name in fqdn [puppet] - https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030)
[19:31:29] <wikibugs>	 (PS4) Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030)
[19:31:29] <wikibugs>	 (PS4) Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030)
[19:31:30] <wikibugs>	 (PS4) Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - https://gerrit.wikimedia.org/r/1121347
[19:31:31] <wikibugs>	 (PS1) Andrew Bogott: validatecloudvpsfqdn.py: Only support projects with project_name in fqdn [puppet] - https://gerrit.wikimedia.org/r/1121423 (https://phabricator.wikimedia.org/T379030)
[19:35:55] <wikibugs>	 (PS3) Andrew Bogott: validatecloudvpsfqdn.py: Support projects with project_name in fqdn [puppet] - https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030)
[19:35:55] <wikibugs>	 (PS5) Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030)
[19:35:55] <wikibugs>	 (PS5) Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030)
[19:35:56] <wikibugs>	 (PS5) Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - https://gerrit.wikimedia.org/r/1121347
[19:35:57] <wikibugs>	 (PS2) Andrew Bogott: validatecloudvpsfqdn.py: Only support projects with project_name in fqdn [puppet] - https://gerrit.wikimedia.org/r/1121423 (https://phabricator.wikimedia.org/T379030)
[19:37:22] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121347 (owner: Andrew Bogott)
[19:37:27] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121423 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[19:37:32] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[19:37:35] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[19:37:38] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[19:41:36] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943#10569165 (MGA73) Thank you for checking this! I moved the file to retain edit history and to test if the thumb would work on Commons (there are a number of similar files...
[19:44:31] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943#10569176 (MGA73) Open→Resolved a:MGA73
[19:46:34] <James_F>	 All quiet; I'm going to deploy a service update.
[19:46:51] <wikibugs>	 (CR) Jforrester: [C:+2] wikifunctions: Upgrade orchestrator from 2025-02-19-134350 to 2025-02-20-140756 [deployment-charts] - https://gerrit.wikimedia.org/r/1121380 (https://phabricator.wikimedia.org/T383448) (owner: Jforrester)
[19:48:22] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-02-19-134350 to 2025-02-20-140756 [deployment-charts] - https://gerrit.wikimedia.org/r/1121380 (https://phabricator.wikimedia.org/T383448) (owner: Jforrester)
[19:49:36] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[19:50:09] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[19:50:51] <logmsgbot>	 !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on netflow1002.eqiad.wmnet with reason: keeping gnmic running in debug mode to observe performance change
[19:51:40] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[19:52:26] <logmsgbot>	 !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[19:52:31] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[19:53:22] <logmsgbot>	 !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[19:54:42] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:54:55] <wikibugs>	 (CR) Jforrester: [C:+2] wikifunctions: Upgrade evaluators from 2025-02-19-135838 to 2025-02-20-142923 [deployment-charts] - https://gerrit.wikimedia.org/r/1121381 (https://phabricator.wikimedia.org/T383448) (owner: Jforrester)
[19:56:09] <wikibugs>	 (Merged) jenkins-bot: wikifunctions: Upgrade evaluators from 2025-02-19-135838 to 2025-02-20-142923 [deployment-charts] - https://gerrit.wikimedia.org/r/1121381 (https://phabricator.wikimedia.org/T383448) (owner: Jforrester)
[20:01:25] <wikibugs>	 (PS1) Eevans: aqs2001: upgrade to Cassandra 4.1.8 (canary) [puppet] - https://gerrit.wikimedia.org/r/1121428 (https://phabricator.wikimedia.org/T386969)
[20:02:09] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121428 (https://phabricator.wikimedia.org/T386969) (owner: Eevans)
[20:07:32] <wikibugs>	 (CR) Eevans: [C:+2] aqs2001: upgrade to Cassandra 4.1.8 (canary) [puppet] - https://gerrit.wikimedia.org/r/1121428 (https://phabricator.wikimedia.org/T386969) (owner: Eevans)
[20:09:35] <wikibugs>	 (PS1) Aklapper: Remove an unused array variable [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121429
[20:10:36] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] Remove an unused array variable [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121429 (owner: Aklapper)
[20:10:54] <logmsgbot>	 !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919
[20:10:58] <stashbot>	 T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919
[20:11:30] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs2001.codfw.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002
[20:18:55] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs2001.codfw.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002
[20:24:40] <wikibugs>	 (PS1) Ahmon Dancy: Use buildkit:wmf-v0.20.0-2 on WMCS and trusted runners [puppet] - https://gerrit.wikimedia.org/r/1121432 (https://phabricator.wikimedia.org/T386955)
[20:42:37] <wikibugs>	 (PS1) Aklapper: Rename $editScore to $transaction_score [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121435
[20:43:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:43:36] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] Rename $editScore to $transaction_score [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121435 (owner: Aklapper)
[20:46:16] <wikibugs>	 (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1121324 (https://phabricator.wikimedia.org/T357747) (owner: Filippo Giunchedi)
[20:48:30] <wikibugs>	 (PS15) Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945)
[20:49:08] <wikibugs>	 (CR) Dzahn: [C:+2] Use buildkit:wmf-v0.20.0-2 on WMCS and trusted runners [puppet] - https://gerrit.wikimedia.org/r/1121432 (https://phabricator.wikimedia.org/T386955) (owner: Ahmon Dancy)
[20:49:16] <wikibugs>	 (CR) Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: Ahmon Dancy)
[20:51:46] <wikibugs>	 (CR) Dzahn: [C:+2] admin: upgrade arthurtaylor from restricted to deployment [puppet] - https://gerrit.wikimedia.org/r/1121088 (https://phabricator.wikimedia.org/T386349) (owner: Dzahn)
[20:52:16] <mutante>	 !log welcome new deployer Arthur Taylor (T386349)
[20:52:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:26] <wikibugs>	 (CR) Jforrester: [C:-1] zhwiki: Create abusefilter editor group on zhwiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: ZhaoFJx)
[20:54:10] <wikibugs>	 (PS1) Cathal Mooney: Update policy for K8s BGP to allow a wider range of v4 prefixes [homer/public] - https://gerrit.wikimedia.org/r/1121438 (https://phabricator.wikimedia.org/T375845)
[20:54:20] <mutante>	 !log logmsgbot: are you logging?
[20:54:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:46] <mutante>	 but not to phab tickets anymore
[20:58:16] <James_F>	 mutante: Maybe the token expired for the bot?
[20:58:30] <James_F>	 ISTR that was an issue with a different tool this week.
[20:59:22] <wikibugs>	 (CR) ZhaoFJx: zhwiki: Create abusefilter editor group on zhwiki (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: ZhaoFJx)
[20:59:27] <mutante>	 sounds like a possibility, ack
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T2100). Please do the needful.
[21:00:05] <jouncebot>	 James_F and ZhaoFJx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:11] <James_F>	 I can deploy.
[21:00:29] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10569335 (Dzahn) In progress→Resolved a:Dzahn ` 20:52 < mutante> !log welcome new deployer Arthur Taylor (T386349)  `   ` [deploy1003:~] $ id arthur...
[21:00:33] <wikibugs>	 (CR) Jforrester: [C:+2] Re-update function-schemata sub-module to HEAD (39b22ad) [extensions/WikiLambda] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1121366 (owner: Jforrester)
[21:01:37] <ZhaoFJx>	 Thanks
[21:02:19] <wikibugs>	 (PS2) Jforrester: cowikimedia: Change the wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872) (owner: ZhaoFJx)
[21:02:33] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872) (owner: ZhaoFJx)
[21:03:20] <wikibugs>	 (Merged) jenkins-bot: cowikimedia: Change the wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872) (owner: ZhaoFJx)
[21:03:40] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1121405|cowikimedia: Change the wordmark (T386872)]]
[21:03:44] <stashbot>	 T386872: Requesting logo change for co.wikimedia.org - https://phabricator.wikimedia.org/T386872
[21:05:00] <wikibugs>	 (Merged) jenkins-bot: Re-update function-schemata sub-module to HEAD (39b22ad) [extensions/WikiLambda] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1121366 (owner: Jforrester)
[21:06:24] <logmsgbot>	 !log jforrester@deploy2002 jforrester, zhaofjx: Backport for [[gerrit:1121405|cowikimedia: Change the wordmark (T386872)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:06:48] <James_F>	 ZhaoFJx: Deployed and it looks "fine" assuming that's what they wanted – can you confirm?
[21:07:41] <ZhaoFJx>	 not good on my side (k8s-mwdebug)
[21:07:54] <James_F>	 What's wrong from your end?
[21:08:35] <ZhaoFJx>	 there are two wikimedia logos
[21:08:36] <James_F>	 The logo duplication?
[21:08:42] <ZhaoFJx>	 yep
[21:08:51] <James_F>	 Yes, is that not what they wanted?
[21:09:03] <wikibugs>	 (PS2) ZhaoFJx: zhwiki: Create abusefilter editor group on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879)
[21:09:19] <ZhaoFJx>	 I believe no
[21:09:26] * James_F sighs.
[21:09:32] <ZhaoFJx>	 alas
[21:09:38] <James_F>	 OK, we can revert and they can say what file they /actually/ want?
[21:09:46] <logmsgbot>	 !log jforrester@deploy2002 Sync cancelled.
[21:10:03] <ZhaoFJx>	 I will ask them on phabricator
[21:10:07] <wikibugs>	 (PS1) Jforrester: Revert "cowikimedia: Change the wordmark" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121440
[21:10:10] <James_F>	 Thanks!
[21:10:15] <wikibugs>	 (CR) Jforrester: [C:+2] Revert "cowikimedia: Change the wordmark" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121440 (owner: Jforrester)
[21:10:42] <ZhaoFJx>	 Could you also check the zhwiki one? I just updated the patchset
[21:10:43] <ZhaoFJx>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1121395
[21:10:54] <wikibugs>	 (Merged) jenkins-bot: Revert "cowikimedia: Change the wordmark" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121440 (owner: Jforrester)
[21:10:55] <James_F>	 Yeah, looks good, will deploy now.
[21:11:03] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: ZhaoFJx)
[21:11:42] <wikibugs>	 (Merged) jenkins-bot: zhwiki: Create abusefilter editor group on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: ZhaoFJx)
[21:11:57] <wikibugs>	 (CR) Cathal Mooney: "LGTM in general, one question in line." [cookbooks] - https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: BCornwall)
[21:12:13] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1121395|zhwiki: Create abusefilter editor group on zhwiki (T386879)]]
[21:12:17] <stashbot>	 T386879: Create abusefilter editor group on zhwiki - https://phabricator.wikimedia.org/T386879
[21:14:54] <logmsgbot>	 !log jforrester@deploy2002 jforrester, zhaofjx: Backport for [[gerrit:1121395|zhwiki: Create abusefilter editor group on zhwiki (T386879)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:15:24] <ZhaoFJx>	 All good!
[21:15:25] <James_F>	 ZhaoFJx: How's that look for you on debug?
[21:15:27] <James_F>	 Excellent.
[21:15:29] <logmsgbot>	 !log jforrester@deploy2002 jforrester, zhaofjx: Continuing with sync
[21:16:06] <James_F>	 https://zh.wikipedia.org/wiki/Wikipedia:Abusefilter-editor should get filled in at some point. :-)
[21:16:34] <ZhaoFJx>	 And I will call them for i18n soon
[21:16:38] <ZhaoFJx>	 thanks for the mention
[21:16:39] <James_F>	 Brilliant.
[21:17:57] <wikibugs>	 ops-codfw, ops-eqiad, DC-Ops, Data-Platform-SRE (2025.02.10 - 2025.02.28): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10569355 (bking) Open→In progress p:Triage→Medium a:bking
[21:20:11] <wikibugs>	 ops-codfw, ops-eqiad, DC-Ops, Data-Platform-SRE (2025.02.10 - 2025.02.28): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10569364 (bking) Hello DC Ops,  I've created [[ https://docs.google.com/spreadsheets/d/1DfzoKJM...
[21:22:08] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121395|zhwiki: Create abusefilter editor group on zhwiki (T386879)]] (duration: 09m 54s)
[21:22:12] <stashbot>	 T386879: Create abusefilter editor group on zhwiki - https://phabricator.wikimedia.org/T386879
[21:22:23] <James_F>	 ZhaoFJx: All done for you, I think? Sorry that the first one didn't work out.
[21:22:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432) (owner: Jforrester)
[21:22:47] <wikibugs>	 (PS1) RLazarus: deployment_server: Support multiple Kubernetes configs in mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1121443 (https://phabricator.wikimedia.org/T378429)
[21:23:30] <wikibugs>	 (Merged) jenkins-bot: [wikifunctionswiki] Give wikilambda-bypass-cache to staff [mediawiki-config] - https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432) (owner: Jforrester)
[21:23:37] <ZhaoFJx>	 Yep
[21:23:41] <ZhaoFJx>	 Thanks for deployment
[21:23:49] <logmsgbot>	 !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1121385|[wikifunctionswiki] Give wikilambda-bypass-cache to staff (T379432)]]
[21:23:52] <ZhaoFJx>	 James_F have a good one
[21:23:53] <stashbot>	 T379432: Create a way to temporarily bypass the results cache on production - https://phabricator.wikimedia.org/T379432
[21:23:57] <James_F>	 ZhaoFJx: And you!
[21:24:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:26:34] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1121385|[wikifunctionswiki] Give wikilambda-bypass-cache to staff (T379432)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:26:52] <logmsgbot>	 !log jforrester@deploy2002 jforrester: Continuing with sync
[21:29:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:33:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:33:24] <logmsgbot>	 !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121385|[wikifunctionswiki] Give wikilambda-bypass-cache to staff (T379432)]] (duration: 09m 34s)
[21:33:28] <stashbot>	 T379432: Create a way to temporarily bypass the results cache on production - https://phabricator.wikimedia.org/T379432
[21:39:15] <wikibugs>	 (PS1) Aklapper: Do not lower score when setting customfield [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121446
[21:40:11] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] Do not lower score when setting customfield [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121446 (owner: Aklapper)
[21:43:20] <icinga-wm>	 RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:43:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1103:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1103 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:48:01] <wikibugs>	 (PS1) Aklapper: Sort recent user transactions by newest first [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121447
[21:48:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1103:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1103 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:49:05] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] Sort recent user transactions by newest first [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121447 (owner: Aklapper)
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T2200)
[22:06:21] <logmsgbot>	 !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919
[22:06:25] <stashbot>	 T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919
[22:07:35] <wikibugs>	 (PS1) Ebrahim: Improve Persian Wikipedia's tagline and wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1121449
[22:11:06] <wikibugs>	 (PS1) Jdrewniak: Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1121450 (https://phabricator.wikimedia.org/T386495)
[22:13:20] <icinga-wm>	 PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:14:52] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3347 MB (3% inode=98%): /tmp 3347 MB (3% inode=98%): /var/tmp 3347 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:15:29] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1121450 (https://phabricator.wikimedia.org/T386495) (owner: Jdrewniak)
[22:21:53] <wikibugs>	 ops-codfw, ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.02.10 - 2025.02.28): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10569604 (bking)
[22:22:28] <wikibugs>	 (Merged) jenkins-bot: Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1121450 (https://phabricator.wikimedia.org/T386495) (owner: Jdrewniak)
[22:22:45] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1121450|Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds (T386495)]]
[22:22:48] <stashbot>	 T386495: Fix session tick mixin relating to when events fire - https://phabricator.wikimedia.org/T386495
[22:25:25] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:1121450|Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds (T386495)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:28:16] <wikibugs>	 (PS1) RLazarus: deployment_server: Read mwscript-k8s MW image from values, not kube API [puppet] - https://gerrit.wikimedia.org/r/1121455 (https://phabricator.wikimedia.org/T378429)
[22:30:02] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak: Continuing with sync
[22:36:37] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121450|Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds (T386495)]] (duration: 13m 52s)
[22:36:41] <stashbot>	 T386495: Fix session tick mixin relating to when events fire - https://phabricator.wikimedia.org/T386495
[22:40:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[22:45:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[22:51:41] <wikibugs>	 (PS2) Ebrahim: Improve Persian Wikipedia's tagline and wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1121449
[23:03:02] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1005 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:03:02] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1007 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:11:38] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1006 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:20:17] <wikibugs>	 (PS1) Aklapper: Penalize removing all subscribers and edges [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121461
[23:25:58] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] Penalize removing all subscribers and edges [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121461 (owner: Aklapper)
[23:33:06] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[23:33:14] <icinga-wm>	 PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:33:42] <icinga-wm>	 PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[23:35:14] <icinga-wm>	 RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:35:14] <wikibugs>	 (PS1) Aklapper: Differentiate more on account age [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121464
[23:35:52] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] Differentiate more on account age [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121464 (owner: Aklapper)
[23:49:08] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 86.41 ms
[23:53:44] <icinga-wm>	 RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.46 ms
[23:55:27] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Reuven!" [puppet] - https://gerrit.wikimedia.org/r/1121443 (https://phabricator.wikimedia.org/T378429) (owner: RLazarus)