[00:08:04] (PS2) Eevans: cassandra: setup 'dev' target for Cassandra 4.1.8 [puppet] - https://gerrit.wikimedia.org/r/1121102 (https://phabricator.wikimedia.org/T385819)
[00:08:36] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121102 (https://phabricator.wikimedia.org/T385819) (owner: Eevans)
[00:10:44] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:31:06] (CR) Cwhite: [C:+1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/1120923 (owner: Filippo Giunchedi)
[00:38:40] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1121115
[00:38:40] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1121115 (owner: TrainBranchBot)
[00:48:38] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1121115 (owner: TrainBranchBot)
[00:49:56] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:50:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:50:22] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:04:14] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:07:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:08:35] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1121116
[01:08:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1121116 (owner: TrainBranchBot)
[01:09:14] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:11:18] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:13:18] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:15:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:17:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:21:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:26:14] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:30:05] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1121116 (owner: TrainBranchBot)
[01:32:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:38:14] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:40:18] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:40:54] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:41:18] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:43:54] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:44:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:44:56] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:45:14] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:45:22] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:46:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/416455e1d3b20ea1bf708c9423206d03c25f3f045cc4ad254c29b7c6955e1ea2/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:47:56] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:48:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:48:22] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:50:22] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:51:14] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:53:22] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:54:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:55:14] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:00:18] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:01:24] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:03:14] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:06:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:09:44] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:10:16] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:12:58] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:13:22] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:21:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:26:56] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:27:22] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:28:16] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:33:56] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:34:16] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:34:20] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:34:22] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:46] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10566388 (phaultfinder)
[03:32:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:57:21] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[04:06:22] (PS2) KartikMistry: Update cxserver to 2025-02-20-032928-production [deployment-charts] - https://gerrit.wikimedia.org/r/1120709 (https://phabricator.wikimedia.org/T386677)
[04:08:38] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:52:21] RESOLVED: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[05:07:22] Deploying cxserver..
[05:07:30] (CR) KartikMistry: [C:+2] Update cxserver to 2025-02-20-032928-production [deployment-charts] - https://gerrit.wikimedia.org/r/1120709 (https://phabricator.wikimedia.org/T386677) (owner: KartikMistry)
[05:08:37] (Merged) jenkins-bot: Update cxserver to 2025-02-20-032928-production [deployment-charts] - https://gerrit.wikimedia.org/r/1120709 (https://phabricator.wikimedia.org/T386677) (owner: KartikMistry)
[05:14:08] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:14:33] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:31:18] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:31:47] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:33:25] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:33:59] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:34:35] !log Updated cxserver to 2025-02-20-032928-production (T386677, T386464)
[05:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:39] T386677: Automatic translation failed error when translating from de -> en using CX - https://phabricator.wikimedia.org/T386677
[05:34:40] T386464: Post-creation work for sylwiki - https://phabricator.wikimedia.org/T386464
[05:49:18] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 206582224 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:49:52] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1257/IPv6: Connect - Tele2, AS1257/IPv4: Connect - Tele2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:50:18] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 110760 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:08:22] (PS1) Stevemunene: Port disk space check for hadoop worker to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080)
[06:09:57] (CR) CI reject: [V:-1] Port disk space check for hadoop worker to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: Stevemunene)
[06:12:58] (PS2) Stevemunene: Port disk space check for hadoop worker to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080)
[06:16:26] PROBLEM - Exim SMTP on lists1004 is CRITICAL: connect to address 208.80.154.81 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Exim
[06:19:32] RECOVERY - Exim SMTP on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Exim
[06:23:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:30:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:28] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 185.15.59.129, interfaces up: 67, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:52:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0700)
[07:00:05] marostegui, Amir1, and federico3: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0700).
[07:05:19] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1036.eqiad.wmnet with reason: remove from cluster for reimage
[07:05:31] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566767 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9ff89e50-cdd1-449a-a676-876c36729c2f) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(...
[07:08:19] (CR) Vgutierrez: [C:+1] haproxykafka: limit memory usage to 5% of total physical memory [puppet] - https://gerrit.wikimedia.org/r/1120922 (https://phabricator.wikimedia.org/T386747) (owner: Fabfur)
[07:08:31] (CR) Muehlenhoff: [C:+2] Switch ganeti1036 to nftables [puppet] - https://gerrit.wikimedia.org/r/1120934 (owner: Muehlenhoff)
[07:10:22] (CR) Vgutierrez: [C:+2] aptrepo,haproxy: Allow installing HAProxy 1.3 on bullseye [puppet] - https://gerrit.wikimedia.org/r/1120926 (https://phabricator.wikimedia.org/T386796) (owner: Vgutierrez)
[07:12:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 185.15.59.129, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:06] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1036.eqiad.wmnet
[07:23:35] (PS5) Giuseppe Lavagetto: mediawiki: introduce feature flags [deployment-charts] - https://gerrit.wikimedia.org/r/1116639
[07:23:35] (PS4) Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - https://gerrit.wikimedia.org/r/1117225
[07:23:35] (PS3) Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - https://gerrit.wikimedia.org/r/1117547
[07:23:36] (PS4) Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548
[07:23:52] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 69, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:26:04] (CR) CI reject: [V:-1] Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548 (owner: Giuseppe Lavagetto)
[07:27:08] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1025.eqiad.wmnet to cluster eqiad and group A
[07:29:45] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1025.eqiad.wmnet to cluster eqiad and group A
[07:42:58] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904 (Ben.buchenau) NEW
[07:48:51] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ganeti1036.eqiad.wmnet
[07:57:26] (PS3) Isabelle Hurbain-Palatin: Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[07:58:11] (CR) CI reject: [V:-1] Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[07:59:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1036.eqiad.wmnet with OS bookworm
[07:59:28] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1036.eqiad.wmnet with OS bookworm
[07:59:41] (PS4) Isabelle Hurbain-Palatin: Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[08:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:07:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[08:17:48] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:18:34] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566806 (MoritzMuehlenhoff)
[08:20:42] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1036.eqiad.wmnet with reason: host reimage
[08:20:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1036.eqiad.wmnet with reason: host reimage
[08:21:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet
[08:22:37] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566807 (ops-monitoring-bot) Draining ganeti1026.eqiad.wmnet of running VMs
[08:23:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet
[08:25:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1026.eqiad.wmnet
[08:26:09] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566810 (ops-monitoring-bot) Draining ganeti1026.eqiad.wmnet of running VMs
[08:28:25] !log installing ruby2.7 security updates
[08:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:37] (PS1) Elukey: services: bump allocated memory for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1121309 (https://phabricator.wikimedia.org/T386648)
[08:33:35] (CR) Fabfur: [C:+2] haproxykafka: limit memory usage to 5% of total physical memory [puppet] - https://gerrit.wikimedia.org/r/1120922 (https://phabricator.wikimedia.org/T386747) (owner: Fabfur)
[08:34:48] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:36:29] (CR) Elukey: [C:+2] services: bump allocated memory for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1121309 (https://phabricator.wikimedia.org/T386648) (owner: Elukey)
[08:37:28] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[08:37:38] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[08:37:54] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync
[08:38:52] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync
[08:38:59] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[08:39:31] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[08:42:23] (CR) Jgiannelos: Turn on Parsoid Read Views for 27 wiktionaries (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[08:42:32] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-worker1002*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[08:42:57] !log uploaded haproxy 3.1.3 to thirdparty/haproxy31 - T386796
[08:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:00] T386796: Evaluate HAProxy 3.1 - https://phabricator.wikimedia.org/T386796
[08:44:14] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1002.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl
[08:46:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1036.eqiad.wmnet with OS bookworm
[08:46:40] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10566861 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1036.eqiad.wmnet with OS bookworm completed: - ganeti103...
[08:52:13] (PS1) Brouberol: opensearch:cirrus: add the opensearch- prefix to some plugins [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752)
[08:52:41] (PS2) Brouberol: opensearch:cirrus: add the opensearch- prefix to some plugins [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752)
[08:53:35] (CR) Brouberol: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4959/co" [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752) (owner: Brouberol)
[09:00:05] dancy and andre: Your horoscope predicts another MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0900).
[09:01:32] (PS1) Elukey: services: update cpu resources for kartotherian's mesh/statsd containers [deployment-charts] - https://gerrit.wikimedia.org/r/1121315 (https://phabricator.wikimedia.org/T386648)
[09:04:48] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:07:41] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10566894 (WMDE-leszek) I confirm @Ben.buchenau 's affiliation with WMDE, and approve the request on WMDE's end. While you're at it, mind adding Ben's account to the `wm...
[09:09:32] (PS1) Arturo Borrero Gonzalez: prometheus: node_kernel_messages: ensure /etc/prometheus exists [puppet] - https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850)
[09:12:19] (PS1) DCausse: Fix typo in opensearch-analysis-stconvert [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1121317
[09:13:17] (CR) DCausse: [C:+1] opensearch:cirrus: add the opensearch- prefix to some plugins [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752) (owner: Brouberol)
[09:14:08] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: Arturo Borrero Gonzalez)
[09:19:02] (CR) Majavah: "most of the prometheus-*-exporter packages do provision this dir, I would maybe depend on `Package['prometheus-node-exporter']` instead as" [puppet] - https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: Arturo Borrero Gonzalez)
[09:20:59] (CR) Arturo Borrero Gonzalez: "I don't think that would be deterministic enough :-(" [puppet] - https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: Arturo Borrero Gonzalez)
[09:23:48] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:24:20] (CR) Filippo Giunchedi: "recheck" [alerts] - https://gerrit.wikimedia.org/r/1120923 (owner: Filippo Giunchedi)
[09:26:43] (PS1) Urbanecm: beta: Do not undeclare wmgGEActiveExperiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846)
[09:27:38] (CR) Urbanecm: [V:+1] "Expected new variables show in https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/3571/console." [mediawiki-config] - https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: Urbanecm)
[09:27:41] (CR) Brouberol: [V:+1 C:+2] opensearch:cirrus: add the opensearch- prefix to some plugins [puppet] - https://gerrit.wikimedia.org/r/1121312 (https://phabricator.wikimedia.org/T380752) (owner: Brouberol)
[09:28:39] ops-esams, ops-magru, SRE, DC-Ops, and 2 others: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10566946 (elukey) @BCornwall the easiest way is probably to use test-cookbook on a cumin host, using a depooled magru cp node as target. Once we are sure that the settin...
[09:29:45] (CR) Elukey: "Hi! Added a comment to the task. I'd prefer that we tested this via test-cookbook on a single magru cp node, to verify the settings applie" [cookbooks] - https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: BCornwall)
[09:29:57] (CR) Filippo Giunchedi: [C:+2] o11y: promote thanos compact alerts to critical [alerts] - https://gerrit.wikimedia.org/r/1120923 (owner: Filippo Giunchedi)
[09:30:03] (CR) Brouberol: [C:+1] Fix typo in opensearch-analysis-stconvert [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1121317 (owner: DCausse)
[09:33:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1036.eqiad.wmnet
[09:33:28] (CR) Michael Große: [C:+1] "Thinks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: Urbanecm)
[09:33:54] (CR) Michael Große: [C:+1] "*Thanks" [mediawiki-config] - https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: Urbanecm)
[09:34:54] SRE, Infrastructure-Foundations, netops, observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10566953 (fgiunchedi) >>! In T384731#10565308, @cmooney wrote: >>>! In T384731#10563685, @ayounsi wrote: >> Is it...
[09:36:00] jouncebot: nowandnext
[09:36:00] For the next 1 hour(s) and 23 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T0900)
[09:36:00] In 1 hour(s) and 23 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1100)
[09:40:05] (PS4) Vgutierrez: hiera,swift: Enable IPIP on ms-fe@codfw [puppet] - https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564)
[09:40:05] (PS3) Vgutierrez: hiera: Enable IPIP on ms-fe@eqiad [puppet] - https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564)
[09:40:29] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: Vgutierrez)
[09:40:56] (PS4) Vgutierrez: hiera: Enable IPIP on ms-fe@eqiad [puppet] - https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564)
[09:41:09] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) (owner: Vgutierrez)
[09:43:16] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:43:24] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:43:54] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:44:46] (CR) MVernon: [C:+1] hiera,swift: Enable IPIP on ms-fe@codfw [puppet] - https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: Vgutierrez)
[09:45:18] SRE, Infrastructure-Foundations, netops, observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10566958 (cmooney) Thanks for the update @fgiunchedi > >! In T384731#10566953, @fgiunchedi wrote: >> And what ha...
[09:47:42] (CR) Vgutierrez: [C:+2] hiera,swift: Enable IPIP on ms-fe@codfw [puppet] - https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: Vgutierrez)
[09:48:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti1036.eqiad.wmnet
[09:49:09] (PS1) Filippo Giunchedi: icinga: temp remove check for virt.cloudgw.eqiad1.wikimediacloud.org [puppet] - https://gerrit.wikimedia.org/r/1121319
[09:50:13] (CR) Filippo Giunchedi: [C:+2] icinga: temp remove check for virt.cloudgw.eqiad1.wikimediacloud.org [puppet] - https://gerrit.wikimedia.org/r/1121319 (owner: Filippo Giunchedi)
[09:51:24] !log enabling IPIP encapsulation for swift-fe@codfw - T385564
[09:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:28] T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564
[09:55:40] (CR) Brouberol: [C:+1] Port disk space check for hadoop worker to Alertmanager [alerts] - https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: Stevemunene)
[09:59:34] !log aborrero@cumin1002 START - Cookbook sre.dns.netbox
[10:07:06] (PS1) Muehlenhoff: Switch ganeti1026 to nftables [puppet] - https://gerrit.wikimedia.org/r/1121320
[10:08:20] (CR) Elukey: [C:+2] "Given how easy this is I'll proceed, please tell me if anything doesn't look ok and I'll amend :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1121315 (https://phabricator.wikimedia.org/T386648) (owner: Elukey)
[10:10:17] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[10:10:28] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[10:10:34] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync
[10:11:02] (CR) Filippo Giunchedi: "Nice, thank you! since this is essentially the same as DiskSpace in team-sre/resources.yaml (modulo "runbook" link) what we could also do " [alerts] - https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: Stevemunene)
[10:11:11] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync
[10:11:42] (CR) FNegri: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: Arturo Borrero Gonzalez)
[10:11:43] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync
[10:12:15] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync
[10:13:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:13:35] (CR) Arturo Borrero Gonzalez: [C:+2]
prometheus: node_kernel_messages: ensure /etc/prometheus exists [puppet] - 10https://gerrit.wikimedia.org/r/1121316 (https://phabricator.wikimedia.org/T386850) (owner: 10Arturo Borrero Gonzalez) [10:13:39] (03CR) 10Urbanecm: [V:03+1 C:03+2] "beta only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: 10Urbanecm) [10:14:23] (03Merged) 10jenkins-bot: beta: Do not undeclare wmgGEActiveExperiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121318 (https://phabricator.wikimedia.org/T386846) (owner: 10Urbanecm) [10:14:51] !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002" [10:14:56] !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002" [10:14:57] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:16:06] !log restarting pybal on lvs2014 - T385564 [10:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:10] T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564 [10:16:40] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1003.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:17:29] (03PS2) 10Urbanecm: [Growth] enwiki: Release Add Link to 15% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120925 (https://phabricator.wikimedia.org/T386029) [10:17:32] (03CR) 10Urbanecm: [C:03+2] [Growth] enwiki: Release Add Link to 15% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120925 (https://phabricator.wikimedia.org/T386029) (owner: 10Urbanecm) [10:18:17] (03Merged) 10jenkins-bot: [Growth] enwiki: Release Add Link to 15% of 
newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120925 (https://phabricator.wikimedia.org/T386029) (owner: 10Urbanecm) [10:18:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1026.eqiad.wmnet [10:18:58] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]] [10:19:02] T386029: Add a link (Structured task): Increase rollout on English Wikipedia to 15% - https://phabricator.wikimedia.org/T386029 [10:20:51] (03CR) 10Isabelle Hurbain-Palatin: Turn on Parsoid Read Views for 27 wiktionaries (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: 10Arlolra) [10:22:25] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:22:29] !log urbanecm@deploy2002 urbanecm: Continuing with sync [10:22:58] !log aborrero@cumin1002 START - Cookbook sre.dns.wipe-cache virt.cloudgw.eqiad1.wikimediacloud.org on all recursors [10:23:02] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) virt.cloudgw.eqiad1.wikimediacloud.org on all recursors [10:23:24] (03PS1) 10FNegri: prometheus::node_kernel_messages: ignore some false positives [puppet] - 10https://gerrit.wikimedia.org/r/1121321 (https://phabricator.wikimedia.org/T386850) [10:24:11] !log restarting pybal on lvs2013, effectively enabling IPIP encapsulation for swift-fe@codfw - T385564 [10:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:15] T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564 [10:25:02] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [10:25:51] (03PS1) 10Filippo Giunchedi: Revert "icinga: temp remove check for virt.cloudgw.eqiad1.wikimediacloud.org" [puppet] - 10https://gerrit.wikimedia.org/r/1121322 [10:26:39] (03CR) 10Filippo Giunchedi: [C:03+2] Revert "icinga: temp remove check for virt.cloudgw.eqiad1.wikimediacloud.org" [puppet] - 10https://gerrit.wikimedia.org/r/1121322 (owner: 10Filippo Giunchedi) [10:27:10] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow1002.eqiad.wmnet with reason: disabling gnmic in systemd [10:28:02] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:32:41] (03PS1) 10Filippo Giunchedi: hiera: restore thanos retention settings [puppet] - 10https://gerrit.wikimedia.org/r/1121324 (https://phabricator.wikimedia.org/T357747) [10:34:15] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1026.eqiad.wmnet with reason: remove from cluster for reimage [10:34:26] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10567066 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8efe0251-40ee-433b-a080-3bef582e4f79) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[10:34:33] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]] [10:34:37] T386029: Add a link (Structured task): Increase rollout on English Wikipedia to 15% - https://phabricator.wikimedia.org/T386029 [10:35:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 7.143% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:36:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext/next (k8s) 1.523s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:36:19] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1026 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1121320 (owner: 10Muehlenhoff) [10:37:42] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:37:47] !log urbanecm@deploy2002 urbanecm: Continuing with sync [10:38:01] (03CR) 10MVernon: [C:03+1] hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [10:40:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 23.21% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:41:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-api-ext/next (k8s) 1.523s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:41:43] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1026.eqiad.wmnet [10:43:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1036.eqiad.wmnet to cluster eqiad and group B [10:44:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1036.eqiad.wmnet to cluster eqiad and group B [10:44:24] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120925|[Growth] enwiki: Release Add Link to 15% of newcomers (T386029)]] (duration: 09m 50s) [10:44:28] T386029: Add a link (Structured task): Increase rollout on English Wikipedia to 15% - https://phabricator.wikimedia.org/T386029 [10:46:03] (03PS5) 10Vgutierrez: hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) [10:48:37] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [10:54:40] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on netflow1002.eqiad.wmnet with reason: keeping gnmic running in debug mode to observe performance change [10:57:42] 06SRE, 
10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10567098 (10MoritzMuehlenhoff) [10:58:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [10:59:23] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1121321 (https://phabricator.wikimedia.org/T386850) (owner: 10FNegri) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1100) [11:00:18] (03PS1) 10Gergő Tisza: Restore "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121328 (https://phabricator.wikimedia.org/T386836) [11:00:37] (03PS1) 10Gergő Tisza: Restore "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121329 (https://phabricator.wikimedia.org/T386836) [11:01:22] (03PS1) 10Gergő Tisza: SharedDomainUtils: Avoid early instantiation of NamespaceInfo [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121330 (https://phabricator.wikimedia.org/T386836) [11:02:09] (03PS1) 10Gergő Tisza: SharedDomainUtils: Avoid early instantiation of NamespaceInfo [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121332 (https://phabricator.wikimedia.org/T386836) [11:02:30] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=wikikube-worker1003.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [11:02:36] (03PS1) 10Gergő Tisza: Make sure isSul3Enabled() is a boolean [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121333 (https://phabricator.wikimedia.org/T384549) [11:03:09] (03PS1) 10Gergő Tisza: Make sure isSul3Enabled() is a boolean 
[extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121334 (https://phabricator.wikimedia.org/T384549) [11:03:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121328 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [11:03:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121329 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [11:04:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121330 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [11:04:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121332 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [11:05:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121333 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [11:05:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121334 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [11:07:35] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:07:46] (03PS1) 10Stevemunene: Create dse-k8s control panel partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) [11:07:47] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:08:06] !log restarting pybal on lvs1020 - T385564 [11:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:10] T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564 [11:08:31] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:08:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [11:08:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1026.eqiad.wmnet [11:09:59] !log restarting pybal on lvs1019, effectively enabling IPIP encapsulation for swift-fe@eqiad - T385564 [11:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:02] tgr|away: for me backporting the MediaWikiServices change would be okay [11:11:12] assuming we can reach agreement to merge it on master [11:14:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1124:9105 is 
missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1124 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:15:18] (03PS1) 10Vgutierrez: service: Switch swift and swift-https to maglev [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564) [11:15:47] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [11:20:33] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [11:25:19] (03CR) 10MVernon: [C:03+1] "I don't claim to understand the significance of moving to maglev from wrr, but this change looks to do what it says it does." [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [11:27:04] (03PS1) 10Sergio Gimeno: LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) [11:27:20] (03PS1) 10Sergio Gimeno: LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121338 (https://phabricator.wikimedia.org/T369551) [11:33:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:35:17] (03CR) 10Hnowlan: "lgtm with a but - we currently override php.servergroup at helmfile level for every mw-* deployment. Will that break these behaviours?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [11:35:19] (03PS1) 10Vgutierrez: liberica: USE CAP_NET_RAW instead of CAP_NET_ADMIN for healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1121339 [11:36:33] (03CR) 10CI reject: [V:04-1] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [11:39:23] (03CR) 10Vgutierrez: [C:03+2] service: Switch swift and swift-https to maglev [puppet] - 10https://gerrit.wikimedia.org/r/1121336 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [11:40:26] (03CR) 10Jgiannelos: [C:03+1] Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: 10Arlolra) [11:41:02] !log restarting pybal on lvs2014 - T385564 [11:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:06] T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564 [11:42:21] FIRING: ProbeDown: Service shellbox-syntaxhighlight:4014 has failed probes (http_shellbox-syntaxhighlight_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-syntaxhighlight:4014 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:35] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:43:54] !log restarting pybal on lvs2013, effectively switching swift-fe@codfw to maglev - T385564 [11:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:33] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:46:03] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:46:49] hrm, what just happened to shellbox-syntaxhighlight? [11:47:21] RESOLVED: ProbeDown: Service shellbox-syntaxhighlight:4014 has failed probes (http_shellbox-syntaxhighlight_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-syntaxhighlight:4014 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:49] !log restarting pybal on lvs1020 - T385564 [11:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:53] T385564: migrate swift/swift-https LB VIPs to IPIP encapsulation - https://phabricator.wikimedia.org/T385564 [11:48:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: OpenSent - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:48:43] uh? [11:49:03] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2004.codfw.wmnet, kubestage2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:49:32] I'm assuming that's bad timing with the pybal restart on lvs1020.. 
BGP session is back [11:51:12] !log restarting pybal on lvs1019, effectively switching swift-fe@eqiad to maglev - T385564 [11:51:14] (03PS1) 10Andrew Bogott: rename validatelabsfqdn.py to validatecloudvpsfqdn.py [puppet] - 10https://gerrit.wikimedia.org/r/1121342 [11:51:14] (03PS1) 10Andrew Bogott: realm.pp: remove use of $labsproject [puppet] - 10https://gerrit.wikimedia.org/r/1121343 [11:51:14] (03PS1) 10Andrew Bogott: validatecloudvpsfqdn.py: Support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030) [11:51:15] (03PS1) 10Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) [11:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:17] (03PS1) 10Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) [11:51:18] (03PS1) 10Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1121347 [11:52:15] (03CR) 10CI reject: [V:04-1] wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [11:54:21] (03CR) 10CI reject: [V:04-1] Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [11:54:55] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:54:59] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, 
excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:00:05] urbanecm, sergi0, and Cyndywikime: Time to do the Community Configuration migration deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1200). [12:00:11] RECOVERY - Host ms-be2075 is UP: PING WARNING - Packet loss = 77%, RTA = 33.31 ms [12:00:45] (03CR) 10Hnowlan: [C:03+1] Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto) [12:00:46] Hi [12:03:36] (03PS1) 10FNegri: prometheus::node_kernel_messages: add new line to ignore list [puppet] - 10https://gerrit.wikimedia.org/r/1121348 (https://phabricator.wikimedia.org/T386850) [12:04:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1124:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1124 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:04:41] (03CR) 10Sergio Gimeno: [C:03+2] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [12:04:51] (03CR) 10Sergio Gimeno: [C:03+2] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. 
[extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121338 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [12:06:35] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100% [12:09:19] (03PS1) 10Hnowlan: trafficserver: use testwiki PCS without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1121350 (https://phabricator.wikimedia.org/T385719) [12:14:30] (03CR) 10Jgiannelos: [C:03+1] trafficserver: use testwiki PCS without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1121350 (https://phabricator.wikimedia.org/T385719) (owner: 10Hnowlan) [12:15:45] (03CR) 10CI reject: [V:04-1] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [12:20:48] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918 (10mszabo) 03NEW [12:21:04] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918#10567349 (10mszabo) [12:21:40] (03CR) 10Fabfur: [C:03+1] "Absolutely +1" [puppet] - 10https://gerrit.wikimedia.org/r/1121339 (owner: 10Vgutierrez) [12:21:42] (03CR) 10Sergio Gimeno: [C:03+2] "..." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [12:23:17] (03Merged) 10jenkins-bot: LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121338 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [12:30:56] (03CR) 10CI reject: [V:04-1] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. 
[extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [12:34:03] 06SRE, 06Infrastructure-Foundations, 10netops: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10567389 (10cmooney) I ran gnmic in debug mode on netflow1002 but nothing is jumping out at me as a problem, at least on a basic review of the logs. One thing I do notice, and... [12:39:20] (03Abandoned) 10Sergio Gimeno: LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. [extensions/GrowthExperiments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121337 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [12:39:27] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to analytics-privatedata-users for Máté Szabó - https://phabricator.wikimedia.org/T386918#10567395 (10kostajh) Approving as @mszabo's interim manager. [12:40:52] (03PS1) 10Sergio Gimeno: Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121358 [12:40:59] (03CR) 10Sergio Gimeno: [C:03+2] Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121358 (owner: 10Sergio Gimeno) [12:45:34] (03CR) 10Hnowlan: [C:04-1] "nit: I realise this isn't a real functional chart per se, but are there minimal fixtures that could go here?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [12:45:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1121339 (owner: 10Vgutierrez) [12:45:54] (03CR) 10Volans: "As agreed in the call, I did a pass to the CRs as they are now." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 (owner: 10Federico Ceratto) [12:46:00] (03CR) 10Volans: "As agreed in the call, I did a pass to the CRs as they are now." [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (owner: 10Federico Ceratto) [12:48:02] (03CR) 10Isabelle Hurbain-Palatin: Turn on Parsoid Read Views for 27 wiktionaries (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: 10Arlolra) [12:49:59] (03Merged) 10jenkins-bot: Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds." [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121358 (owner: 10Sergio Gimeno) [12:52:50] !log sgimeno@deploy2002 Started scap sync-world: Backport for [[gerrit:1121338|LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. (T369551)]], [[gerrit:1121358|Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds."]] [12:52:54] T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551 [12:53:58] 06SRE, 10SRE-Access-Requests: Requesting access to Dashboards in Superset for harroyo-wmf - https://phabricator.wikimedia.org/T386922 (10hector.arroyo) 03NEW [12:54:44] 06SRE, 10SRE-Access-Requests: Requesting access to Dashboards in Superset for harroyo-wmf - https://phabricator.wikimedia.org/T386922#10567459 (10kostajh) Approving as @hector.arroyo's interim manager [12:55:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:56] !log sgimeno@deploy2002 sgimeno: Backport for [[gerrit:1121338|LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. 
(T369551)]], [[gerrit:1121358|Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds."]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:56:14] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset for harroyo-wmf - https://phabricator.wikimedia.org/T386922#10567463 (10hector.arroyo) [12:56:52] (03CR) 10Sergio Gimeno: [C:03+2] LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. (031 comment) [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121338 (https://phabricator.wikimedia.org/T369551) (owner: 10Sergio Gimeno) [12:56:58] !log sgimeno@deploy2002 sgimeno: Continuing with sync [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1300) [13:01:27] I'm still deploying the changes from the CC window, not much left [13:03:33] !log sgimeno@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121338|LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds. 
(T369551)]], [[gerrit:1121358|Revert "LevelingUp: Schema migration for GELevelingUpKeepGoingNotificationThresholds."]] (duration: 10m 43s) [13:03:37] T369551: Use a constant to mark minimum for getting started notification - https://phabricator.wikimedia.org/T369551 [13:03:51] Done [13:05:28] (03PS1) 10Ilias Sarantopoulos: ml-services: increase replicas in ref quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121362 [13:08:02] (03CR) 10AikoChou: [C:03+1] ml-services: increase replicas in ref quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121362 (owner: 10Ilias Sarantopoulos) [13:08:47] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: increase replicas in ref quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121362 (owner: 10Ilias Sarantopoulos) [13:08:47] 06SRE, 06Infrastructure-Foundations, 10netops: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10567530 (10cmooney) Also fwiw I grabbed the same stats for 24 hours from both prometheus servers, and compared the total stats. In total there are 115 gaps in the data, 68 of... [13:09:53] (03Merged) 10jenkins-bot: ml-services: increase replicas in ref quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121362 (owner: 10Ilias Sarantopoulos) [13:10:23] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [13:10:35] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[13:15:24] (03CR) 10Stevemunene: "I think, we should go as is incase we need to adjust the min values later on" [alerts] - 10https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [13:20:28] (03CR) 10Ladsgroup: Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [13:20:46] (03PS1) 10Elukey: services: double the capacity for Kartotherian in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121363 (https://phabricator.wikimedia.org/T386926) [13:23:14] (03PS2) 10Ladsgroup: Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [13:23:38] (03CR) 10Ladsgroup: Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [13:23:49] jouncebot: nowandnext [13:23:49] For the next 0 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1300) [13:23:49] In 0 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1400) [13:24:00] (03CR) 10Ladsgroup: [C:03+2] Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [13:24:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [13:24:45] (03Merged) 10jenkins-bot: Take 2: 
Footer: Wikimedia icon should collapse at lower resolutions"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [13:25:12] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1121098|Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (T384619)]] [13:25:16] T384619: Update skins to support different logos at different resolutions - https://phabricator.wikimedia.org/T384619 [13:28:11] !log ladsgroup@deploy2002 ladsgroup, jdlrobson: Backport for [[gerrit:1121098|Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (T384619)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:28:28] (03CR) 10Elukey: [C:03+1] Bump versions of Java 11/17 production images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120544 (owner: 10Muehlenhoff) [13:30:32] !log ladsgroup@deploy2002 ladsgroup, jdlrobson: Continuing with sync [13:33:08] (03PS2) 10Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) [13:33:08] (03PS2) 10Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) [13:33:08] (03PS2) 10Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1121347 [13:33:43] (03CR) 10CI reject: [V:04-1] wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [13:36:10] (03PS1) 10David Caro: toolforge: add jobs-emailer stats gathering [puppet] - 10https://gerrit.wikimedia.org/r/1121364 (https://phabricator.wikimedia.org/T320284) [13:37:02] (03CR) 10Brouberol: [C:03+1] Create dse-k8s 
control panel partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/1121335 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene) [13:37:07] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121098|Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" (T384619)]] (duration: 11m 54s) [13:37:10] T384619: Update skins to support different logos at different resolutions - https://phabricator.wikimedia.org/T384619 [13:40:46] (03PS3) 10Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) [13:40:46] (03PS3) 10Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) [13:40:46] (03PS3) 10Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1121347 [13:44:26] (03CR) 10David Caro: [V:03+1] "Manually tested in tools:" [puppet] - 10https://gerrit.wikimedia.org/r/1121364 (https://phabricator.wikimedia.org/T320284) (owner: 10David Caro) [13:50:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:33] (03CR) 10Hnowlan: [C:03+1] services: double the capacity for Kartotherian in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121363 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [13:55:27] (03PS1) 10Jforrester: Re-update function-schemata sub-module to HEAD (39b22ad) [extensions/WikiLambda] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121366 [13:55:33] RESOLVED: 
KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:56:27] (03PS1) 10Brouberol: airflow-research: allow task pods to reach out to gitlab.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121367 (https://phabricator.wikimedia.org/T386933) [13:57:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1161:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1161 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:59:21] (03CR) 10Elukey: [C:03+2] services: double the capacity for Kartotherian in production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121363 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1400). [14:00:05] Daimona, ihurbain, and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:16] indeed! [14:00:23] o/ [14:00:24] * TheresNoTime is not able to deploy today! 
[14:00:27] o/ [14:00:29] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [14:00:38] I’m in a meeting, so I probably can’t deploy [14:00:52] so IDEALLY i'd like to deploy my own, BUT if i do that it would be my very first own deploy, so i'd need someone to hold my hand and tell me to breathe :D [14:01:04] (i *think* i have the proper rights for it, and i have read doc this morning) [14:01:04] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [14:01:19] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [14:01:47] it looks like you’re in the deployment group, yes ^^ [14:01:49] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [14:02:11] (03CR) 10Ssingh: [C:03+1] trafficserver: use testwiki PCS without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1121350 (https://phabricator.wikimedia.org/T385719) (owner: 10Hnowlan) [14:02:28] ihurbain: scap backport is very easy to use, but we can help in case of any trouble [14:02:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1161:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1161 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:02:46] do i feel bold enough. [14:03:12] well Daimona is before you in the list so you could build up some courage first :-D [14:03:37] (03CR) 10Ssingh: "For posterity, we made a typo in the commit message: it is 3.1 that we imported and not 1.3." [puppet] - 10https://gerrit.wikimedia.org/r/1120926 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [14:03:53] or you can deploy that patch as well [14:04:00] good point! [14:04:12] and by the time you get to your own patch, you'll already be an experienced deployer [14:04:35] okay. 
folks, hold my hand, i'm trying to do the deployment window. [14:04:42] sweet! [14:04:45] (aaaa!) [14:05:07] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:05:07] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:05:16] so: I can do the deploys today! (following documentation.) [14:05:39] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:07:08] Daimona: starting with yours. [14:07:25] Yay! Good luck! [14:07:31] \o/ [14:07:36] * TheresNoTime is half-around & watching, please ping if there's anything! [14:08:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121080 (https://phabricator.wikimedia.org/T383800) (owner: 10Daimona Eaytoy) [14:09:02] (03Merged) 10jenkins-bot: Enable $wgCampaignEventsEnableEventInvitation on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121080 (https://phabricator.wikimedia.org/T383800) (owner: 10Daimona Eaytoy) [14:09:33] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1121080|Enable $wgCampaignEventsEnableEventInvitation on most wikis (T383800)]] [14:09:36] T383800: Enable invitation lists by default (except Meta, ZH Wikipedia, and ES Wikipedia) - https://phabricator.wikimedia.org/T383800 [14:09:43] (03CR) 10Fabian Kaelin: [C:03+1] airflow-research: allow task pods to reach out to gitlab.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121367 (https://phabricator.wikimedia.org/T386933) (owner: 10Brouberol) [14:12:07] (03CR) 10Ssingh: "Sorry for not following up on this -- I missed this in the review stack." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: 10BCornwall) [14:12:32] !log ihurbain@deploy2002 daimona, ihurbain: Backport for [[gerrit:1121080|Enable $wgCampaignEventsEnableEventInvitation on most wikis (T383800)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:46] Daimona: if you have stuff to check on mwdebug, now's the time [14:12:59] (03CR) 10Vgutierrez: [C:03+2] liberica: USE CAP_NET_RAW instead of CAP_NET_ADMIN for healthchecks [puppet] - 10https://gerrit.wikimedia.org/r/1121339 (owner: 10Vgutierrez) [14:13:11] Yup, doing [14:14:39] Looking good [14:14:52] then let's gooo [14:14:57] !log ihurbain@deploy2002 daimona, ihurbain: Continuing with sync [14:15:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121366 (owner: 10Jforrester) [14:19:39] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:20:07] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:20:07] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:20:22] (03CR) 10Brouberol: [C:03+2] airflow-research: allow task pods to reach out to gitlab.wikimedia.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121367 (https://phabricator.wikimedia.org/T386933) (owner: 10Brouberol) [14:21:15] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [14:21:35] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121080|Enable $wgCampaignEventsEnableEventInvitation 
on most wikis (T383800)]] (duration: 12m 02s) [14:21:38] T383800: Enable invitation lists by default (except Meta, ZH Wikipedia, and ES Wikipedia) - https://phabricator.wikimedia.org/T383800 [14:21:41] there [14:21:51] Daimona: all good (... in theory :D ) [14:21:53] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [14:22:10] (03CR) 10Majavah: [C:03+1] toolforge: add jobs-emailer stats gathering [puppet] - 10https://gerrit.wikimedia.org/r/1121364 (https://phabricator.wikimedia.org/T320284) (owner: 10David Caro) [14:22:24] Yay, congrats on your first deployment :D [14:22:29] thank you \o/ [14:22:38] well, now i can move forward with miiiine [14:22:46] (03CR) 10MVernon: [C:03+1] cassandra: setup 'dev' target for Cassandra 4.1.8 [puppet] - 10https://gerrit.wikimedia.org/r/1121102 (https://phabricator.wikimedia.org/T385819) (owner: 10Eevans) [14:22:48] (03CR) 10Bking: [C:03+2] Fix typo in opensearch-analysis-stconvert [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1121317 (owner: 10DCausse) [14:22:48] 06SRE, 10Maps, 06Traffic, 13Patch-For-Review: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10567788 (10ssingh) @MSantos: Any update on this? Thanks!
[14:22:52] (03CR) 10Bking: [V:03+2 C:03+2] Fix typo in opensearch-analysis-stconvert [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1121317 (owner: 10DCausse) [14:23:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:23:37] :looks suspicious: [14:23:44] (03Merged) 10jenkins-bot: Turn on Parsoid Read Views for 27 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: 10Arlolra) [14:24:10] (03CR) 10David Caro: [V:03+1 C:03+2] toolforge: add jobs-emailer stats gathering [puppet] - 10https://gerrit.wikimedia.org/r/1121364 (https://phabricator.wikimedia.org/T320284) (owner: 10David Caro) [14:24:16] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1120679|Turn on Parsoid Read Views for 27 wiktionaries (T386762)]] [14:24:19] T386762: Parsoid Read Views to Wiktionary deploy ~2025-02-20 - https://phabricator.wikimedia.org/T386762 [14:27:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:27:19] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:27:22] !log ihurbain@deploy2002 arlolra, ihurbain: Backport for [[gerrit:1120679|Turn on Parsoid Read Views for 27 wiktionaries (T386762)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:28:15] RESOLVED: PHPFPMTooBusy:
Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 21.43% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:29:21] we are parsoided on canary, continuing. [14:29:24] !log ihurbain@deploy2002 arlolra, ihurbain: Continuing with sync [14:30:07] (congrats on your first deploy by the way!) [14:30:13] thank you :D [14:32:41] (03PS3) 10Andrew Bogott: vendordata.txt: include rudimentary clouds.yaml in initial VM [puppet] - 10https://gerrit.wikimedia.org/r/1120683 (https://phabricator.wikimedia.org/T379030) [14:32:41] (03PS8) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) [14:32:42] (03PS1) 10Andrew Bogott: cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369 [14:32:51] tgr|away: just to make sure i understand, you're backporting a set of 3 patches to .16 (currently group 2) and .17 (currently group 0 and 1), and all these (the 6 patches for both branches) can go in a single scap? 
[14:33:09] yeah, they all need to go together [14:33:17] ack :) [14:33:18] (03CR) 10CI reject: [V:04-1] cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369 (owner: 10Andrew Bogott) [14:34:34] (03PS2) 10Andrew Bogott: cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369 [14:34:34] (03PS4) 10Andrew Bogott: vendordata.txt: include rudimentary clouds.yaml in initial VM [puppet] - 10https://gerrit.wikimedia.org/r/1120683 (https://phabricator.wikimedia.org/T379030) [14:34:34] (03PS9) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) [14:34:51] (03CR) 10Giuseppe Lavagetto: "It shouldn't. If it does, it means I made a mistake in this patch, and should be clear from diffs in this change's CI" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1116639 (owner: 10Giuseppe Lavagetto) [14:36:16] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120679|Turn on Parsoid Read Views for 27 wiktionaries (T386762)]] (duration: 12m 00s) [14:36:20] T386762: Parsoid Read Views to Wiktionary deploy ~2025-02-20 - https://phabricator.wikimedia.org/T386762 [14:36:58] (03CR) 10CI reject: [V:04-1] cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369 (owner: 10Andrew Bogott) [14:37:18] (03PS1) 10Ayounsi: Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 [14:37:35] (03CR) 10Stevemunene: [C:03+2] Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [14:37:40] also, i do have logspam-watch running, i'm keeping an eye on it, and it doesn't look RIDICULOUS, but that's basically the only check i'm doing on that [14:37:42] FIRING: JobUnavailable: 
Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:46] (if anything else let me know) [14:37:55] tgr|away: starting your scap now [14:37:57] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-worker1003.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:37:58] (03PS2) 10Ayounsi: Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 [14:38:07] (03PS1) 10Jgreen: Add fundraising-analytics hostgroup and two new checks to nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121371 (https://phabricator.wikimedia.org/T386259) [14:38:14] (03PS3) 10Andrew Bogott: cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369 (https://phabricator.wikimedia.org/T379030) [14:38:15] (03PS5) 10Andrew Bogott: vendordata.txt: include rudimentary clouds.yaml in initial VM [puppet] - 10https://gerrit.wikimedia.org/r/1120683 (https://phabricator.wikimedia.org/T379030) [14:38:15] (03PS10) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) [14:38:22] (03CR) 10Volans: "Actually a better fix is doing something like I77af7f4aab59572f2a93ffd82d78d7027b67a41f" [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [14:38:46] (03Merged) 10jenkins-bot: Port disk space check for hadoop worker to Alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1121131 (https://phabricator.wikimedia.org/T371080) (owner: 10Stevemunene) [14:38:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121328 
(https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [14:38:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121329 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [14:38:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121330 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [14:38:53] yeah, logspam-watch or the mediawiki-errors logstash dashboard is all you need [14:38:53] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121332 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [14:38:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121333 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [14:38:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ihurbain@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121334 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [14:39:42] tgr|away: my remark was more about "i'm not sure i'd be able to catch anything that's not entirely ridiculous but also not normal" [14:39:48] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10567848 (10Jclark-ctr) [14:41:07] it's also the patch owner's responsibility to keep an eye out for errors as they test. 
[14:41:18] speaking of which, I oughta pull up the dashboard [14:41:22] :D [14:41:29] (03CR) 10Hnowlan: [C:03+2] trafficserver: use testwiki PCS without restbase [puppet] - 10https://gerrit.wikimedia.org/r/1121350 (https://phabricator.wikimedia.org/T385719) (owner: 10Hnowlan) [14:42:14] there's an mwdebug-specific dashboard which is more useful for testing [14:42:36] (03CR) 10Filippo Giunchedi: [C:03+2] Add fundraising-analytics hostgroup and two new checks to nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121371 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [14:42:57] the error dashboard / logspam-watch probably won't tell you something is wrong until it hits production [14:43:12] (03CR) 10CI reject: [V:04-1] Fix tox bug [software/homer] - 10https://gerrit.wikimedia.org/r/1121370 (owner: 10Ayounsi) [14:43:18] though it's very rare that that happens, scap has a bunch of canary checks built in [14:43:21] that dashboard is mentioned specifically in the backport deployer's docs. I have both that and the prod one pulled up [14:44:02] ah yes indeed [14:44:46] in the bad old days where deploying was more of a manual process, you could break things by e.g. syncing files in the wrong order and that caused huge error spikes [14:45:18] I deployed during those bad old days, and they were bad. [14:45:26] (03Merged) 10jenkins-bot: Restore "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121328 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [14:45:44] i'm happy to live in the good new days, then. (well, as far as deployments are concerned.) [14:45:47] these days if something is non-broken enough to pass the scap canary tests, any problems it causes probably won't be obvious [14:46:32] deployers these days. with their single command deploys. back in my day we had to walk barefoot through the snow to deploy... uphill...
in both directions :-P [14:46:45] good to watch the logspam just in case, but it has been years since it last helped me catch a bug [14:46:56] apergos: And we were grateful! ;-) [14:46:59] !log bking@apt1002:~/pkg$ sudo -E reprepro -C component/opensearch13 include bullseye-wikimedia $HOME/pkg/wmf-opensearch-search-plugins_1.3.20-1_amd64.changes T380752 [14:47:00] lolol [14:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:02] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [14:47:08] :D [14:47:13] (03PS5) 10Giuseppe Lavagetto: Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 [14:47:14] (03PS4) 10Giuseppe Lavagetto: mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 [14:47:14] (03PS5) 10Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117548 [14:47:27] (03Merged) 10jenkins-bot: SharedDomainUtils: Avoid early instantiation of NamespaceInfo [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121330 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [14:47:29] (03Merged) 10jenkins-bot: Make sure isSul3Enabled() is a boolean [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121333 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [14:47:30] (03Merged) 10jenkins-bot: Restore "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121329 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő Tisza) [14:47:31] (03Merged) 10jenkins-bot: SharedDomainUtils: Avoid early instantiation of NamespaceInfo [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121332 (https://phabricator.wikimedia.org/T386836) (owner: 10Gergő 
Tisza) [14:47:51] (03CR) 10Giuseppe Lavagetto: "I will add fixtures later, that should help catch issues like the one we had with dependencies." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [14:47:58] OTOH, back when I was deploying ~10 times a day I could slip out a minor config tweak to all production in < 45 seconds. No fancy commands, no safety systems, no atomic roll-outs, no canaries, and nothing to slow things down. [14:48:05] (one left to merge on the stack of 6) [14:48:13] Nowadays the very fastest deploys take ~6 mins. [14:48:35] * James_F shakes his walking stick at the sky. [14:48:43] but it's been a long time since anyone's earned The Shirt for a deploy (at least, I think it's been a long time) [14:48:54] don't jinx meeee :D [14:49:03] taking it all back right now :-) [14:49:06] Indeed. Fingers crossed, etc. [14:49:32] 10ops-eqiad, 06Data-Persistence, 06DC-Ops: Q3:rack/setup/install backup101[34] - https://phabricator.wikimedia.org/T384977#10567903 (10Jclark-ctr) [14:50:02] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1003.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:50:22] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker1004.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:51:33] (03Merged) 10jenkins-bot: Make sure isSul3Enabled() is a boolean [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121334 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [14:52:08] !log ihurbain@deploy2002 Started scap sync-world: Backport for [[gerrit:1121328|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121329|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121330|SharedDomainUtils: Avoid early instantiation of 
NamespaceInfo (T386836)]], [[gerrit:1121332|SharedDomainUtils: Avoid early in [14:52:08] stantiation of NamespaceInfo (T386836)]], [[gerrit:1121333|Make sure isSul3Enabled() is a boolean (T384549)]], [[gerrit:1121334|Make sure isSul3Enabled() is a boolean (T384549)]] [14:52:15] T386836: Wikibase CI broken with several errors - https://phabricator.wikimedia.org/T386836 [14:52:15] T384549: Create a per-user flag for enabling SUL3 - https://phabricator.wikimedia.org/T384549 [14:53:45] !log upload liberica 0.8 to apt.wm.o (bookworm-wikimedia) [14:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:20] (03CR) 10Hnowlan: [C:03+1] mediawiki-common: introduce chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117547 (owner: 10Giuseppe Lavagetto) [14:55:05] (03CR) 10Hnowlan: [C:03+1] Add the networkpolicy feature flag [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117225 (owner: 10Giuseppe Lavagetto) [14:55:07] !log ihurbain@deploy2002 tgr, ihurbain: Backport for [[gerrit:1121328|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121329|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121330|SharedDomainUtils: Avoid early instantiation of NamespaceInfo (T386836)]], [[gerrit:1121332|SharedDomainUtils: Avoid early instantiatio [14:55:07] n of NamespaceInfo (T386836)]], [[gerrit:1121333|Make sure isSul3Enabled() is a boolean (T384549)]], [[gerrit:1121334|Make sure isSul3Enabled() is a boolean (T384549)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:55:16] !log testing liberica 0.8 in lvs1013 [14:55:17] tgr|away: canary time! 
[14:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:24] ihurbain: looking, might take a bit [14:56:44] i actually have a few log lines popping on mwdebug logstash, i'm assuming this is "transient restart" stuff, but please confirm [14:58:13] usually it just means someone is using the WikimediaDebug extension [14:58:38] unlike the production errors dashboard, it's not filtered by severity [14:58:39] !log elukey@puppetserver1001 conftool action : set/pooled=inactive:weight=5; selector: name=wikikube-worker2001.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [14:58:47] ack [14:59:49] the errors are all session cache failures, which is outside MediaWiki so I don't think it can be caused by an MW deploy [15:00:26] a few of these are 503s (failed to store some session) but those are all at 14:54 so [15:00:30] I think it's ok [15:00:32] let's say that if there's "deploy around centralauth" and "stuff that talks about sessions in the logs", it feels worth checking :D [15:01:00] though it's also an error that I don't see on production at all, so not sure what's up with that [15:01:15] but the patches aren't related to session handling [15:01:42] I guess let's see if it happens more or was a one-time fluke [15:02:29] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker2001.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [15:02:29] so far these are 14:54, 14:55, 6 of them total, that's it.
[15:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:06] php was restarted at 14:54 [15:05:11] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-02-19-134350 to 2025-02-20-140756 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121380 (https://phabricator.wikimedia.org/T383448) [15:05:12] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-02-19-135838 to 2025-02-20-142923 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121381 (https://phabricator.wikimedia.org/T383448) [15:06:10] also, for the sake of communication: yes, the backport window is running over a bit [15:06:45] couple warnings, probably side effects of the tests [15:07:10] (token mismatch, couldn't find global id, one of each) [15:08:01] gah, WikimediaDebug automatically disabling itself is so annoying [15:08:09] oh does it? woops [15:08:18] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be2088.codfw.wmnet [15:14:15] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10568007 (10elukey) I've updated the Broadcom 3908's firmware on ms-be2088 as indicated by Supermicro, since the changelog shows some JB...
[15:16:07] Expectation (readQueryRows <= 10000) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: 12437) in trx #ab165eaf69: SELECT pi_property_id,pi_info FROM `wb_property_info` [15:16:25] that's just now and the only thing possibly of interest imo [15:16:27] well, it's not really doing what I'd expect but it's not breaking anything either [15:16:45] maybe I'm just misremembering how it needs to be configured [15:17:27] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker2002.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [15:17:38] anyhow I think that's good enough to deploy - unless apergos wants to do more checks [15:17:59] did you change the config locally on one of the mwdebug instances? or...? [15:18:25] apergos: that one is known, one sec [15:18:26] I mean, if it's 0/0 and always on the target wiki, you shouldn't see any behavioural change [15:18:34] (the wb_property_info I mean) [15:18:35] "Couldn't find a global ID for user Tgr-test-c1121328" [15:18:43] I guess that would explain it [15:19:12] not related to these patches though, something seems to be broken in CentralAuthUser caching [15:19:13] T349511 is that one messae0 [15:19:13] T349511: [LIB] [TECH] Wikibase reads too many wb_property_info rows at once (expectation readQueryRows <= 10000 not met) - https://phabricator.wikimedia.org/T349511 [15:19:15] *message [15:19:25] thanks Lucas_WMDE [15:19:34] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ms-be2088.codfw.wmnet [15:19:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:48] Lucas_WMDE: I checked Wikidata and it seemed fine, not sure if you want to check anything more specific [15:19:57] I’ll 
take a quick look [15:20:03] but if it didn’t completely crash I’d guess it’s okay [15:20:13] just... call us paranoid :-)) [15:20:28] !log bking@apt1002:~/pkg$ sudo -E reprepro -C component/opensearch13 remove bullseye-wikimedia wmf-opensearch-search-plugins T380752 [15:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:32] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [15:20:43] !log bking@apt1002:~/pkg$ sudo -E reprepro -C component/opensearch13 include bullseye-wikimedia $HOME/pkg/wmf-opensearch-search-plugins_1.3.20-1_amd64.changes (again)T380752 [15:20:44] editing works https://www.wikidata.org/w/index.php?title=Q4115189&diff=prev&oldid=2314331748 [15:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:56] mwdebug right? Lucas_WMDE [15:20:59] yeah [15:21:09] good enough for me then, let's keep going [15:21:16] continuing with sync? [15:21:22] server [15:21:22] mw-debug.codfw.pinkunicorn-85b7df9765-n5sxp [15:21:27] yeah, go ahead imho :) [15:21:31] ship it! 
[15:21:34] !log ihurbain@deploy2002 tgr, ihurbain: Continuing with sync [15:21:42] (that was a response header, firefox copied it on two separate lines 🤷) [15:21:48] lol [15:23:48] (03CR) 10Bking: [C:03+2] cirrus: drop cirrus_saneitize_jobs periodic job (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/1113741 (owner: 10DCausse) [15:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10568060 (10phaultfinder) [15:25:29] RECOVERY - OpenSearch health check for shards on 9200 on relforge1004 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 5, number_of_data_nodes: 5, discovered_master: True, active_primary_shards: 248, active_shards: 497, relocating_shards: 2, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_fli [15:25:29] h: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:28:11] !log ihurbain@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121328|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121329|Restore "Add configuration options and global preference for the SUL3 rolllout" (T386836)]], [[gerrit:1121330|SharedDomainUtils: Avoid early instantiation of NamespaceInfo (T386836)]], [[gerrit:1121332|SharedDomainUtils: Avoid early i [15:28:11] nstantiation of NamespaceInfo (T386836)]], [[gerrit:1121333|Make sure isSul3Enabled() is a boolean (T384549)]], [[gerrit:1121334|Make sure isSul3Enabled() is a boolean (T384549)]] (duration: 36m 02s) [15:28:15] T386836: Wikibase CI broken with several errors - https://phabricator.wikimedia.org/T386836 [15:28:15] T384549: Create a per-user flag for enabling SUL3 - https://phabricator.wikimedia.org/T384549 [15:28:33] there! [15:28:52] thank you ihurbain! 
[15:29:21] congrats on your first deploy and running the entire window! hope you do more of it soon! [15:29:41] !log UTC afternoon deploys done [15:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:01] there! thank y'all for the support :) [15:31:38] ihurbain: congrats on your first deployment window \o/ [15:32:16] \o/ [15:32:22] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943 (10MGA73) 03NEW [15:33:55] (03PS1) 10Jforrester: [wikifunctionswiki] Give wikilambda-bypass-cache to staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432) [15:34:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432) (owner: 10Jforrester) [15:34:15] I'll be at train log triage in half an hour, so if anything weird pop up there, I'll report back [15:37:14] the one potential problem I can think of is generating too much DB load, since the global preference lookups now happen on all wikis [15:37:37] I'll see if there's a dashboard for checking that [15:43:59] does anyone mind if I deploy one more config change? it should be a no-op (config cleanup) [15:44:05] cc apergos, dancy, andre for the upcoming window [15:44:15] OK w/ me. [15:44:28] no objections here [15:44:37] thanks! [15:44:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [15:45:04] did a quick check with rg to confirm that the old option name doesn’t appear anywhere except in wmf-config/Wikibase.php [15:45:16] (i.e. 
not in the existing branch directories on deploy1002) [15:45:29] (03Merged) 10jenkins-bot: Remove `tmpEnableMulLanguageCode` setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1115016 (https://phabricator.wikimedia.org/T330217) (owner: 10Arthur taylor) [15:45:56] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1115016|Remove `tmpEnableMulLanguageCode` setting (T330217)]] [15:45:59] T330217: MUL - Cleanup soft rollout flag - https://phabricator.wikimedia.org/T330217 [15:46:44] !log update k9s in bookworm-wikimedia thirdparty/k9s to 0.40.5 [15:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:50] !log lucaswerkmeister-wmde@deploy2002 arthurtaylor, lucaswerkmeister-wmde: Backport for [[gerrit:1115016|Remove `tmpEnableMulLanguageCode` setting (T330217)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:49:01] testing [15:49:16] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943#10568176 (10MatthewVernon) The problem with the thumbnail is that the image is malformed. If I download it and open it in GIMP, it says: ` Error loading PNG file: IDAT: in... [15:49:41] !log lucaswerkmeister-wmde@deploy2002 arthurtaylor, lucaswerkmeister-wmde: Continuing with sync [15:49:42] lgtm [15:50:35] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943#10568177 (10MatthewVernon) [I would expect the usual thing to do would be to upload the fixed image as a new version of the broken one, rather than trying to move somethin... 
[15:51:30] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster - bking@cumin2002 - T380752 [15:51:31] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster - bking@cumin2002 - T380752 [15:51:34] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [15:51:37] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=5; selector: name=wikikube-worker2003.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [15:56:39] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1115016|Remove `tmpEnableMulLanguageCode` setting (T330217)]] (duration: 10m 43s) [15:56:43] T330217: MUL - Cleanup soft rollout flag - https://phabricator.wikimedia.org/T330217 [15:56:44] * Lucas_WMDE done deploying [15:58:51] and 5 minutes until train log triage, so that's perfect [15:59:09] (03PS2) 10Scott French: aptrepo: add component/pcre2 for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1120586 (https://phabricator.wikimedia.org/T386006) [15:59:09] (03PS2) 10Scott French: package_builder: add pbuilder hook for pcre2 component [puppet] - 10https://gerrit.wikimedia.org/r/1120587 (https://phabricator.wikimedia.org/T386006) [15:59:09] (03PS1) 10Scott French: aptrepo: update pcre2 backport from apt-staging [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) [16:00:05] dancy and andre: Time to snap out of that daydream and deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1600). 
[16:03:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:03:41] !log upload liberica 0.9 to apt.wm.o (bookworm-wikimedia) [16:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:15] (03CR) 10Scott French: "I think this should be the last step once the stable-to-bullseye backports are available in apt-staging. Other than my actually going and " [puppet] - 10https://gerrit.wikimedia.org/r/1121388 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:08:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10568237 (10Pppery) a:05Ben.buchenau→03None [16:08:37] (03CR) 10Herron: [C:03+1] hiera: restore thanos retention settings [puppet] - 10https://gerrit.wikimedia.org/r/1121324 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [16:09:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10568239 (10phaultfinder) [16:09:54] (03PS1) 10Jgreen: Fix hostgroup and alpha order for analytics role passive checks in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259) [16:10:33] (03CR) 10CI reject: [V:04-1] Fix hostgroup and alpha order for analytics role passive checks in nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [16:10:46] !log updating liberica to version 0.10 in ulsfo load balancers [16:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:17] (03PS1) 10Elukey: 
services: update Kartotherian's replicas to 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121391 (https://phabricator.wikimedia.org/T386926) [16:12:45] !log elukey@puppetserver1001 conftool action : set/pooled=inactive:weight=5; selector: name=wikikube-worker2003.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:13:00] !log elukey@puppetserver1001 conftool action : set/pooled=inactive:weight=5; selector: name=wikikube-worker1004.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:13:36] anyone around who could rotate some accidentally leaked phabricator bot credentials for me? [16:16:52] (03CR) 10Jgiannelos: [C:03+2] Bust cache for recreated pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118890 (https://phabricator.wikimedia.org/T386244) (owner: 10Arlolra) [16:17:18] (03CR) 10SBassett: [C:03+1] [wikifunctionswiki] Give wikilambda-bypass-cache to staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432) (owner: 10Jforrester) [16:17:36] (03PS2) 10Jgreen: Fix hostgroup and alpha order for analytics passive checks in nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259) [16:18:02] (03Merged) 10jenkins-bot: Bust cache for recreated pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118890 (https://phabricator.wikimedia.org/T386244) (owner: 10Arlolra) [16:18:13] (03CR) 10CI reject: [V:04-1] Fix hostgroup and alpha order for analytics passive checks in nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [16:20:47] (03PS3) 10Jgreen: Fix hostgroup and order for analytics checks in nsca_frack.cfg.erb. 
[puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259) [16:23:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:27:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:27:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:30:46] ^^ all done in T386949 for people following along [16:31:03] !log dancy@deploy2002 Installing scap version "4.137.0" for 204 host(s) [16:32:23] (03CR) 10Filippo Giunchedi: [C:03+2] Fix hostgroup and order for analytics checks in nsca_frack.cfg.erb. [puppet] - 10https://gerrit.wikimedia.org/r/1121390 (https://phabricator.wikimedia.org/T386259) (owner: 10Jgreen) [16:32:51] 10ops-codfw, 06DC-Ops: Install test Mellanox nic into sretest2001 - https://phabricator.wikimedia.org/T386951 (10RobH) 03NEW p:05Triage→03High [16:33:12] 10ops-codfw, 06DC-Ops: Install test Mellanox nic into sretest2001 - https://phabricator.wikimedia.org/T386951#10568366 (10RobH) [16:33:28] (03PS1) 10Vgutierrez: liberica: run cp checks periodically [puppet] - 10https://gerrit.wikimedia.org/r/1121394 [16:34:15] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121394 (owner: 10Vgutierrez) [16:35:34] !log dancy@deploy2002 Installation of scap version "4.137.0" completed for 204 hosts [16:40:42] (03PS1) 10ZhaoFJx: zhwiki: Create abusefilter editor group on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) [16:41:22] (03CR) 10Ssingh: [C:03+1] liberica: run cp checks periodically (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121394 (owner: 10Vgutierrez) [16:43:18] (03CR) 10Vgutierrez: [C:03+2] liberica: 
run cp checks periodically [puppet] - 10https://gerrit.wikimedia.org/r/1121394 (owner: 10Vgutierrez) [16:45:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:48:47] (03CR) 10ZhaoFJx: zhwiki: Create abusefilter editor group on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: 10ZhaoFJx) [16:49:03] !log arlolra@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [16:49:51] !log arlolra@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [16:50:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:50:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: 10ZhaoFJx) [16:53:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:55:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:55:35] !log arlolra@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [16:56:22] !log phab1004 (phabricator) - systemctl stop phabricator_stats_job_mfa_check timer and 
service; systemctl (gerrit:1117489) [16:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:40] !log arlolra@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [17:00:04] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1700). [17:00:05] Krinkle: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:21] o/ [17:00:52] o/ [17:02:06] (03CR) 10RLazarus: [C:03+2] mediawiki: Add rewrite rule to fix serving of /.well-known static files [puppet] - 10https://gerrit.wikimedia.org/r/1120216 (https://phabricator.wikimedia.org/T385520) (owner: 10Krinkle) [17:02:58] I'll deploy to metal mwdebug2001 first just because it's easy, then scap stopping at kubernetes debug hosts, then everywhere [17:03:14] Ack [17:03:20] transient httpbb alerts are expected [17:04:17] https://auth.wikimedia.beta.wmflabs.org/.well-known/assetlinks.json [17:04:30] https://auth.wikimedia.org/.well-known/assetlinks.json [17:04:43] I'll be looking at that one on mwdebug in a minute [17:04:55] !log arlolra@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [17:05:20] live at mwdebug2001 [17:05:31] !log arlolra@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [17:05:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:06:43] httpbb passes [17:08:46] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:08:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
puppetserver2004.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:10:09] rzl: LGTM [17:10:14] 👍 [17:10:57] !log rzl@deploy2002 Started scap sync-world: T385520 [17:11:00] T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520 [17:12:26] !log rzl@deploy2002 rzl: T385520 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:12:52] httpbb's still passing on k8s-mwdebug, go ahead and test [17:13:14] (03PS1) 10Vgutierrez: liberica: Fix liberica cp check job [puppet] - 10https://gerrit.wikimedia.org/r/1121401 [17:13:31] LGTM on k8s mwdebug [17:13:36] !log rzl@deploy2002 rzl: Continuing with sync [17:13:40] (03CR) 10Ssingh: [C:03+1] liberica: Fix liberica cp check job [puppet] - 10https://gerrit.wikimedia.org/r/1121401 (owner: 10Vgutierrez) [17:13:58] (03CR) 10Vgutierrez: [C:03+2] liberica: Fix liberica cp check job [puppet] - 10https://gerrit.wikimedia.org/r/1121401 (owner: 10Vgutierrez) [17:14:10] PROBLEM - Disk space on netflow1002 is CRITICAL: DISK CRITICAL - free space: / 0MiB (0% inode=93%): /tmp 0MiB (0% inode=93%): /var/tmp 0MiB (0% inode=93%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netflow1002&var-datasource=eqiad+prometheus/ops [17:16:23] (03CR) 10Eevans: [C:03+2] cassandra: setup 'dev' target for Cassandra 4.1.8 [puppet] - 10https://gerrit.wikimedia.org/r/1121102 (https://phabricator.wikimedia.org/T385819) (owner: 10Eevans) [17:19:23] !log rzl@deploy2002 Finished scap sync-world: T385520 (duration: 09m 01s) [17:19:27] T385520: Deploy DAL files for seamless credential sharing in Chrome - https://phabricator.wikimedia.org/T385520 [17:19:43] worksforme in prod now [17:19:54] sweet [17:20:07] thanks for writing an httpbb test, that made everything easy [17:20:14] :) [17:20:27] puppet window complete! gavel gavel [17:21:30] Like the Law & Order sound? 
[17:22:51] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev2001.codfw.wmnet: Upgrade to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [17:23:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:24:24] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047 - https://phabricator.wikimedia.org/T386083#10568570 (10Jhancock.wm) a:03VRiley-WMF [17:24:35] dancy: remember the big freaky klingon gavel from Star Trek VI? closer to that [17:24:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383213#10568572 (10Jhancock.wm) a:03VRiley-WMF [17:24:59] (also --k8s-confirm-diff is still in good shape, thanks for that work) [17:25:16] https://memory-alpha.fandom.com/wiki/Gavel?file=Klingon_Magistrate.jpg [17:25:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission mw[1349-1413] - https://phabricator.wikimedia.org/T375842#10568578 (10Jhancock.wm) a:03VRiley-WMF [17:25:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: [Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 on cloudvirt1047 - https://phabricator.wikimedia.org/T386083#10568579 (10Andrew) 05Resolved→03Invalid I'm putting this host back in service and closing the... 
[17:26:15] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121404 [17:26:29] oh my god of *course* there's a whole article on Gavel [17:26:44] we do love a star trek courtroom drama, I guess that makes sense [17:29:48] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev2001.codfw.wmnet: Upgrade to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [17:29:51] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [17:29:54] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [17:37:19] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching cassandra-dev200[2-3].codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [17:38:24] (03PS1) 10ZhaoFJx: cowikimedia: Change the workmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872) [17:38:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [17:39:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10568629 (10phaultfinder) [17:41:17] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959 (10BCornwall) 03NEW [17:42:08] cccccbukvgbchuuebhklecbdrrbhvulvgeliecljvdvb [17:42:10] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568650 (10BCornwall) [17:42:16] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, and 2 others: CPU temperature issues in cp hosts - 
https://phabricator.wikimedia.org/T373993#10568651 (10BCornwall) [17:42:59] yea, that button is way too close to the keyboard now with the nanon [17:43:05] (03CR) 10BCornwall: "Filed at https://phabricator.wikimedia.org/T386959" [cookbooks] - 10https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: 10BCornwall) [17:47:33] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [17:47:37] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [17:50:22] RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1003 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [17:51:06] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching cassandra-dev200[2-3].codfw.wmnet: Upgrading to Cassandra 4.1.8 — T385819 - eevans@cumin1002 [17:51:49] (03PS1) 10Vgutierrez: sre: Provide LibericaDiffFPCheck alert [alerts] - 10https://gerrit.wikimedia.org/r/1121409 [17:53:00] (03CR) 10CI reject: [V:04-1] sre: Provide LibericaDiffFPCheck alert [alerts] - 10https://gerrit.wikimedia.org/r/1121409 (owner: 10Vgutierrez) [17:53:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:55:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:55:51] (03PS1) 10BryanDavis: developer-portal: 
Bump container to 2025-02-17-122018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121411 [17:59:40] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-02-17-122018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121411 (owner: 10BryanDavis) [18:00:05] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1800) [18:00:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:54] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-02-17-122018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121411 (owner: 10BryanDavis) [18:09:52] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:10:04] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:08] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:13] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:10:54] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.144 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:58] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.214 second 
response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:11:09] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:14:53] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps instance: populate /etc/openstack/project_id [puppet] - 10https://gerrit.wikimedia.org/r/1121369 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [18:17:57] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:18:05] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:18:24] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:21:41] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568826 (10ssingh) Hi @wiki_willy: adding you based on our discussion so you can triage it accordingly, thanks! [18:23:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:25:11] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341#10568841 (10Jhancock.wm) a:03Jhancock.wm [18:28:01] (03PS2) 10Ssingh: sre: Provide LibericaDiffFPCheck alert [alerts] - 10https://gerrit.wikimedia.org/r/1121409 (owner: 10Vgutierrez) [18:29:42] (03CR) 10CI reject: [V:04-1] sre: Provide LibericaDiffFPCheck alert [alerts] - 10https://gerrit.wikimedia.org/r/1121409 (owner: 10Vgutierrez) [18:29:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10568871 (10phaultfinder) [18:30:59] (03CR) 10Dzahn: [V:03+1 C:03+2] "no effect on prod puppetmasters https://puppet-compiler.wmflabs.org/output/1121079/4960/" [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [18:32:56] (03PS3) 10Ssingh: sre: Provide 
LibericaDiffFPCheck alert [alerts] - 10https://gerrit.wikimedia.org/r/1121409 (owner: 10Vgutierrez) [18:33:30] (03CR) 10Dzahn: [V:03+1 C:03+2] "Should have said "puppetservers" --> https://puppet-compiler.wmflabs.org/output/1121079/4961/" [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [18:37:32] (03CR) 10Dzahn: [V:03+1 C:03+2] "prod: puppetserver1001, puppetmaster1003, puppetmaster2001 - noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [18:41:31] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10568908 (10Dzahn) Arthur clarified the new key is a yubikey key. I advised to first have this added in addition to the existing key and test things. And that we sho... [18:41:57] cccccbukvgbcvdetuvfrvbuudkljcjrnljgbhjnedkee [18:42:07] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster - bking@cumin2002 - T380752: [18:42:08] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: test operations in mixed opensearch/elasticsearch cluster - bking@cumin2002 - T380752: [18:42:11] T380752: Migrate Relforge to Opensearch - https://phabricator.wikimedia.org/T380752 [18:44:33] (03CR) 10Dzahn: [V:03+1 C:03+2] "cloud: puppetmaster-1003.devtools - noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [18:44:59] (03CR) 10Ssingh: "Hi @dzahn@wikimedia.org: I am guessing there is another related commit for this to be followed up in setting cache::alternate_domains; wou" [puppet] - 
10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [18:45:30] (03CR) 10Dzahn: [V:03+1 C:03+2] "This fixed the previous puppet error for any new puppetserver in cloud but it's just on to the next issue now. ""Unable to create director" [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [18:45:53] (03CR) 10Ssingh: [C:03+1] "Looks good but I was wondering if we should just put in the dashboard links before merging this so that we don't forget." [alerts] - 10https://gerrit.wikimedia.org/r/1121409 (owner: 10Vgutierrez) [18:46:23] (03CR) 10Dzahn: [V:03+1 C:03+2] "seems the puppetserver module has never been tested on cloud before" [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [18:47:19] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568916 (10wiki_willy) a:03RobH Thanks for creating this task @ssingh. @RobH - can open up a Dell Tech Support ticket to get one of the technicians out to magru and see if they can figure out what might... [18:48:08] (03PS1) 10Stevemunene: Fix team name typo for hadoop worker [alerts] - 10https://gerrit.wikimedia.org/r/1121415 (https://phabricator.wikimedia.org/T386900) [18:48:32] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568925 (10ssingh) Thanks Willy! @BCornwall has been leading this from the Traffic team and will be the point of contact for this. [18:49:15] (03CR) 10Dzahn: "@sukhe There is not, so far. 
This is only in response to https://phabricator.wikimedia.org/T274228#10529569 to create the mere possibility" [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [18:50:09] (03CR) 10Dzahn: "in this form this is only meant to be "allow a new option in varnish that did not exist before" and that's it." [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [18:50:25] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:51:20] (03CR) 10Ssingh: "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [18:51:44] (03CR) 10Ssingh: "(This looks good to me for what it's worth.)" [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [18:57:35] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10568948 (10RobH) It was my understanding this issue was resolved by the new temp profile settings on T373993, do we still need to open a case on this? [19:00:04] dancy and andre: That opportune time for a MediaWiki train - Utc-7+Utc-0 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T1900). [19:00:13] uh uh [19:00:14] o/ [19:00:33] * dancy presses the button. 
[19:00:39] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121417 (https://phabricator.wikimedia.org/T382368) [19:00:41] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121417 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [19:01:02] dancy: sounds like spiderpig :o [19:01:24] It's on my todo list. [19:01:46] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121417 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [19:11:26] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.17 refs T382368 [19:11:30] T382368: 1.44.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T382368 [19:12:38] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on netflow1002.eqiad.wmnet with reason: keeping gnmic running in debug mode to observe performance change [19:14:10] RECOVERY - Disk space on netflow1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=netflow1002&var-datasource=eqiad+prometheus/ops [19:16:04] (03CR) 10Andrew Bogott: [C:03+2] realm.pp: remove use of $labsproject [puppet] - 10https://gerrit.wikimedia.org/r/1121343 (owner: 10Andrew Bogott) [19:16:13] (03CR) 10Andrew Bogott: [C:03+2] rename validatelabsfqdn.py to validatecloudvpsfqdn.py [puppet] - 10https://gerrit.wikimedia.org/r/1121342 (owner: 10Andrew Bogott) [19:18:42] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121342 (owner: 10Andrew Bogott) [19:19:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:20:32] 06SRE, 06Data-Engineering, 10DPE-Mediawiki-Content, 10Dumps-Generation, 07Epic: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10569062 (10Ahoelzl) [19:23:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:23:46] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10569088 (10wiki_willy) Hey @RobH - Sukhbir and I were talking at the offsite after the fix was implemented. While increasing the fan speed helped specifically in this scenario, the other sites are able to... [19:24:14] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10569097 (10phaultfinder) [19:25:14] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:27:06] PROBLEM - Ensure traffic_manager is running for instance backend on cp4038 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:28:06] RECOVERY - Ensure traffic_manager is running for instance backend on cp4038 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:30:45] (03CR) 10Kamila Součková: [C:03+1] services: update Kartotherian's replicas to 20 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1121391 (https://phabricator.wikimedia.org/T386926) (owner: 10Elukey) [19:31:29] (03PS2) 10Andrew Bogott: validatecloudvpsfqdn.py: Support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030) [19:31:29] (03PS4) 10Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) [19:31:29] (03PS4) 10Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) [19:31:30] (03PS4) 10Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1121347 [19:31:31] (03PS1) 10Andrew Bogott: validatecloudvpsfqdn.py: Only support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121423 (https://phabricator.wikimedia.org/T379030) [19:35:55] (03PS3) 10Andrew Bogott: validatecloudvpsfqdn.py: Support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030) [19:35:55] (03PS5) 10Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) [19:35:55] (03PS5) 10Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) [19:35:56] (03PS5) 10Andrew Bogott: wmcs puppet-enc: use project id for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1121347 [19:35:57] (03PS2) 10Andrew Bogott: validatecloudvpsfqdn.py: Only support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121423 (https://phabricator.wikimedia.org/T379030) [19:37:22] (03CR) 10Andrew Bogott: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121347 (owner: 10Andrew Bogott) [19:37:27] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121423 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [19:37:32] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [19:37:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [19:37:38] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [19:41:36] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943#10569165 (10MGA73) Thank you for checking this! I moved the file to retain edit history and to test if the thumb would work on Commons (there are a number of similar files... [19:44:31] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Broken thumb and can't move file - https://phabricator.wikimedia.org/T386943#10569176 (10MGA73) 05Open→03Resolved a:03MGA73 [19:46:34] All quiet; I'm going to deploy a service update. 
[19:46:51] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-02-19-134350 to 2025-02-20-140756 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121380 (https://phabricator.wikimedia.org/T383448) (owner: 10Jforrester) [19:48:22] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-02-19-134350 to 2025-02-20-140756 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121380 (https://phabricator.wikimedia.org/T383448) (owner: 10Jforrester) [19:49:36] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:50:09] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:50:51] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 18:00:00 on netflow1002.eqiad.wmnet with reason: keeping gnmic running in debug mode to observe performance change [19:51:40] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [19:52:26] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [19:52:31] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [19:53:22] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [19:54:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:54:55] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2025-02-19-135838 to 2025-02-20-142923 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121381 (https://phabricator.wikimedia.org/T383448) (owner: 10Jforrester) [19:56:09] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2025-02-19-135838 to 2025-02-20-142923 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1121381 (https://phabricator.wikimedia.org/T383448) (owner: 10Jforrester) [20:01:25] (03PS1) 10Eevans: aqs2001: upgrade to Cassandra 4.1.8 (canary) [puppet] - 10https://gerrit.wikimedia.org/r/1121428 (https://phabricator.wikimedia.org/T386969) [20:02:09] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121428 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [20:07:32] (03CR) 10Eevans: [C:03+2] aqs2001: upgrade to Cassandra 4.1.8 (canary) [puppet] - 10https://gerrit.wikimedia.org/r/1121428 (https://phabricator.wikimedia.org/T386969) (owner: 10Eevans) [20:09:35] (03PS1) 10Aklapper: Remove an unused array variable [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121429 [20:10:36] (03CR) 10Aklapper: [V:03+2 C:03+2] Remove an unused array variable [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121429 (owner: 10Aklapper) [20:10:54] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [20:10:58] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [20:11:30] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs2001.codfw.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002 [20:18:55] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs2001.codfw.wmnet: Upgrading to Cassandra 4.1.8 (canary) — T385819 - eevans@cumin1002 [20:24:40] (03PS1) 10Ahmon Dancy: Use buildkit:wmf-v0.20.0-2 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1121432 (https://phabricator.wikimedia.org/T386955) [20:42:37] (03PS1) 10Aklapper: Rename $editScore to $transaction_score [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121435 [20:43:20] RECOVERY - 
MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:43:36] (03CR) 10Aklapper: [V:03+2 C:03+2] Rename $editScore to $transaction_score [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121435 (owner: 10Aklapper) [20:46:16] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1121324 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [20:48:30] (03PS15) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) [20:49:08] (03CR) 10Dzahn: [C:03+2] Use buildkit:wmf-v0.20.0-2 on WMCS and trusted runners [puppet] - 10https://gerrit.wikimedia.org/r/1121432 (https://phabricator.wikimedia.org/T386955) (owner: 10Ahmon Dancy) [20:49:16] (03CR) 10Ahmon Dancy: profile::scap::spiderpig: New profile for setting up SpiderPig (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [20:51:46] (03CR) 10Dzahn: [C:03+2] admin: upgrade arthurtaylor from restricted to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1121088 (https://phabricator.wikimedia.org/T386349) (owner: 10Dzahn) [20:52:16] !log welcome new deployer Arthur Taylor (T386349) [20:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:26] (03CR) 10Jforrester: [C:04-1] zhwiki: Create abusefilter editor group on zhwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: 10ZhaoFJx) [20:54:10] (03PS1) 10Cathal Mooney: Update policy for K8s BGP to allow a wider range of v4 prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/1121438 (https://phabricator.wikimedia.org/T375845) [20:54:20] !log logmsgbot: are you logging? 
[20:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:46] but not to phab tickets anymore [20:58:16] mutante: Maybe the token expired for the bot? [20:58:30] ISTR that was an issue with a different tool this week. [20:59:22] (03CR) 10ZhaoFJx: zhwiki: Create abusefilter editor group on zhwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: 10ZhaoFJx) [20:59:27] sounds like a possibility, ack [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T2100). Please do the needful. [21:00:05] James_F and ZhaoFJx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:11] I can deploy. [21:00:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10569335 (10Dzahn) 05In progress→03Resolved a:03Dzahn ` 20:52 < mutante> !log welcome new deployer Arthur Taylor (T386349) ` ` [deploy1003:~] $ id arthur... 
[21:00:33] (03CR) 10Jforrester: [C:03+2] Re-update function-schemata sub-module to HEAD (39b22ad) [extensions/WikiLambda] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121366 (owner: 10Jforrester) [21:01:37] Thanks [21:02:19] (03PS2) 10Jforrester: cowikimedia: Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [21:02:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [21:03:20] (03Merged) 10jenkins-bot: cowikimedia: Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121405 (https://phabricator.wikimedia.org/T386872) (owner: 10ZhaoFJx) [21:03:40] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1121405|cowikimedia: Change the wordmark (T386872)]] [21:03:44] T386872: Requesting logo change for co.wikimedia.org - https://phabricator.wikimedia.org/T386872 [21:05:00] (03Merged) 10jenkins-bot: Re-update function-schemata sub-module to HEAD (39b22ad) [extensions/WikiLambda] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121366 (owner: 10Jforrester) [21:06:24] !log jforrester@deploy2002 jforrester, zhaofjx: Backport for [[gerrit:1121405|cowikimedia: Change the wordmark (T386872)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:48] ZhaoFJx: Deployed and it looks "fine" assuming that's what they wanted – can you confirm? [21:07:41] not good on my side (k8s-mwdebug) [21:07:54] What's wrong from your end? [21:08:35] there are two wikimedia logos [21:08:36] The logo duplication? [21:08:42] yep [21:08:51] Yes, is that not what they wanted? 
[21:09:03] (03PS2) 10ZhaoFJx: zhwiki: Create abusefilter editor group on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) [21:09:19] I believe no [21:09:26] * James_F sighs. [21:09:32] alas [21:09:38] OK, we can revert and they can say what file they /actually/ want? [21:09:46] !log jforrester@deploy2002 Sync cancelled. [21:10:03] I will ask them on phabricator [21:10:07] (03PS1) 10Jforrester: Revert "cowikimedia: Change the wordmark" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121440 [21:10:10] Thanks! [21:10:15] (03CR) 10Jforrester: [C:03+2] Revert "cowikimedia: Change the wordmark" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121440 (owner: 10Jforrester) [21:10:42] Could you also check the zhwiki one? I just updated the patchset [21:10:43] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1121395 [21:10:54] (03Merged) 10jenkins-bot: Revert "cowikimedia: Change the wordmark" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121440 (owner: 10Jforrester) [21:10:55] Yeah, looks good, will deploy now. [21:11:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: 10ZhaoFJx) [21:11:42] (03Merged) 10jenkins-bot: zhwiki: Create abusefilter editor group on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121395 (https://phabricator.wikimedia.org/T386879) (owner: 10ZhaoFJx) [21:11:57] (03CR) 10Cathal Mooney: "LGTM in general, one question in line." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: 10BCornwall) [21:12:13] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1121395|zhwiki: Create abusefilter editor group on zhwiki (T386879)]] [21:12:17] T386879: Create abusefilter editor group on zhwiki - https://phabricator.wikimedia.org/T386879 [21:14:54] !log jforrester@deploy2002 jforrester, zhaofjx: Backport for [[gerrit:1121395|zhwiki: Create abusefilter editor group on zhwiki (T386879)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:15:24] All good! [21:15:25] ZhaoFJx: How's that look for you on debug? [21:15:27] Excellent. [21:15:29] !log jforrester@deploy2002 jforrester, zhaofjx: Continuing with sync [21:16:06] https://zh.wikipedia.org/wiki/Wikipedia:Abusefilter-editor should get filled in at some point. :-) [21:16:34] And I will call them for i18n soon [21:16:38] thanks for mention [21:16:39] Brilliant. [21:17:57] 10ops-codfw, 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10569355 (10bking) 05Open→03In progress p:05Triage→03Medium a:03bking [21:20:11] 10ops-codfw, 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.02.10 - 2025.02.28): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10569364 (10bking) Hello DC Ops, I've created [[ https://docs.google.com/spreadsheets/d/1DfzoKJM... [21:22:08] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121395|zhwiki: Create abusefilter editor group on zhwiki (T386879)]] (duration: 09m 54s) [21:22:12] T386879: Create abusefilter editor group on zhwiki - https://phabricator.wikimedia.org/T386879 [21:22:23] ZhaoFJx: All done for you, I think? Sorry that the first one didn't work out. 
[21:22:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432) (owner: 10Jforrester) [21:22:47] (03PS1) 10RLazarus: deployment_server: Support multiple Kubernetes configs in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1121443 (https://phabricator.wikimedia.org/T378429) [21:23:30] (03Merged) 10jenkins-bot: [wikifunctionswiki] Give wikilambda-bypass-cache to staff [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121385 (https://phabricator.wikimedia.org/T379432) (owner: 10Jforrester) [21:23:37] Yep [21:23:41] Thanks for deployment [21:23:49] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1121385|[wikifunctionswiki] Give wikilambda-bypass-cache to staff (T379432)]] [21:23:52] James_F have a good one [21:23:53] T379432: Create a way to temporarily bypass the results cache on production - https://phabricator.wikimedia.org/T379432 [21:23:57] ZhaoFJx: And you! 
[21:24:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 12.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:26:34] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1121385|[wikifunctionswiki] Give wikilambda-bypass-cache to staff (T379432)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:26:52] !log jforrester@deploy2002 jforrester: Continuing with sync
[21:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:33:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:33:24] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121385|[wikifunctionswiki] Give wikilambda-bypass-cache to staff (T379432)]] (duration: 09m 34s)
[21:33:28] T379432: Create a way to temporarily bypass the results cache on production - https://phabricator.wikimedia.org/T379432
[21:39:15] (PS1) Aklapper: Do not lower score when setting customfield [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121446
[21:40:11] (CR) Aklapper: [V:+2 C:+2] Do not lower score when setting customfield [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121446 (owner: Aklapper)
[21:43:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[21:43:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1103:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1103 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:48:01] (PS1) Aklapper: Sort recent user transactions by newest first [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121447
[21:48:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1103:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1103 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:49:05] (CR) Aklapper: [V:+2 C:+2] Sort recent user transactions by newest first [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121447 (owner: Aklapper)
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250220T2200)
[22:06:21] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919
[22:06:25] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919
[22:07:35] (PS1) Ebrahim: Improve Persian Wikipedia's tagline and wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1121449
[22:11:06] (PS1) Jdrewniak: Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1121450 (https://phabricator.wikimedia.org/T386495)
[22:13:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[22:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3347 MB (3% inode=98%): /tmp 3347 MB (3% inode=98%): /var/tmp 3347 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:15:29] (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1121450 (https://phabricator.wikimedia.org/T386495) (owner: Jdrewniak)
[22:21:53] ops-codfw, ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.02.10 - 2025.02.28): Enable CPU performance governor on Relforge, Cloudelastic, and Elasticsearch hosts - https://phabricator.wikimedia.org/T386860#10569604 (bking)
[22:22:28] (Merged) jenkins-bot: Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds [extensions/WikimediaEvents] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1121450 (https://phabricator.wikimedia.org/T386495) (owner: Jdrewniak)
[22:22:45] !log jdrewniak@deploy2002 Started scap sync-world: Backport for [[gerrit:1121450|Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds (T386495)]]
[22:22:48] T386495: Fix session tick mixin relating to when events fire - https://phabricator.wikimedia.org/T386495
[22:25:25] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:1121450|Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds (T386495)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:28:16] (PS1) RLazarus: deployment_server: Read mwscript-k8s MW image from values, not kube API [puppet] - https://gerrit.wikimedia.org/r/1121455 (https://phabricator.wikimedia.org/T378429)
[22:30:02] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync
[22:36:37] !log jdrewniak@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121450|Fix 0 tick not firing for session length mixin, and ensure ticks happen every 30 seconds (T386495)]] (duration: 13m 52s)
[22:36:41] T386495: Fix session tick mixin relating to when events fire - https://phabricator.wikimedia.org/T386495
[22:40:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[22:45:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow
[22:51:41] (PS2) Ebrahim: Improve Persian Wikipedia's tagline and wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1121449
[23:03:02] RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1005 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:03:02] RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1007 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:11:38] RECOVERY - ElasticSearch unassigned shard check - 9200 on relforge1006 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[23:20:17] (PS1) Aklapper: Penalize removing all subscribers and edges [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121461
[23:25:58] (CR) Aklapper: [V:+2 C:+2] Penalize removing all subscribers and edges [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121461 (owner: Aklapper)
[23:33:06] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[23:33:14] PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:33:42] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100%
[23:35:14] RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:35:14] (PS1) Aklapper: Differentiate more on account age [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121464
[23:35:52] (CR) Aklapper: [V:+2 C:+2] Differentiate more on account age [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1121464 (owner: Aklapper)
[23:49:08] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 86.41 ms
[23:53:44] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.46 ms
[23:55:27] (CR) Scott French: [C:+1] "Thanks, Reuven!" [puppet] - https://gerrit.wikimedia.org/r/1121443 (https://phabricator.wikimedia.org/T378429) (owner: RLazarus)