[00:10:35] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:10:57] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 649.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:24:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10562012 (phaultfinder)
[00:27:04] (CR) Scott French: "Thanks, Riccardo!" [software/spicerack] - https://gerrit.wikimedia.org/r/1120648 (https://phabricator.wikimedia.org/T383324) (owner: Scott French)
[00:29:32] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:35] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:33:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1149:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1149 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:38:34] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1120682
[00:38:34] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1120682 (owner: TrainBranchBot)
[00:39:10] (PS1) Andrew Bogott: vendordata.txt: include rudimentary clouds.yaml in initial VM [puppet] - https://gerrit.wikimedia.org/r/1120683 (https://phabricator.wikimedia.org/T379030)
[00:39:11] (PS1) Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030)
[00:39:25] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[00:45:04] (PS1) Jdlrobson: Remove init event from Search AB test and also remove ABTestEnrollment.js. [extensions/WikimediaEvents] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120685 (https://phabricator.wikimedia.org/T386243)
[00:45:28] (PS2) Jdlrobson: Remove init event from Search AB test and also remove ABTestEnrollment.js. [extensions/WikimediaEvents] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120685 (https://phabricator.wikimedia.org/T386734)
[00:46:02] (CR) Brennen Bearnes: [C:+1] "Yep, definitely in favor at this point, filters out expected noise." [puppet] - https://gerrit.wikimedia.org/r/1056221 (https://phabricator.wikimedia.org/T371633) (owner: Ahmon Dancy)
[00:49:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1120682 (owner: TrainBranchBot)
[00:59:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10562048 (phaultfinder)
[01:08:39] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1120686
[01:08:39] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1120686 (owner: TrainBranchBot)
[01:19:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10562062 (phaultfinder)
[01:23:44] (PS2) Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030)
[01:23:55] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[01:29:56] (PS1) Zabe: Prepare satwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1120688 (https://phabricator.wikimedia.org/T386619)
[01:30:42] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1120686 (owner: TrainBranchBot)
[01:37:16] (CR) Zabe: [C:+2] Prepare satwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1120688 (https://phabricator.wikimedia.org/T386619) (owner: Zabe)
[01:37:59] (Merged) jenkins-bot: Prepare satwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1120688 (https://phabricator.wikimedia.org/T386619) (owner: Zabe)
[01:41:27] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1120688|Prepare satwiktionary (T386619)]]
[01:41:30] T386619: Create Wiktionary Santali - https://phabricator.wikimedia.org/T386619
[01:44:00] (PS3) Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030)
[01:44:23] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[01:44:27] !log zabe@deploy2002 zabe: Backport for [[gerrit:1120688|Prepare satwiktionary (T386619)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:44:35] !log zabe@deploy2002 zabe: Continuing with sync
[01:51:12] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120688|Prepare satwiktionary (T386619)]] (duration: 09m 45s)
[01:51:17] T386619: Create Wiktionary Santali - https://phabricator.wikimedia.org/T386619
[01:52:18] (PS1) Zabe: Increase revision-slots cache expiry back to default for 3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1120689 (https://phabricator.wikimedia.org/T183490)
[01:54:31] (CR) Zabe: [C:+2] Increase revision-slots cache expiry back to default for 3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1120689 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[01:55:11] (Merged) jenkins-bot: Increase revision-slots cache expiry back to default for 3 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1120689 (https://phabricator.wikimedia.org/T183490) (owner: Zabe)
[01:55:16] (PS1) Zabe: Activate satwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1120690 (https://phabricator.wikimedia.org/T386619)
[01:55:33] (CR) Zabe: [C:+2] Activate satwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1120690 (https://phabricator.wikimedia.org/T386619) (owner: Zabe)
[01:56:16] (Merged) jenkins-bot: Activate satwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/1120690 (https://phabricator.wikimedia.org/T386619) (owner: Zabe)
[01:56:56] !log zabe@deploy2002 Started scap sync-world: Backport for [[gerrit:1120690|Activate satwiktionary (T386619)]], [[gerrit:1120689|Increase revision-slots cache expiry back to default for 3 wikis (T183490)]]
[01:57:02] T386619: Create Wiktionary Santali - https://phabricator.wikimedia.org/T386619
[01:57:02] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[01:59:56] !log zabe@deploy2002 zabe: Backport for [[gerrit:1120690|Activate satwiktionary (T386619)]], [[gerrit:1120689|Increase revision-slots cache expiry back to default for 3 wikis (T183490)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[02:00:56] (PS4) Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030)
[02:01:06] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[02:01:15] !log zabe@deploy2002 zabe: Continuing with sync
[02:05:07] (CR) Scott French: dbctl: pass DbCtlConfiguration to DbConfig (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1120648 (https://phabricator.wikimedia.org/T383324) (owner: Scott French)
[02:07:52] !log zabe@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120690|Activate satwiktionary (T386619)]], [[gerrit:1120689|Increase revision-slots cache expiry back to default for 3 wikis (T183490)]] (duration: 10m 55s)
[02:07:58] T386619: Create Wiktionary Santali - https://phabricator.wikimedia.org/T386619
[02:07:58] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[02:08:57] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:10:48] (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1120691 (https://phabricator.wikimedia.org/T386619)
[02:10:49] (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1120691 (https://phabricator.wikimedia.org/T386619) (owner: Zabe)
[02:11:32] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1120691 (https://phabricator.wikimedia.org/T386619) (owner: Zabe)
[02:11:50] !log zabe@deploy2002 Started scap sync-world: T386619
[02:15:22] (PS5) Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030)
[02:15:30] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) (owner: Andrew Bogott)
[02:16:31] RECOVERY - Host ms-be2075 is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms
[02:21:35] !log zabe@deploy2002 Finished scap sync-world: T386619 (duration: 09m 44s)
[02:21:39] T386619: Create Wiktionary Santali - https://phabricator.wikimedia.org/T386619
[02:22:55] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100%
[02:27:10] (PS6) Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030)
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:48:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:53:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:36:14] (CR) Subramanya Sastry: [C:+1] Turn on Parsoid Read Views for 31 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1120679 (https://phabricator.wikimedia.org/T386762) (owner: Arlolra)
[03:36:50] (PS1) RLazarus: deployment_server: Refactor some utility functions into a Job class [puppet] - https://gerrit.wikimedia.org/r/1120700
[04:32:21] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:12:25] (CR) Jgiannelos: [C:+1] Bust cache for recreated pages [deployment-charts] - https://gerrit.wikimedia.org/r/1118890 (https://phabricator.wikimedia.org/T386244) (owner: Arlolra)
[05:32:37] PROBLEM - Hadoop NodeManager on an-worker1158 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:42:59] (PS1) KartikMistry: Update cxserver to 2025-02-14-191041-production [deployment-charts] - https://gerrit.wikimedia.org/r/1120709 (https://phabricator.wikimedia.org/T386464)
[05:44:37] RECOVERY - Hadoop NodeManager on an-worker1158 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:02:52] Deploying MinT. Staging first.
[06:06:37] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:12:23] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:27:21] FIRING: [4x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:40] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[06:37:34] (CR) Aklapper: "Hmm, is the timer still in place? Wondering as I still receive this email..." [puppet] - https://gerrit.wikimedia.org/r/1117489 (https://phabricator.wikimedia.org/T304792) (owner: Aklapper)
[06:38:09] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[06:46:48] SRE, Wikimedia-Incident: 503 Service Unavailable on all production - https://phabricator.wikimedia.org/T386740#10562253 (Iniquity) >>! In T386740#10561252, @ssingh wrote: >>>! In T386740#10561043, @Iniquity wrote: >> I want to know for the future, this is not the first time I have reported about "Service...
[06:50:05] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:52:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1023.eqiad.wmnet with OS bookworm
[06:52:30] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10562254 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS bookworm
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T0700)
[07:05:58] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[07:12:09] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10562275 (MoritzMuehlenhoff)
[07:12:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1023.eqiad.wmnet with reason: host reimage
[07:16:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1023.eqiad.wmnet with reason: host reimage
[07:17:30] !log Updated MinT to 2025-02-05-115716-production (T383750, T385552)
[07:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:34] T383750: MinT: Fails to download models/files from peopleweb.discovery.wmnet - https://phabricator.wikimedia.org/T383750
[07:17:34] T385552: MinT: Add support for Obolo, Central Dusun, Iban and, South Ndebele - https://phabricator.wikimedia.org/T385552
[07:20:33] (PS2) Michael Große: testwiki: enable surfacing structured task experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/1120646 (https://phabricator.wikimedia.org/T386739)
[07:24:03] !log upload haproxy 2.8.14 to apt.wm.o (bullseye-wikimedia) - T386751
[07:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:07] T386751: update haproxy to version 2.8.14 - https://phabricator.wikimedia.org/T386751
[07:29:44] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4052.ulsfo.wmnet,cp4044.ulsfo.wmnet} and A:cp
[07:34:24] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4052.ulsfo.wmnet,cp4044.ulsfo.wmnet} and A:cp
[07:37:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1023.eqiad.wmnet with OS bookworm
[07:38:01] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10562320 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS bookworm completed: - ganeti102...
[07:39:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet
[07:39:27] (PS2) Anzx: satwiktionary: add sitename, timezone, projectnamespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1120701 (https://phabricator.wikimedia.org/T386631)
[07:39:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120701 (https://phabricator.wikimedia.org/T386631) (owner: Anzx)
[07:39:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120646 (https://phabricator.wikimedia.org/T386739) (owner: Michael Große)
[07:40:11] (PS2) Anzx: uzwikiquote: add logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1120699 (https://phabricator.wikimedia.org/T386569)
[07:40:34] (CR) Urbanecm: [C:+1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120646 (https://phabricator.wikimedia.org/T386739) (owner: Michael Große)
[07:40:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120699 (https://phabricator.wikimedia.org/T386569) (owner: Anzx)
[07:47:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1033.eqiad.wmnet
[07:49:42] !log installing openjdk-11 security updates
[07:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1023.eqiad.wmnet
[07:53:22] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and not P{cp4052.*} and A:cp
[07:53:48] (PS2) Anzx: madwiki: add namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/1120888 (https://phabricator.wikimedia.org/T382087)
[07:53:59] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120888 (https://phabricator.wikimedia.org/T382087) (owner: Anzx)
[07:54:49] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and not P{cp4044.*} and A:cp
[07:55:54] (PS1) Arnaudb: ferm: remove moscovium from allowlist [puppet] - https://gerrit.wikimedia.org/r/1120889 (https://phabricator.wikimedia.org/T385777)
[07:59:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1023.eqiad.wmnet
[08:00:05] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T0800).
[08:00:05] anzx and MichaelG_WMF: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:17] o/
[08:00:20] o/
[08:03:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1023.eqiad.wmnet to cluster eqiad and group A
[08:03:44] (PS1) Elukey: role::kubernetes::worker: add kartotherian-k8s-ssl to the lvs pools [puppet] - https://gerrit.wikimedia.org/r/1120893 (https://phabricator.wikimedia.org/T386648)
[08:04:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1023.eqiad.wmnet to cluster eqiad and group A
[08:04:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1033.eqiad.wmnet to cluster eqiad and group D
[08:07:01] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4953/co" [puppet] - https://gerrit.wikimedia.org/r/1120893 (https://phabricator.wikimedia.org/T386648) (owner: Elukey)
[08:07:11] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:07:39] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:07:41] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:08:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1033.eqiad.wmnet to cluster eqiad and group D
[08:11:21] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and not P{cp4052.*} and A:cp
[08:15:27] jouncebot: nowandnext
[08:15:27] For the next 0 hour(s) and 44 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T0800)
[08:15:27] In 0 hour(s) and 44 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T0900)
[08:15:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and not P{cp4044.*} and A:cp
[08:18:20] (PS1) DCausse: Do not update the search index if the assessment did not change [extensions/PageAssessments] (wmf/1.44.0-wmf.16) - https://gerrit.wikimedia.org/r/1120895
[08:18:32] (PS1) DCausse: Do not update the search index if the assessment did not change [extensions/PageAssessments] (wmf/1.44.0-wmf.17) - https://gerrit.wikimedia.org/r/1120896
[08:19:10] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru and A:cp
[08:19:27] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru and A:cp
[08:22:34] anzx, MichaelG_WMF do you have a deployer?
[08:22:56] no, not yet.
[08:23:47] If you could do it, that would be great. Though I could also wait to the next window if necessary (but that is usually quite full)
[08:23:58] I think I can deploy
[08:24:08] anzx: are you still around?
[08:24:13] YaY, thank you dcausse!
[08:24:29] dcausse: yes I am around
[08:25:06] anzx: can I ship all your 3 config changes at once or would you prefer to test them individually?
[08:25:10] (Abandoned) Brouberol: opensearch: include the minor version in the apt component name [puppet] - https://gerrit.wikimedia.org/r/1120140 (https://phabricator.wikimedia.org/T380752) (owner: Brouberol)
[08:25:19] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10562405 (MoritzMuehlenhoff)
[08:25:49] dcausse: ship all at once
[08:25:55] ack
[08:27:16] (CR) DCausse: [C:+1] satwiktionary: add sitename, timezone, projectnamespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1120701 (https://phabricator.wikimedia.org/T386631) (owner: Anzx)
[08:28:56] (CR) DCausse: [C:+1] uzwikiquote: add logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1120699 (https://phabricator.wikimedia.org/T386569) (owner: Anzx)
[08:29:17] (PS1) Elukey: role::etcd::v3::aux_k8s_etcd: remove backups [puppet] - https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727)
[08:30:03] (CR) DCausse: [C:+1] madwiki: add namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/1120888 (https://phabricator.wikimedia.org/T382087) (owner: Anzx)
[08:30:34] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4954/co" [puppet] - https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727) (owner: Elukey)
[08:32:19] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120701 (https://phabricator.wikimedia.org/T386631) (owner: Anzx)
[08:32:20] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120888 (https://phabricator.wikimedia.org/T382087) (owner: Anzx)
[08:32:20] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1120699 (https://phabricator.wikimedia.org/T386569) (owner: Anzx)
[08:33:05] (Merged) jenkins-bot: satwiktionary: add sitename, timezone, projectnamespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1120701 (https://phabricator.wikimedia.org/T386631) (owner: Anzx)
[08:33:08] (Merged) jenkins-bot: madwiki: add namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/1120888 (https://phabricator.wikimedia.org/T382087) (owner: Anzx)
[08:33:10] (Merged) jenkins-bot: uzwikiquote: add logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1120699 (https://phabricator.wikimedia.org/T386569) (owner: Anzx)
[08:33:53] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1120701|satwiktionary: add sitename, timezone, projectnamespace (T386631)]], [[gerrit:1120888|madwiki: add namespace aliases (T382087)]], [[gerrit:1120699|uzwikiquote: add logos (T386569)]]
[08:33:59] T386631: Post-creation work for satwiktionary - https://phabricator.wikimedia.org/T386631
[08:33:59] T382087: Add Indonesian language fallback aliases for Namespaces in Madurese - https://phabricator.wikimedia.org/T382087
[08:34:00] T386569: Proposed Revisions to the Uzbek Wikiquote Logo - https://phabricator.wikimedia.org/T386569
[08:34:37] (PS1) Brouberol: cirrus: add the opensearch.motd file [puppet] - https://gerrit.wikimedia.org/r/1120900 (https://phabricator.wikimedia.org/T380752)
[08:37:00] !log dcausse@deploy2002 dcausse, anzx: Backport for [[gerrit:1120701|satwiktionary: add sitename, timezone, projectnamespace (T386631)]], [[gerrit:1120888|madwiki: add namespace aliases (T382087)]], [[gerrit:1120699|uzwikiquote: add logos (T386569)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:37:05] dcausse: checking
[08:37:09] (PS2) Elukey: role::etcd::v3::aux_k8s_etcd: remove backups [puppet] - https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727)
[08:37:30] (CR) CI reject: [V:-1] role::etcd::v3::aux_k8s_etcd: remove backups [puppet] - https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727) (owner: Elukey)
[08:37:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.decommission for hosts moscovium.eqiad.wmnet
[08:38:23] (CR) Brouberol: [C:+2] cirrus: add the opensearch.motd file [puppet] - https://gerrit.wikimedia.org/r/1120900 (https://phabricator.wikimedia.org/T380752) (owner: Brouberol)
[08:39:03] (PS3) Elukey: role::etcd::v3::aux_k8s_etcd: remove backups [puppet] - https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727)
[08:39:56] dcausse: all patches looks good
[08:40:07] anzx: ack, deploying
[08:40:13] !log dcausse@deploy2002 dcausse, anzx: Continuing with sync
[08:40:15] !log upgrading haproxykafka package on apt repo to 0.3.5 (T374128)
[08:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:18] T374128: haproxykafka features - https://phabricator.wikimedia.org/T374128
[08:40:26] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10562456 (MoritzMuehlenhoff)
[08:40:37] (PS1) Arnaudb: rt: remove cname [dns] - https://gerrit.wikimedia.org/r/1120901 (https://phabricator.wikimedia.org/T385777)
[08:41:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet
[08:41:08] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4956/co" [puppet] - https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727) (owner: Elukey)
[08:41:20] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10562457 (ops-monitoring-bot) Draining ganeti1025.eqiad.wmnet of running VMs
[08:41:27] (CR) Kamila Součková: [C:+1] role::kubernetes::worker: add kartotherian-k8s-ssl to the lvs pools [puppet] - https://gerrit.wikimedia.org/r/1120893 (https://phabricator.wikimedia.org/T386648) (owner: Elukey)
[08:41:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet
[08:41:51] (CR) Elukey: [V:+1 C:-1] "WIP sorry" [puppet] - https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727) (owner: Elukey)
[08:42:12] (PS1) Muehlenhoff: Switch ganeti1025 to nftables [puppet] - https://gerrit.wikimedia.org/r/1120902
[08:42:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1025.eqiad.wmnet
[08:42:30] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10562458 (ops-monitoring-bot) Draining ganeti1025.eqiad.wmnet of running VMs
[08:42:35] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru and A:cp
[08:42:52] !log arnaudb@cumin1002 START - Cookbook sre.dns.netbox
[08:45:21] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru and A:cp
[08:45:36] !log upgrading haproxykafka to 0.3.5 on cp4037 to test new feature (T374128)
[08:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:40] T374128: haproxykafka features - https://phabricator.wikimedia.org/T374128
[08:46:25] (PS4) Elukey: role::etcd::v3::aux_k8s_etcd: remove backups [puppet] - https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727)
[08:46:30] !log arnaudb@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moscovium.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002"
[08:46:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: moscovium.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1002"
[08:46:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:46:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts moscovium.eqiad.wmnet
[08:46:54] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120701|satwiktionary: add sitename, timezone, projectnamespace (T386631)]], [[gerrit:1120888|madwiki: add namespace aliases (T382087)]], [[gerrit:1120699|uzwikiquote: add logos (T386569)]] (duration: 13m 00s)
[08:47:02] T386631: Post-creation work for satwiktionary - https://phabricator.wikimedia.org/T386631
[08:47:02] T382087: Add Indonesian language fallback aliases for Namespaces in Madurese - https://phabricator.wikimedia.org/T382087
[08:47:02] T386569: Proposed Revisions to the Uzbek Wikiquote Logo - https://phabricator.wikimedia.org/T386569
[08:47:07] (CR) Elukey: "reverted back to a simpler patch, I'll do the clean up manually, a more generic code needs some thinking to avoid polluting too many nodes" [puppet] - https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727) (owner: Elukey)
[08:47:09] anzx: should be live
[08:47:13] dcausse: thank you
[08:47:18] yw! :)
[08:47:50] MichaelG_WMF: going to ship your patch, are you still around?
[08:47:55] yes [08:47:58] ack [08:48:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120646 (https://phabricator.wikimedia.org/T386739) (owner: 10Michael Große) [08:49:20] (03PS1) 10Brouberol: opensearcgh:cirrus: include the diffie-hellman parameter file [puppet] - 10https://gerrit.wikimedia.org/r/1120903 (https://phabricator.wikimedia.org/T380752) [08:50:05] (03Merged) 10jenkins-bot: testwiki: enable surfacing structured task experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120646 (https://phabricator.wikimedia.org/T386739) (owner: 10Michael Große) [08:50:31] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1120646|testwiki: enable surfacing structured task experiment (T386739)]] [08:50:35] T386739: Surfacing "Add a link" Structured Tasks: Test Wikipedia Release - https://phabricator.wikimedia.org/T386739 [08:52:09] (03CR) 10Elukey: "Thanks for working on this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1120602 (https://phabricator.wikimedia.org/T385727) (owner: 10Herron) [08:52:37] (03CR) 10Elukey: [V:03+1 C:03+2] role::kubernetes::worker: add kartotherian-k8s-ssl to the lvs pools [puppet] - 10https://gerrit.wikimedia.org/r/1120893 (https://phabricator.wikimedia.org/T386648) (owner: 10Elukey) [08:53:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727) (owner: 10Elukey) [08:53:23] !log dcausse@deploy2002 migr, dcausse: Backport for [[gerrit:1120646|testwiki: enable surfacing structured task experiment (T386739)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:53:36] (03CR) 10Elukey: [C:03+2] role::etcd::v3::aux_k8s_etcd: remove backups [puppet] - 10https://gerrit.wikimedia.org/r/1120899 (https://phabricator.wikimedia.org/T385727) (owner: 10Elukey) [08:53:36] * MichaelG_WMF is testing [08:54:19] @dcausse It works as expected, thank you! 
[08:54:27] (03CR) 10Brouberol: [C:03+2] opensearcgh:cirrus: include the diffie-hellman parameter file [puppet] - 10https://gerrit.wikimedia.org/r/1120903 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [08:54:37] MichaelG_WMF: cool, shipping then [08:54:39] !log dcausse@deploy2002 migr, dcausse: Continuing with sync [08:55:13] (03PS4) 10Anzx: knwiki, knwikisource, tcywikisource: add confirmed user usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120891 (https://phabricator.wikimedia.org/T386781) [08:55:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120891 (https://phabricator.wikimedia.org/T386781) (owner: 10Anzx) [08:56:30] jouncebot: next [08:56:31] In 0 hour(s) and 3 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T0900) [08:56:54] I'll have two backport to ship after this one, hope it's ok to take a bit of the mw train time [08:57:33] but please let me know if not [08:58:16] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp [08:58:26] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp [08:59:24] RECOVERY - Elasticsearch HTTPS for relforge-eqiad-small-alpha on relforge1004 is OK: SSL OK - Certificate relforge1004.eqiad.wmnet valid until 2025-03-12 19:54:00 +0000 (expires in 21 days) https://wikitech.wikimedia.org/wiki/Search [08:59:24] RECOVERY - Elasticsearch HTTPS for relforge-eqiad on relforge1004 is OK: SSL OK - Certificate relforge1004.eqiad.wmnet valid until 2025-03-12 19:54:00 +0000 (expires in 21 days) https://wikitech.wikimedia.org/wiki/Search [09:00:05] dancy and andre: I, the Bot under the 
Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T0900). [09:00:36] dancy, andre the backport window is running a bit late, sorry about that [09:01:07] dcausse: we don't run the train for the next 10 hours, no problem :) [09:01:12] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120646|testwiki: enable surfacing structured task experiment (T386739)]] (duration: 10m 41s) [09:01:13] ah [09:01:16] T386739: Surfacing "Add a link" Structured Tasks: Test Wikipedia Release - https://phabricator.wikimedia.org/T386739 [09:01:18] andre: good for me :) [09:01:22] hehe [09:02:09] MichaelG_WMF: should be live [09:02:20] !log elukey@cumin1002:~$ sudo cumin --m async 'aux-k8s-etcd*' 'systemctl stop etcd-backup.timer etcd-backup.service' 'rm /lib/systemd/system/etcd-backup.service /lib/systemd/system/etcd-backup.timer' 'systemctl daemon-reload' - T385727 [09:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:24] T385727: etcd: adapt etcd-backup.py for etcd 3.4 - https://phabricator.wikimedia.org/T385727 [09:02:32] @dcausse Thanks! 🙏 [09:02:36] yw! 
:) [09:02:57] extending the backport window a bit to ship two more patches [09:04:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [extensions/PageAssessments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120895 (owner: 10DCausse) [09:04:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [extensions/PageAssessments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120896 (owner: 10DCausse) [09:04:32] RESOLVED: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:05:46] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:05:46] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:05:48] (03PS1) 10Brouberol: opensearch:cirrus: install curator for opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1120908 (https://phabricator.wikimedia.org/T380752) [09:06:12] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:07:14] (03Merged) 10jenkins-bot: Do not update the search index if the assessment did not change [extensions/PageAssessments] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120895 (owner: 10DCausse) [09:07:14] (03Merged) 10jenkins-bot: Do not update the search index if the assessment did not change [extensions/PageAssessments] (wmf/1.44.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/1120896 (owner: 10DCausse) [09:07:28] !log upgrading haproxykafka to 0.3.5 on ulsfo (T374128) [09:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:31] T374128: haproxykafka features - https://phabricator.wikimedia.org/T374128 [09:07:44] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1120895|Do not update the search index if the assessment did not change]], [[gerrit:1120896|Do not update the search index if the assessment did not change]] [09:08:36] (03CR) 10Brouberol: [C:03+2] opensearch:cirrus: install curator for opensearch [puppet] - 10https://gerrit.wikimedia.org/r/1120908 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [09:09:37] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-worker200.*.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [09:10:34] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-worker100.*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [09:10:39] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1120895|Do not update the search index if the assessment did not change]], [[gerrit:1120896|Do not update the search index if the assessment did not change]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:10:59] !log dcausse@deploy2002 dcausse: Continuing with sync [09:11:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.264s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:16:00] !log klausman@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: 
Enable Java security updates - klausman@cumin1002 [09:16:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.264s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:17:36] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120895|Do not update the search index if the assessment did not change]], [[gerrit:1120896|Do not update the search index if the assessment did not change]] (duration: 09m 51s) [09:18:32] !log closing the UTC morning backport window [09:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:25] (03PS1) 10Michael Große: Growth: increase minimum tasks per topic on idwiki; ruwiki => default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120904 (https://phabricator.wikimedia.org/T385343) [09:27:35] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp [09:28:04] (03PS1) 10Brouberol: opensearch:cirrus: pin elasticsearch-curator version [puppet] - 10https://gerrit.wikimedia.org/r/1120914 (https://phabricator.wikimedia.org/T380752) [09:28:25] (03CR) 10CI reject: [V:04-1] opensearch:cirrus: pin elasticsearch-curator version [puppet] - 10https://gerrit.wikimedia.org/r/1120914 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [09:28:31] (03PS2) 10Brouberol: opensearch:cirrus: pin elasticsearch-curator version [puppet] - 10https://gerrit.wikimedia.org/r/1120914 (https://phabricator.wikimedia.org/T380752) [09:28:51] (03CR) 10CI reject: [V:04-1] opensearch:cirrus: pin elasticsearch-curator version [puppet] - 10https://gerrit.wikimedia.org/r/1120914 
(https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [09:28:54] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp [09:29:19] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4957/co" [puppet] - 10https://gerrit.wikimedia.org/r/1120914 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [09:30:03] (03PS3) 10Brouberol: opensearch:cirrus: pin elasticsearch-curator version [puppet] - 10https://gerrit.wikimedia.org/r/1120914 (https://phabricator.wikimedia.org/T380752) [09:33:12] (03CR) 10Brouberol: [C:03+2] opensearch:cirrus: pin elasticsearch-curator version [puppet] - 10https://gerrit.wikimedia.org/r/1120914 (https://phabricator.wikimedia.org/T380752) (owner: 10Brouberol) [09:33:42] !log klausman@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Enable Java security updates - klausman@cumin1002 [09:33:58] !log klausman@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Enable Java security updates - klausman@cumin1002 [09:35:42] !log restart envoy/swift on ms-fe1013 [09:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:27] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp [09:36:33] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and A:cp [09:37:16] !log restart envoy/swift on ms-fe201[2-4] [09:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1150:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1150 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:40:50] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:40:50] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:43:32] (03PS1) 10Arnaudb: moscovium: remove from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1120917 (https://phabricator.wikimedia.org/T385777) [09:43:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1150:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1150 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:51:31] !log klausman@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Enable Java security updates - klausman@cumin1002 [09:58:15] (03CR) 10Muehlenhoff: [C:03+1] moscovium: remove from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1120917 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [09:58:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and A:cp [10:00:06] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and A:cp [10:00:44] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp [10:00:55] !log vgutierrez@cumin1002 START - Cookbook 
sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [10:09:51] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=30; selector: name=wikikube-worker1138.codfw.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:09:59] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=30; selector: name=wikikube-worker1138.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:10:32] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=30; selector: name=wikikube-worker1002.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:16:54] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [10:24:59] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp [10:30:15] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: move icmp checks under wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [10:30:36] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [10:31:10] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [10:31:21] (03CR) 10Arturo Borrero Gonzalez: "let me know how this looks @tfogli@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [10:32:36] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: move icmp checks under wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [10:32:47] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [10:33:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:33:09] (03PS1) 10Fabfur: haproxykafka: limit memory usage to 5% of total physical memory [puppet] - 10https://gerrit.wikimedia.org/r/1120922 (https://phabricator.wikimedia.org/T386753) [10:33:47] (03PS2) 10Fabfur: haproxykafka: limit memory usage to 5% of total physical memory [puppet] - 10https://gerrit.wikimedia.org/r/1120922 (https://phabricator.wikimedia.org/T386753) [10:34:01] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:36:16] (03CR) 10Urbanecm: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120904 (https://phabricator.wikimedia.org/T385343) (owner: 10Michael Große) [10:38:45] (03PS1) 10Filippo Giunchedi: o11y: promote thanos compact alerts to critical [alerts] - 10https://gerrit.wikimedia.org/r/1120923 [10:38:51] (03CR) 10CI reject: [V:04-1] o11y: promote thanos compact alerts to critical [alerts] - 10https://gerrit.wikimedia.org/r/1120923 (owner: 10Filippo Giunchedi) [10:39:29] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [10:39:40] (03CR) 10Filippo Giunchedi: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/1120923 (owner: 10Filippo Giunchedi) [10:40:33] (03CR) 10Michael Große: "T think, this also needs to set `wgGESurfacingStructuredTasksEnabled` to `true` for this wiki." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120505 (https://phabricator.wikimedia.org/T385343) (owner: 10Sergio Gimeno) [10:42:49] !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [10:43:01] (03PS1) 10Fabfur: hiera: reasonable message batches number [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) [10:43:50] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) (owner: 10Fabfur) [10:44:22] (03CR) 10Vgutierrez: [C:03+1] "commit message nitpick aside, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) (owner: 10Fabfur) [10:45:39] (03CR) 10Filippo Giunchedi: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/1120923 (owner: 10Filippo Giunchedi) [10:45:42] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2006.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [10:45:46] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [10:46:30] (03PS1) 10Urbanecm: [Growth] enwiki: Release Add Link to 15% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120925 (https://phabricator.wikimedia.org/T386029) [10:47:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120904 (https://phabricator.wikimedia.org/T385343) (owner: 10Michael Große) [10:48:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] 
(wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120618 (https://phabricator.wikimedia.org/T386490) (owner: 10Michael Große) [10:48:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120620 (https://phabricator.wikimedia.org/T386490) (owner: 10Michael Große) [10:48:19] (03PS1) 10Vgutierrez: aptrepo,haproxy: Allow installing HAProxy 1.3 on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1120926 (https://phabricator.wikimedia.org/T386796) [10:48:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120643 (owner: 10Michael Große) [10:48:39] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:52:24] !log elukey@puppetserver1001 conftool action : set/weight=10; selector: name=wikikube-worker1002.eqiad.wmnet.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:53:15] !log elukey@puppetserver1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker1002.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:53:59] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=maps1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [10:55:02] (03PS2) 10Fabfur: hiera: reasonable message batches number [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) [10:56:00] (03CR) 10Vgutierrez: haproxykafka: limit memory usage to 5% of total physical memory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120922 
(https://phabricator.wikimedia.org/T386753) (owner: 10Fabfur) [10:56:48] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) (owner: 10Fabfur) [10:57:56] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade network devices to Junos 20+ - https://phabricator.wikimedia.org/T316539#10562858 (10ayounsi) 05Open→03Resolved a:03ayounsi Nop, thanks for the ping. There is now {T364092} [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1100) [11:00:24] (03PS3) 10Fabfur: hiera, hpk: reasonable message batches number [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) [11:01:02] (03CR) 10Fabfur: hiera, hpk: reasonable message batches number (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) (owner: 10Fabfur) [11:01:09] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) (owner: 10Fabfur) [11:01:56] (03CR) 10Vgutierrez: [C:03+1] hiera, hpk: reasonable message batches number [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) (owner: 10Fabfur) [11:04:43] (03CR) 10Fabfur: [C:03+2] hiera, hpk: reasonable message batches number [puppet] - 10https://gerrit.wikimedia.org/r/1120924 (https://phabricator.wikimedia.org/T386753) (owner: 10Fabfur) [11:06:37] (03PS1) 10Ladsgroup: ChangeTagsStore: Lengthen cache times [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120928 (https://phabricator.wikimedia.org/T384921) [11:06:43] jouncebot: nowandnext [11:06:43] For the next 0 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1100) [11:06:43] In 0 hour(s) and 53 minute(s): Services – Citoid / Zotero 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1200) [11:07:13] (03PS1) 10Ladsgroup: ChangeTagsStore: Lengthen cache times [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120929 (https://phabricator.wikimedia.org/T384921) [11:07:18] (03CR) 10Ladsgroup: [C:03+2] ChangeTagsStore: Lengthen cache times [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120929 (https://phabricator.wikimedia.org/T384921) (owner: 10Ladsgroup) [11:07:22] (03CR) 10Ladsgroup: [C:03+2] ChangeTagsStore: Lengthen cache times [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120928 (https://phabricator.wikimedia.org/T384921) (owner: 10Ladsgroup) [11:09:14] !log upgrading haproxykafka to 0.3.5 on all DCs (T374128) [11:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:18] T374128: haproxykafka features - https://phabricator.wikimedia.org/T374128 [11:13:07] (03CR) 10Filippo Giunchedi: "Please take a look, modulo CI failures which are tracked at https://phabricator.wikimedia.org/T386784" [alerts] - 10https://gerrit.wikimedia.org/r/1120923 (owner: 10Filippo Giunchedi) [11:16:16] (03CR) 10Michael Große: [C:03+1] [Growth] enwiki: Release Add Link to 15% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120925 (https://phabricator.wikimedia.org/T386029) (owner: 10Urbanecm) [11:17:49] (03CR) 10CI reject: [V:04-1] ChangeTagsStore: Lengthen cache times [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120928 (https://phabricator.wikimedia.org/T384921) (owner: 10Ladsgroup) [11:18:38] (03Merged) 10jenkins-bot: ChangeTagsStore: Lengthen cache times [core] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120929 (https://phabricator.wikimedia.org/T384921) (owner: 10Ladsgroup) [11:19:03] (03Merged) 10jenkins-bot: ChangeTagsStore: Lengthen cache times [core] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120928 (https://phabricator.wikimedia.org/T384921) (owner: 
10Ladsgroup) [11:24:28] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1120928|ChangeTagsStore: Lengthen cache times (T384921)]], [[gerrit:1120929|ChangeTagsStore: Lengthen cache times (T384921)]] [11:24:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1025.eqiad.wmnet [11:24:32] T384921: OAuth oauth_registered_consumer table: Reads to the table are exceeding transaction profiler limits at a rate of ~4 per second - https://phabricator.wikimedia.org/T384921 [11:27:29] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1120928|ChangeTagsStore: Lengthen cache times (T384921)]], [[gerrit:1120929|ChangeTagsStore: Lengthen cache times (T384921)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:27:59] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:28:24] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [11:29:15] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1025.eqiad.wmnet with reason: remove from cluster for reimage [11:29:22] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10562969 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f7811566-34c7-44ec-b90f-ec261439dabd) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[11:29:59] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1025 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1120902 (owner: 10Muehlenhoff) [11:33:00] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1025.eqiad.wmnet [11:34:44] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120928|ChangeTagsStore: Lengthen cache times (T384921)]], [[gerrit:1120929|ChangeTagsStore: Lengthen cache times (T384921)]] (duration: 10m 16s) [11:34:48] T384921: OAuth oauth_registered_consumer table: Reads to the table are exceeding transaction profiler limits at a rate of ~4 per second - https://phabricator.wikimedia.org/T384921 [11:35:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:35:25] (03CR) 10Ladsgroup: [C:04-1] "We have to do this for important reasons, unfortunately we can't just fully revert it back. At least not yet. 
Can you give us list of para" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [11:40:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [11:43:25] (03CR) 10Andrew Bogott: [C:03+2] cloudgw1003: take over cloudgw1001 [puppet] - 10https://gerrit.wikimedia.org/r/1114997 (https://phabricator.wikimedia.org/T382356) (owner: 10Arturo Borrero Gonzalez) [11:44:09] OK to do deployment of MinT/machinetranslation service? [11:44:36] jouncebot: nowandnext [11:44:36] For the next 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1100) [11:44:36] In 0 hour(s) and 15 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1200) [11:45:03] lgtm? assuming Amir1 doesn't have any further deploys [11:45:14] I don't! [11:45:29] Cool. Thanks. 
Attempting :) [11:45:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [11:46:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10563016 (10ops-monitoring-bot) Draining ganeti1036.eqiad.wmnet of running VMs [11:46:28] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [11:46:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet [11:47:15] (03PS1) 10Muehlenhoff: Switch ganeti1036 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1120934 [11:48:08] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1003.eqiad.wmnet with OS bullseye [11:48:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [11:48:48] (03CR) 10Arnaudb: [C:03+2] moscovium: remove from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1120917 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [11:49:04] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp [11:49:50] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1001.eqiad.wmnet with OS bookworm [11:51:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1004.eqiad.wmnet to drbd [11:51:28] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10563029 (10ops-monitoring-bot) VM kubestagemaster1004.eqiad.wmnet switching disk type to drbd [11:52:53] (03PS3) 10Fabfur: haproxykafka: limit memory usage to 5% of total physical memory [puppet] - 10https://gerrit.wikimedia.org/r/1120922 (https://phabricator.wikimedia.org/T386747) [11:53:36] !log 
fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp [11:55:32] (03CR) 10Fabfur: haproxykafka: limit memory usage to 5% of total physical memory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120922 (https://phabricator.wikimedia.org/T386747) (owner: 10Fabfur) [11:55:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120922 (https://phabricator.wikimedia.org/T386747) (owner: 10Fabfur) [11:59:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [11:59:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1025.eqiad.wmnet [12:00:04] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1200). [12:01:08] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [12:06:31] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [12:06:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1004.eqiad.wmnet to drbd [12:06:46] PROBLEM - Host kubestagemaster1004 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [12:07:21] FIRING: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:07:30] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update 
remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10563106 (10ops-monitoring-bot) Draining ganeti1036.eqiad.wmnet of running VMs [12:07:34] RECOVERY - Host kubestagemaster1004 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [12:07:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet [12:09:32] RESOLVED: [2x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster1004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagemaster1004.eqiad.wmnet to plain [12:10:10] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10563124 (10ops-monitoring-bot) VM kubestagemaster1004.eqiad.wmnet switching disk type to plain [12:10:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagemaster1004.eqiad.wmnet to plain [12:10:32] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1001.eqiad.wmnet with reason: host reimage [12:10:45] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp [12:12:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1036.eqiad.wmnet [12:12:25] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10563128 (10ops-monitoring-bot) Draining ganeti1036.eqiad.wmnet of running VMs [12:20:53] 06SRE, 
06Infrastructure-Foundations, 10netops: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807 (10cmooney) 03NEW p:05Triage→03Low [12:20:59] 06SRE, 06Infrastructure-Foundations, 10netops: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10563148 (10cmooney) [12:21:01] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10563149 (10cmooney) [12:22:17] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp [12:25:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1120901 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [12:25:50] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1120889 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [12:26:05] (03CR) 10Arnaudb: [C:03+2] ferm: remove moscovium from allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1120889 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [12:27:36] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudgw1003.eqiad.wmnet with OS bullseye [12:27:37] (03CR) 10Arnaudb: [C:03+2] rt: remove cname [dns] - 10https://gerrit.wikimedia.org/r/1120901 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [12:27:50] !log arnaudb@dns1004 START - running authdns-update [12:28:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1001.eqiad.wmnet with OS bookworm [12:29:12] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1120926 (https://phabricator.wikimedia.org/T386796) (owner: 10Vgutierrez) [12:29:46] !log arnaudb@dns1004 END - running authdns-update [12:30:24] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudgw1003.eqiad.wmnet with OS 
bullseye [12:31:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS bookworm [12:32:12] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10563175 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bookworm [12:46:16] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp [12:46:35] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw1003.eqiad.wmnet with reason: host reimage [12:47:17] (03PS1) 10Muehlenhoff: sre.hardware.upgrade-hardware: Mention possibly long run time [cookbooks] - 10https://gerrit.wikimedia.org/r/1120948 (https://phabricator.wikimedia.org/T385873) [12:48:44] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] cloudgw: move icmp checks under wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1100819 (https://phabricator.wikimedia.org/T381580) (owner: 10Tiziano Fogli) [12:50:07] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw1003.eqiad.wmnet with reason: host reimage [12:50:46] !log aborrero@cumin1002 START - Cookbook sre.dns.netbox [12:51:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1025.eqiad.wmnet with reason: host reimage [12:54:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1025.eqiad.wmnet with reason: host reimage [12:54:30] !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - aborrero@cumin1002" [12:54:35] !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw updates - 
aborrero@cumin1002" [12:54:36] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:55:41] (03CR) 10Muehlenhoff: [C:03+1] "That sounds plausible, yes" [puppet] - 10https://gerrit.wikimedia.org/r/1119718 (https://phabricator.wikimedia.org/T386297) (owner: 10Jelto) [13:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:03:45] (03CR) 10Filippo Giunchedi: "I might be missing something here, though I'm not sure what's wrong with keeping https ?" [puppet] - 10https://gerrit.wikimedia.org/r/1120160 (https://phabricator.wikimedia.org/T385750) (owner: 10Phedenskog) [13:05:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:09:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw1003.eqiad.wmnet with OS bullseye [13:11:35] !log aborrero@cumin1002 START - Cookbook sre.dns.wipe-cache vlan1120.cloudgw1003.eqiad1.wikimediacloud.org on all recursors [13:11:38] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) vlan1120.cloudgw1003.eqiad1.wikimediacloud.org on all recursors [13:13:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1025.eqiad.wmnet with OS bookworm [13:13:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10563237 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
jmm@cumin2002 for host ganeti1025.eqiad.wmnet with OS bookworm completed: - ganeti102... [13:18:48] (03PS2) 10Anzx: Lift IP cap for edit-a-thon on 2025-02-26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120954 (https://phabricator.wikimedia.org/T386793) [13:18:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120954 (https://phabricator.wikimedia.org/T386793) (owner: 10Anzx) [13:20:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:34] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1120948 (https://phabricator.wikimedia.org/T385873) (owner: 10Muehlenhoff) [13:22:00] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudgw1001.eqiad.wmnet [13:23:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1025.eqiad.wmnet [13:25:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:26:49] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [13:30:30] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [13:31:09] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: cloudgw1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [13:31:09] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:31:10] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudgw1001.eqiad.wmnet [13:31:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1025.eqiad.wmnet [13:32:02] (03PS1) 10Andrew Bogott: Clean up refs to cloudgw100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1120958 (https://phabricator.wikimedia.org/T386810) [13:32:18] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudgw1002.eqiad.wmnet [13:35:17] (03CR) 10Muehlenhoff: [C:03+2] sre.hardware.upgrade-hardware: Mention possibly long run time [cookbooks] - 10https://gerrit.wikimedia.org/r/1120948 (https://phabricator.wikimedia.org/T385873) (owner: 10Muehlenhoff) [13:35:36] (03CR) 10Andrew Bogott: [C:03+2] Clean up refs to cloudgw100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1120958 (https://phabricator.wikimedia.org/T386810) (owner: 10Andrew Bogott) [13:36:23] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: sre.hardware.upgrade-firmware: Firmware update hangs on Dell PowerEdge R440 - https://phabricator.wikimedia.org/T385873#10563373 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff I've modified the sre.hardware.upgra... 
[13:37:00] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [13:41:20] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [13:41:31] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:41:33] !log installing libtasn1-6 security updates [13:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:46] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:41:55] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudgw1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [13:41:55] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:41:56] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudgw1002.eqiad.wmnet [13:41:57] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:42:20] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:43:01] (03PS2) 10Muehlenhoff: Bump versions of Java 11/17 production images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120544 [13:43:10] (03CR) 10Muehlenhoff: Bump versions of Java 11/17 production images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1120544 (owner: 10Muehlenhoff) [13:43:52] 10ops-eqiad, 06cloud-services-team, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission cloudgw100[12] - https://phabricator.wikimedia.org/T386810#10563401 (10Andrew) a:05Andrew→03None [13:44:39] (03PS2) 10Arnaudb: rt: discarding modules about request 
tracker [puppet] - 10https://gerrit.wikimedia.org/r/1117530 (https://phabricator.wikimedia.org/T384595) [13:45:28] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp [13:45:30] (03CR) 10Arnaudb: "I guess we can now move up the relation chain to clean up rt artifacts" [puppet] - 10https://gerrit.wikimedia.org/r/1117530 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [13:45:54] (03PS2) 10Arnaudb: rt: discarding templates [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) [13:46:40] (03CR) 10Jgiannelos: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120469 (owner: 10PipelineBot) [13:47:38] (03CR) 10Muehlenhoff: rt: discarding modules about request tracker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1117530 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [13:47:49] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120469 (owner: 10PipelineBot) [13:48:07] (03PS3) 10Arnaudb: rt: discarding modules about request tracker [puppet] - 10https://gerrit.wikimedia.org/r/1117530 (https://phabricator.wikimedia.org/T384595) [13:48:13] (03PS3) 10Arnaudb: rt: discarding templates [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) [13:55:18] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:57:38] (03CR) 10Daimona Eaytoy: [C:04-1] Introduce config setting to disable default event-organizer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120632 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [13:59:47] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:00:05] 
Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1400). [14:00:05] Daimona, anzx, and MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] o/ [14:00:13] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:00:18] o/ [14:00:23] (03PS2) 10Daimona Eaytoy: Introduce config setting to disable default event-organizer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120632 (https://phabricator.wikimedia.org/T386290) [14:00:27] o/ [14:00:35] I’m trying to figure out which of the changes are deployable [14:00:40] (I can deploy) [14:01:53] Mine are deployable, I just made a last-minute change. [14:02:04] (03CR) 10Lucas Werkmeister (WMDE): Lift IP cap for edit-a-thon on 2025-02-26 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120954 (https://phabricator.wikimedia.org/T386793) (owner: 10Anzx) [14:02:52] (03PS3) 10Anzx: Lift IP cap for edit-a-thon on 2025-02-26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120954 (https://phabricator.wikimedia.org/T386793) [14:03:10] (03CR) 10Anzx: Lift IP cap for edit-a-thon on 2025-02-26 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120954 (https://phabricator.wikimedia.org/T386793) (owner: 10Anzx) [14:03:36] ok, then let’s start with Daimona [14:03:51] is it okay to deploy both of those config changes at once? [14:04:08] *together [14:05:04] I think so. The first one should be a no-op. 
[14:05:20] alright [14:05:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120632 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [14:05:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [14:06:15] (03Merged) 10jenkins-bot: Introduce config setting to disable default event-organizer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120632 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [14:07:21] “Gerrit could not merge the change '1120626' as is and could require a rebase” [14:07:33] (03CR) 10Arnaudb: rt: discarding modules about request tracker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1117530 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [14:07:45] ah, it wasn’t rebased [14:07:50] (03PS4) 10Daimona Eaytoy: enwiki, mswikt: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 (https://phabricator.wikimedia.org/T386290) [14:07:57] (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [14:08:20] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp [14:08:54] (03Merged) 10jenkins-bot: enwiki, mswikt: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120626 (https://phabricator.wikimedia.org/T386290) (owner: 10Daimona Eaytoy) [14:09:20] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1120632|Introduce config 
setting to disable default event-organizer group (T386290)]], [[gerrit:1120626|enwiki, mswikt: Enable the CampaignEvents extension (T386290 T386538)]] [14:09:25] T386290: Enable CampaignEvents Extension on English Wikipedia - https://phabricator.wikimedia.org/T386290 [14:09:25] T386538: Enable CampaignEvents Extension on mswikt - https://phabricator.wikimedia.org/T386538 [14:09:27] anzx: I’m trying to understand the groups change [14:09:35] especially regarding https://phabricator.wikimedia.org/T386781#10562387 [14:10:42] AFAICT that comment is wrong… what do you think? [14:11:42] wait o_O [14:12:00] why is there no "confirmed" group in https://kn.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&formatversion=2 [14:12:04] but there is one in https://kn.wikipedia.org/w/index.php?title=%E0%B2%B5%E0%B2%BF%E0%B2%B6%E0%B3%87%E0%B2%B7:ListGroupRights&uselang=en ? [14:12:19] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1120632|Introduce config setting to disable default event-organizer group (T386290)]], [[gerrit:1120626|enwiki, mswikt: Enable the CampaignEvents extension (T386290 T386538)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:20] yeah comment does seem misleading, since i am creating new groups on those wikis [14:12:33] Daimona: please test [14:12:40] doing [14:13:18] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [14:14:13] ok, https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups&formatversion=2 vs. https://www.wikidata.org/wiki/Special:ListGroupRights has the same confusing behavior [14:14:27] where a “confirmed” group apparently exists (I can also see it at https://www.wikidata.org/wiki/Special:UserRights/Lucas_Werkmeister_(WMDE)) but not in the API output [14:14:45] (except in add/remove, i.e. 
other groups are allowed to add to / remove from this group) [14:16:23] ok, it seems to be because wgGroupInheritsPermissions has the confirmed group inherit from the autoconfirmed group [14:16:24] (03CR) 10Muehlenhoff: [C:03+1] "Looks good! Make sure to also remove passwords::misc::rt from" [puppet] - 10https://gerrit.wikimedia.org/r/1117530 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [14:16:25] (on all wikis) [14:16:40] and I guess the siteinfo API doesn’t account for that [14:17:46] (03CR) 10Filippo Giunchedi: "Clearing up my review queue -- also I don't think we should be mimicking check_proc and rely instead of systemd to do the right thing in m" [puppet] - 10https://gerrit.wikimedia.org/r/1004672 (owner: 10Slyngshede) [14:18:22] aaand it’s a known issue T357846 [14:18:23] T357846: siteinfo API module does not correctly process groups defined using $wgGroupInheritsPermissions - https://phabricator.wikimedia.org/T357846 [14:19:33] Lucas_WMDE: everything looks OK AFAICT. As a side note, I still need to figure out why ResourceLoader always reports a module as not existing on the first page load, but that's for later. Probably some caching issue. [14:19:43] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync [14:19:45] ok, thanks! [14:22:50] (03CR) 10Dzahn: "the cache/text part can and should be merged. 
the gerrit/phab config parts are pretty unrelated and probably warrant a chat before just re" [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [14:23:44] (03PS3) 10Ssingh: Release dnsdist 1.9.8-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 [14:24:53] (03CR) 10CI reject: [V:04-1] Release dnsdist 1.9.8-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh) [14:26:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:26:19] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120632|Introduce config setting to disable default event-organizer group (T386290)]], [[gerrit:1120626|enwiki, mswikt: Enable the CampaignEvents extension (T386290 T386538)]] (duration: 16m 58s) [14:26:24] T386290: Enable CampaignEvents Extension on English Wikipedia - https://phabricator.wikimedia.org/T386290 [14:26:24] T386538: Enable CampaignEvents Extension on mswikt - https://phabricator.wikimedia.org/T386538 [14:26:38] Noice, thank you :) [14:26:40] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps1006.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:26:45] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:27:16] (03PS1) 10Filippo Giunchedi: profile: don't require realm production for netbox::data [puppet] - 10https://gerrit.wikimedia.org/r/1120967 [14:27:29] (03PS1) 10Gergő Tisza: CentralAuth: Enable SUL3 signup on group 0 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007) [14:27:35] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=wikikube-worker100.*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [14:27:40] np :) [14:27:57] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [14:28:02] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=maps2006.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [14:28:24] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=wikikube-worker200.*.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [14:29:09] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] "It looks like the `groupOverrides` changes shouldn’t be necessary, because the `confirmed` group automatically inherits permissions from t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120891 (https://phabricator.wikimedia.org/T386781) (owner: 10Anzx) [14:30:04] alright, let’s continue with the throttling exception for anzx [14:30:13] (and the groups changes will have to wait a bit) [14:30:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120954 (https://phabricator.wikimedia.org/T386793) (owner: 10Anzx) [14:30:47] (03PS1) 10Bking: relforge: define opensearch datadir as 'opensearch' [puppet] - 10https://gerrit.wikimedia.org/r/1120969 (https://phabricator.wikimedia.org/T380752) [14:31:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web/canary at codfw: 20.83% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=canary - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:31:15] (03Merged) 10jenkins-bot: Lift IP cap for edit-a-thon on 2025-02-26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120954 (https://phabricator.wikimedia.org/T386793) (owner: 10Anzx) [14:31:45] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1120954|Lift IP cap for edit-a-thon on 2025-02-26 (T386793)]] [14:31:49] T386793: IP Lift for Wikithon at Leeds University Weds 26th February - https://phabricator.wikimedia.org/T386793 [14:32:39] (03CR) 10ArielGlenn: [C:03+1] "Here we go..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120968 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [14:34:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120969 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [14:34:38] Lucas_WMDE: but there is no group default for confirmed user in https://github.com/wikimedia/operations-mediawiki-config/blob/47b79412442f37a096de304fc9de1ea018fbcd9b/wmf-config/core-Permissions.php#L3220 [14:34:40] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1120954|Lift IP cap for edit-a-thon on 2025-02-26 (T386793)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:35:16] anzx: can you test the IP cap change? 
[14:35:19] (I’m guessing no ^^) [14:35:27] Lucas_WMDE: no [14:35:30] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, anzx: Continuing with sync [14:35:34] alright, then let’s roll forward with that [14:36:10] anzx: that’s right, the confirmed group is defined via https://github.com/wikimedia/operations-mediawiki-config/blob/47b79412442f37a096de304fc9de1ea018fbcd9b/wmf-config/InitialiseSettings.php#L3285 instead [14:36:41] (03PS1) 10Muehlenhoff: Extend access for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/1120970 [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:36:45] it doesn’t show up in the siteinfo API output due to a bug, but it does exist, you can see it e.g. on Special:ListGroupRights [14:36:51] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [14:37:36] (03CR) 10Brouberol: [C:03+1] relforge: define opensearch datadir as 'opensearch' [puppet] - 10https://gerrit.wikimedia.org/r/1120969 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [14:39:27] (03CR) 10Volans: [C:03+1] "The hiera lookups seems to have a default in:" [puppet] - 10https://gerrit.wikimedia.org/r/1120967 (owner: 10Filippo Giunchedi) [14:39:33] (03PS5) 10Anzx: knwiki, knwikisource, tcywikisource: add confirmed user usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120891 (https://phabricator.wikimedia.org/T386781) [14:40:22] MichaelG_WMF: should we deploy your changes together or separately? 
(once we get to them) [14:40:26] (03CR) 10Anzx: "removed `groupOverrides`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120891 (https://phabricator.wikimedia.org/T386781) (owner: 10Anzx) [14:40:42] Lucas_WMDE: yes please [14:40:47] (03CR) 10Bking: [C:03+2] relforge: define opensearch datadir as 'opensearch' [puppet] - 10https://gerrit.wikimedia.org/r/1120969 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [14:40:56] that was supposed to be an exclusive or :P [14:41:06] XD [14:41:10] together [14:41:14] ok ^^ [14:41:24] (03CR) 10Muehlenhoff: [C:03+2] Extend access for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/1120970 (owner: 10Muehlenhoff) [14:41:35] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] knwiki, knwikisource, tcywikisource: add confirmed user usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120891 (https://phabricator.wikimedia.org/T386781) (owner: 10Anzx) [14:41:42] though I’ll do ^ first, it looks good to me now [14:41:44] jouncebot: next [14:41:44] In 0 hour(s) and 18 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1500) [14:41:56] I'll have one more backport soon, will self-deploy [14:42:09] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120954|Lift IP cap for edit-a-thon on 2025-02-26 (T386793)]] (duration: 10m 24s) [14:42:13] T386793: IP Lift for Wikithon at Leeds University Weds 26th February - https://phabricator.wikimedia.org/T386793 [14:42:20] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "thanks! 
LGTM now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120891 (https://phabricator.wikimedia.org/T386781) (owner: 10Anzx) [14:42:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120891 (https://phabricator.wikimedia.org/T386781) (owner: 10Anzx) [14:42:27] (I hope that it is faster than in the past, because GE no longer depends on Wikibase in CI except for the gate jobs) [14:42:36] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: Apply JDK 11 update - eevans@cumin1002 [14:42:54] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet [14:42:59] James_F: backports will probably overrun into your window, I’m guessing that’s okay as usual [14:43:13] MichaelG_WMF: let’s start the backport gate-and-submits already [14:43:24] (03Merged) 10jenkins-bot: knwiki, knwikisource, tcywikisource: add confirmed user usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120891 (https://phabricator.wikimedia.org/T386781) (owner: 10Anzx) [14:43:30] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120618 (https://phabricator.wikimedia.org/T386490) (owner: 10Michael Große) [14:43:34] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120620 (https://phabricator.wikimedia.org/T386490) (owner: 10Michael Große) [14:43:37] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120643 (owner: 10Michael Große) [14:43:46] (but not the config change just yet) [14:43:54] !log 
lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1120891|knwiki, knwikisource, tcywikisource: add confirmed user usergroup (T386781)]] [14:43:57] T386781: Allow sysops to add/revoke Confirmed user usergroup on knwiki, knwikisource, tcywikisource - https://phabricator.wikimedia.org/T386781 [14:44:19] (03PS4) 10Arnaudb: rt: discarding templates [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) [14:44:52] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [14:46:12] (03PS1) 10Gergő Tisza: Add configuration options and global preference for the SUL3 rolllout [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120977 (https://phabricator.wikimedia.org/T384549) [14:46:13] (03PS5) 10Arnaudb: rt: sunsetting caching [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) [14:46:14] (03CR) 10Arnaudb: "Files have been restored, this commit now only impacts cache hieradata" [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [14:46:44] (03PS1) 10Gergő Tisza: Add configuration options and global preference for the SUL3 rolllout [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120978 (https://phabricator.wikimedia.org/T384549) [14:46:52] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1120891|knwiki, knwikisource, tcywikisource: add confirmed user usergroup (T386781)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:46:56] Lucas_WMDE: checking [14:46:59] thanks :) [14:47:15] (03PS3) 10Brouberol: airflow: add kafka-{test,jumbo}-eqiad connections to the remaining instances [puppet] - 
10https://gerrit.wikimedia.org/r/1118831 (https://phabricator.wikimedia.org/T379676) [14:48:34] changes look good to me so far [14:48:51] Lucas_WMDE: [14:48:58] looks good to me [14:49:00] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, anzx: Continuing with sync [14:49:05] great, thank you! [14:49:07] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4958/co" [puppet] - 10https://gerrit.wikimedia.org/r/1118831 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [14:49:19] (03CR) 10Brouberol: [V:03+1 C:03+2] airflow: add kafka-{test,jumbo}-eqiad connections to the remaining instances [puppet] - 10https://gerrit.wikimedia.org/r/1118831 (https://phabricator.wikimedia.org/T379676) (owner: 10Brouberol) [14:50:39] (03CR) 10Filippo Giunchedi: [C:03+2] "Thank you for the quick review! -- I can confirm that the profile works fine in realm labs" [puppet] - 10https://gerrit.wikimedia.org/r/1120967 (owner: 10Filippo Giunchedi) [14:53:41] (03CR) 10Arnaudb: "I found the private repo entry, still trying to find the stub in labs/private" [puppet] - 10https://gerrit.wikimedia.org/r/1117530 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [14:53:57] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] [wmcs::kubeadm::core] remove kubeadm-flags.env [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193) (owner: 10Raymond Ndibe) [14:54:17] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10563685 (10ayounsi) Thanks ! >>! In T384731#10556225, @fgiunchedi wrote: > Since we have to overwrite `instance`... 
[14:55:35] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120891|knwiki, knwikisource, tcywikisource: add confirmed user usergroup (T386781)]] (duration: 11m 41s) [14:55:39] T386781: Allow sysops to add/revoke Confirmed user usergroup on knwiki, knwikisource, tcywikisource - https://phabricator.wikimedia.org/T386781 [14:56:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120904 (https://phabricator.wikimedia.org/T385343) (owner: 10Michael Große) [14:56:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120618 (https://phabricator.wikimedia.org/T386490) (owner: 10Michael Große) [14:56:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120620 (https://phabricator.wikimedia.org/T386490) (owner: 10Michael Große) [14:56:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120643 (owner: 10Michael Große) [14:56:24] Lucas_WMDE: thank you [14:56:28] np :) [14:56:59] (03CR) 10Muehlenhoff: [C:03+1] rt: sunsetting caching [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [14:57:07] (03Merged) 10jenkins-bot: Growth: increase minimum tasks per topic on idwiki; ruwiki => default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120904 (https://phabricator.wikimedia.org/T385343) (owner: 10Michael Große) [14:57:41] (03Merged) 10jenkins-bot: fix(Surfacing): make instrumentation platform-aware [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 
10https://gerrit.wikimedia.org/r/1120618 (https://phabricator.wikimedia.org/T386490) (owner: 10Michael Große) [14:57:43] (03Merged) 10jenkins-bot: feat(Surfacing): track performance metrics with statslib [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120620 (https://phabricator.wikimedia.org/T386490) (owner: 10Michael Große) [14:57:44] (03Merged) 10jenkins-bot: fix(surfacing): add dependency for link-icon in popup header [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120643 (owner: 10Michael Große) [14:58:11] * MichaelG_WMF is here and ready to test when you are :) [14:58:19] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1120904|Growth: increase minimum tasks per topic on idwiki; ruwiki => default (T385343)]], [[gerrit:1120618|fix(Surfacing): make instrumentation platform-aware (T386490)]], [[gerrit:1120620|feat(Surfacing): track performance metrics with statslib (T386490)]], [[gerrit:1120643|fix(surfacing): add dependency for link-icon in popup header]] [14:58:22] (03CR) 10Muehlenhoff: [C:03+1] "It's in private/modules/passwords/manifests/init.pp" [puppet] - 10https://gerrit.wikimedia.org/r/1117530 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [14:58:23] T385343: Surfacing "Add a link" Structured Tasks: Experiment Release (FY24/25 WE1.2.9) - https://phabricator.wikimedia.org/T385343 [14:58:24] T386490: Update Surfacing Add a Link intrumentation and tracking to desktop and statslib - https://phabricator.wikimedia.org/T386490 [14:58:47] Lucas_WMDE: Yeah, fine. 
[14:59:00] 2x ack :) [14:59:55] We can do my config change another time, that is not urgent and I can just come back about it in the backport window tonight or tomorrow [15:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1500) [15:00:18] MichaelG_WMF: scap is already running [15:00:45] Lucas_WMDE: ah, that is also fine, thanks! [15:01:18] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, migr: Backport for [[gerrit:1120904|Growth: increase minimum tasks per topic on idwiki; ruwiki => default (T385343)]], [[gerrit:1120618|fix(Surfacing): make instrumentation platform-aware (T386490)]], [[gerrit:1120620|feat(Surfacing): track performance metrics with statslib (T386490)]], [[gerrit:1120643|fix(surfacing): add dependency for link-icon in popup header]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:01:25] there is nothing to test for the config change (it changes behavior of a maintenance script which will be picked up later) [15:01:34] and anything for the backports? [15:01:39] yes! [15:01:53] though testing that backport will be quick [15:02:06] ok [15:03:33] (03PS1) 10Jforrester: wikifunctions: Update orchestrator from 2025-02-12-171406 to 2025-02-19-134350 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120991 (https://phabricator.wikimedia.org/T383631) [15:03:35] (03PS1) 10Jforrester: wikifunctions: Update evaluators from 2025-02-11-155338 to 2025-02-19-135838 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120992 (https://phabricator.wikimedia.org/T383644) [15:03:36] Lucas_WMDE: Looks good with mwdebug! [15:03:46] Ready to move forward from my side [15:03:56] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, migr: Continuing with sync [15:03:58] great, thanks! 
[15:04:21] (03CR) 10FNegri: [C:03+2] [wmcs::kubeadm::core] remove kubeadm-flags.env [puppet] - 10https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T374193) (owner: 10Raymond Ndibe) [15:04:32] !log upgrading eventgate-analytics in eqiad to node20 - T383814 [15:04:35] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [15:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:36] T383814: Upgrade eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814 [15:04:50] ottomata: This is our services window. :-P [15:05:07] James_F: is it ok to do another half an hour or so of MediaWiki backports? AIUI it doesn't interfere with the Wikifunctions window [15:05:26] tgr|away: Yes, it shouldn't be an issue. [15:05:33] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [15:05:36] James_F: ah I'm sorry. i usually look but didn't today. waited until afternoon backport was over. [15:05:45] James_F: it shouldn't be related or interfere at all [15:05:47] ottomata: Afternoon backport also isn't over. [15:06:09] well crap sorry. just waited until scheduled window time was over. [15:06:17] I mean, the /window/ is over, but the deploying isn't. [15:06:19] i should have checked in. [15:06:26] No worries. :-) [15:06:33] will do next time. [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:24] thanks James (for letting us continue with MW backports) [15:07:32] a scap backport and a helm-based deploy running in parallel should be fine though, right? 
[15:07:40] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Update orchestrator from 2025-02-12-171406 to 2025-02-19-134350 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120991 (https://phabricator.wikimedia.org/T383631) (owner: 10Jforrester) [15:07:49] tgr|away: Depends on what they talk to, but in this case yes. [15:08:43] (03PS3) 10Awight: [beta] Enable Community Configuration for Cite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120466 (https://phabricator.wikimedia.org/T385597) [15:08:52] (03Merged) 10jenkins-bot: wikifunctions: Update orchestrator from 2025-02-12-171406 to 2025-02-19-134350 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120991 (https://phabricator.wikimedia.org/T383631) (owner: 10Jforrester) [15:10:32] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:10:32] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120904|Growth: increase minimum tasks per topic on idwiki; ruwiki => default (T385343)]], [[gerrit:1120618|fix(Surfacing): make instrumentation platform-aware (T386490)]], [[gerrit:1120620|feat(Surfacing): track performance metrics with statslib (T386490)]], [[gerrit:1120643|fix(surfacing): add dependency for link-icon in popup header]] (duration: 12m 13s) [15:10:39] T385343: Surfacing "Add a link" Structured Tasks: Experiment Release (FY24/25 WE1.2.9) - https://phabricator.wikimedia.org/T385343 [15:10:40] T386490: Update Surfacing Add a Link intrumentation and tracking to desktop and statslib - https://phabricator.wikimedia.org/T386490 [15:10:40] * Lucas_WMDE done deploying [15:10:44] tgr|away: over to you [15:10:53] (03CR) 10Awight: [C:03+2] "Beta deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120217 (https://phabricator.wikimedia.org/T373307) (owner: 10WMDE-Fisch) [15:11:01] (03CR) 10Awight: [C:03+2] "Beta deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120466 
(https://phabricator.wikimedia.org/T385597) (owner: 10Awight) [15:11:13] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:11:45] (03Merged) 10jenkins-bot: [beta] Change sub-referencing feature flag to new name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120217 (https://phabricator.wikimedia.org/T373307) (owner: 10WMDE-Fisch) [15:11:48] (03CR) 10CI reject: [V:04-1] [beta] Enable Community Configuration for Cite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120466 (https://phabricator.wikimedia.org/T385597) (owner: 10Awight) [15:12:11] (03PS1) 10Andrew Bogott: wmcs-novastats-cephleaks: don't crash if trying to delete a missing file [puppet] - 10https://gerrit.wikimedia.org/r/1120996 (https://phabricator.wikimedia.org/T383796) [15:12:22] thx [15:12:38] (03CR) 10Gergő Tisza: [C:03+2] Add configuration options and global preference for the SUL3 rolllout [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120977 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [15:12:40] (03CR) 10Gergő Tisza: [C:03+2] Add configuration options and global preference for the SUL3 rolllout [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120978 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [15:12:47] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:12:56] (03CR) 10Andrew Bogott: [C:03+2] wmcs-novastats-cephleaks: don't crash if trying to delete a missing file [puppet] - 10https://gerrit.wikimedia.org/r/1120996 (https://phabricator.wikimedia.org/T383796) (owner: 10Andrew Bogott) [15:13:20] (03PS4) 10Awight: [beta] Enable Community Configuration for Cite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120466 (https://phabricator.wikimedia.org/T385597) [15:13:35] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:13:48] (03CR) 10Awight: [C:03+2] 
"Beta deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120466 (https://phabricator.wikimedia.org/T385597) (owner: 10Awight) [15:13:51] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:14:36] (03Merged) 10jenkins-bot: [beta] Enable Community Configuration for Cite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120466 (https://phabricator.wikimedia.org/T385597) (owner: 10Awight) [15:14:45] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:15:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web/next (k8s) 1.008s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:17:02] (03CR) 10Pcoombe: "Just "search" and "uselang" should be all" [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [15:18:09] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Update evaluators from 2025-02-11-155338 to 2025-02-19-135838 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120992 (https://phabricator.wikimedia.org/T383644) (owner: 10Jforrester) [15:18:26] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:18:39] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: Apply JDK 11 update - 
eevans@cumin1002 [15:19:18] (03Merged) 10jenkins-bot: wikifunctions: Update evaluators from 2025-02-11-155338 to 2025-02-19-135838 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1120992 (https://phabricator.wikimedia.org/T383644) (owner: 10Jforrester) [15:19:26] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:20:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-web/next (k8s) 1.008s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [15:20:52] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:20:57] 06SRE, 07LDAP, 13Patch-For-Review: ldap-admins POSIX group does not actually give any permissions to its members - https://phabricator.wikimedia.org/T386472#10563839 (10MoritzMuehlenhoff) I did a little Puppet archeology: * The name of the modify-ldap-user command was moved from sbin to bin in Puppet in 2016... 
[15:21:10] (03CR) 10Muehlenhoff: [C:04-1] "Let's put this on hold until the discussion on https://phabricator.wikimedia.org/T386472 is complete" [puppet] - 10https://gerrit.wikimedia.org/r/1120592 (https://phabricator.wikimedia.org/T386472) (owner: 10Dzahn) [15:21:55] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:22:06] (03Merged) 10jenkins-bot: Add configuration options and global preference for the SUL3 rolllout [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120977 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [15:22:15] (03Merged) 10jenkins-bot: Add configuration options and global preference for the SUL3 rolllout [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120978 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [15:22:46] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:23:33] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:23:50] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:24:40] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:24:50] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:25:11] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:25:22] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:26:15] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:29:02] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1120977|Add configuration options and global preference for the SUL3 rolllout (T384549 T377144 T384552 T384215)]], [[gerrit:1120978|Add configuration options and global preference for the SUL3 rolllout (T384549 T377144 T384552 T384215)]] [15:29:10] T384549: 
Create a per-user flag for enabling SUL3 - https://phabricator.wikimedia.org/T384549 [15:29:10] T377144: Create method for deterministically opting new users into SUL3 rollout - https://phabricator.wikimedia.org/T377144 [15:29:11] T384552: Create method for staged opt-in of new users into SUL3 rollout - https://phabricator.wikimedia.org/T384552 [15:29:11] T384215: Create method for staged opt-in of existing users into SUL3 rollout - https://phabricator.wikimedia.org/T384215 [15:30:56] awight: please do a git rebase after merging beta patches next time, unexpected patches confuse scap backport [15:32:38] (ftr awight had asked about this in -releng and I wasn’t sure if just +2ing was okay or not – good to know) [15:32:53] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-worker2001.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [15:33:08] awight: Another way to put that is: Always run `scap backport` for any operations/mediawiki-config change, even if they're beta-only. `scap backport` is smart enough to shortcut the deployment if it sees a beta-only config change. [15:33:50] oh cool, wasn't aware of that [15:33:53] tgr|away: Oof, sorry that I chose the busiest possible moment to "sneak" some cruft into the mix. [15:34:01] oh will it do the rebase and etc first, then bail? that's nice [15:34:15] Yeah [15:34:26] Please never use git commands in /srv/mediawiki-staging again. :-)) [15:34:33] yay! 
[15:34:34] :-D [15:34:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10563987 (10phaultfinder) [15:37:06] (03CR) 10Arnaudb: [C:03+2] rt: discarding modules about request tracker [puppet] - 10https://gerrit.wikimedia.org/r/1117530 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [15:38:59] (03CR) 10Arnaudb: [C:03+2] rt: sunsetting caching [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [15:40:00] Lucas_WMDE: do you by any chance have an idea what this error means? https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2025.02.19?id=whDaHpUBLmySI1N_YsRT [15:40:20] * Lucas_WMDE looks [15:40:28] oh [15:40:28] oh fuck [15:40:32] it's breaking one of the scap tests, but it's not obvious to me how it could be related to the patch being deployed [15:40:41] this is bizarre [15:40:45] only happening on mwdebug though [15:40:49] we just started seeing errors like this in CI too https://phabricator.wikimedia.org/T386836 [15:40:58] but how on earth could it sneak into production now [15:41:18] cold cache or something like that? [15:42:02] you can see it at https://test.wikidata.org/wiki/Q232463 on mwdebug, but on the normal servers it works [15:42:29] I have a feeling it must be caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1115497 ? 
[15:42:43] if we started seeing it in CI soon after that was merged on master, and it’s also whining on mwdebug during backport [15:42:48] even if I have no idea yet how it could be related [15:42:52] * Lucas_WMDE looks at the change [15:43:06] (I seem to recall that Wikibase CI does indeed pull in CentralAuth through some transitive dependency) [15:43:29] that would make sense, but nothing in that patch interferes with content model registration afaik [15:43:56] not directly, that's certain [15:44:09] it’s probably something pretty arcane [15:44:18] anything else from that req id that might lead us to an earlier message? [15:44:20] e.g. I could imagine that your patch causes some services to be initialized in a different order [15:44:31] and now some Wikibase hook runs too late to register the content models [15:46:04] let me try to see how wikibase registers those content models [15:47:40] sigh no related log entries in logstash [15:49:06] (03PS1) 10Scott French: setup.py: add with-dbctl extra to conftool dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1121021 [15:49:30] one of the CA hooks runs on SetupAfterCache which is quite early, and the patch changes its dependencies [15:50:05] yeah, and Wikibase also registers its content models in onSetupAfterCache [15:50:06] the new dependencies are PreferencesFactory and UserNameUtils [15:50:08] (03CR) 10Scott French: dbctl: pass DbCtlConfiguration to DbConfig (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1120648 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [15:50:09] I think I remember that being an issue before [15:50:10] (03CR) 10Scott French: [C:03+2] dbctl: pass DbCtlConfiguration to DbConfig [software/spicerack] - 10https://gerrit.wikimedia.org/r/1120648 (https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [15:50:39] maybe https://phabricator.wikimedia.org/T288819 [15:50:47] anyway, if it's CI reproducible, I'll just roll back 
[15:51:14] thanks for the quick response [15:51:28] yeah UserNameUtils pulls in NamespaceInfo [15:51:39] via ContentLanguage -> LanguageFactory -> NamespaceInfo [15:51:39] (03PS4) 10Ssingh: Release dnsdist 1.9.8-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 [15:51:47] tgr|away: ack, thanks [15:52:19] I’m *really* glad this was caught on mwdebug [15:53:01] (03PS1) 10TrainBranchBot: Revert "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121023 [15:53:02] (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as I823add719e2eaa8889c9f1676492c1cfe3d23a1c" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1120977 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [15:53:09] (03PS1) 10TrainBranchBot: Revert "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121024 [15:53:10] (03CR) 10TrainBranchBot: "tgr@deploy2002 created a revert of this change as I2719f041db9f2a62aaf82001979a9296bba8b835" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1120978 (https://phabricator.wikimedia.org/T384549) (owner: 10Gergő Tisza) [15:54:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121023 (owner: 10TrainBranchBot) [15:54:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121024 (owner: 10TrainBranchBot) [15:58:52] (03CR) 10Dzahn: [C:03+1] rt: sunsetting caching [puppet] - 10https://gerrit.wikimedia.org/r/1117531 (https://phabricator.wikimedia.org/T384595) (owner: 10Arnaudb) [15:59:10] btw I just tested it and real Wikidata would also 
have been broken (e.g. https://www.wikidata.org/wiki/Q42) [15:59:10] (03CR) 10Dzahn: [C:03+1] rt: remove cname [dns] - 10https://gerrit.wikimedia.org/r/1120901 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [15:59:37] (03CR) 10Dzahn: [C:03+1] ferm: remove moscovium from allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1120889 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [15:59:53] (03CR) 10Dzahn: [C:03+1] moscovium: remove from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1120917 (https://phabricator.wikimedia.org/T385777) (owner: 10Arnaudb) [16:00:28] 06SRE, 06Infrastructure-Foundations, 10netops: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10564100 (10cmooney) [16:00:57] (03CR) 10David Caro: nova vendordata: set fqdn from project_name rather than project_id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [16:01:51] (03Merged) 10jenkins-bot: Revert "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1121023 (owner: 10TrainBranchBot) [16:02:34] (03Merged) 10jenkins-bot: Revert "Add configuration options and global preference for the SUL3 rolllout" [extensions/CentralAuth] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121024 (owner: 10TrainBranchBot) [16:02:47] (03PS1) 10Scott French: setup.py: add with-dbctl extra to conftool dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1121021 [16:02:47] (03CR) 10Scott French: "Thanks in advance for the review, Riccardo!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1121021 (owner: 10Scott French) [16:02:56] (03PS1) 10Giuseppe Lavagetto: kartotherian: add extra FQDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121035 [16:03:06] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1121023|Revert "Add configuration options and global preference for the SUL3 rolllout"]], [[gerrit:1121024|Revert "Add configuration options and global preference for the SUL3 rolllout"]] [16:03:26] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1121021 (owner: 10Scott French) [16:06:07] !log tgr@deploy2002 tgr, trainbranchbot: Backport for [[gerrit:1121023|Revert "Add configuration options and global preference for the SUL3 rolllout"]], [[gerrit:1121024|Revert "Add configuration options and global preference for the SUL3 rolllout"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:07:18] (03CR) 10CI reject: [V:04-1] Release dnsdist 1.9.8-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh) [16:08:17] (03CR) 10David Caro: [C:03+1] "Got a question, LGTM anyhow" [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [16:08:51] (03PS2) 10Giuseppe Lavagetto: kartotherian: add extra FQDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121035 [16:09:10] !log tgr@deploy2002 tgr, trainbranchbot: Continuing with sync [16:10:49] 10ops-eqiad, 06SRE, 10Ceph, 06cloud-services-team, and 2 others: evaluate new drives in cloudcephosd102[123] - https://phabricator.wikimedia.org/T386725#10564210 (10Andrew) p:05Triage→03Medium [16:11:42] (03PS5) 10Ssingh: Release dnsdist 1.9.8-1+wmf12u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 [16:12:31] (03Merged) 10jenkins-bot: dbctl: pass DbCtlConfiguration to DbConfig [software/spicerack] - 10https://gerrit.wikimedia.org/r/1120648 
(https://phabricator.wikimedia.org/T383324) (owner: 10Scott French) [16:15:11] (03CR) 10Scott French: [C:03+2] "Thanks, Riccard" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1121021 (owner: 10Scott French) [16:15:36] (03CR) 10Pppery: "Could you explain what those reasons are? The initial patch is completely lacking any reasoning." [puppet] - 10https://gerrit.wikimedia.org/r/1080357 (https://phabricator.wikimedia.org/T318285) (owner: 10Simon04) [16:15:50] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121023|Revert "Add configuration options and global preference for the SUL3 rolllout"]], [[gerrit:1121024|Revert "Add configuration options and global preference for the SUL3 rolllout"]] (duration: 12m 43s) [16:19:03] (03PS3) 10Giuseppe Lavagetto: kartotherian: add extra FQDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121035 [16:19:09] (03PS3) 10Muehlenhoff: openssh: Remove code to disable NIST key exchange [puppet] - 10https://gerrit.wikimedia.org/r/1074381 [16:20:45] (03PS6) 10BryanDavis: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) [16:22:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1074381 (owner: 10Muehlenhoff) [16:26:50] (03CR) 10BryanDavis: [C:03+2] toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [16:26:50] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10564274 (10cmooney) >>! In T384731#10563685, @ayounsi wrote: >>! In T384731#10556225, @fgiunchedi wrote: >> I also... 
[16:28:04] (03CR) 10Elukey: [C:03+2] kartotherian: add extra FQDN [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121035 (owner: 10Giuseppe Lavagetto) [16:28:18] (03Merged) 10jenkins-bot: toolhub: Add pod.kubernetes.io/sidecars annotation to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1119198 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [16:29:22] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [16:30:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1036.eqiad.wmnet [16:30:36] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [16:30:42] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [16:30:45] (03CR) 10Ssingh: "Ready for review." [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/1120607 (owner: 10Ssingh) [16:30:52] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [16:31:07] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [16:31:49] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/kartotherian: sync [16:31:51] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/kartotherian: sync [16:32:11] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [16:32:15] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [16:32:31] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [16:32:55] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [16:33:59] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [16:34:25] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-worker200.*.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:35:14] 06SRE, 06serviceops, 
10Wikimedia-Mailing-lists: Set up memcached for mailman3 - https://phabricator.wikimedia.org/T282931#10564303 (10jijiki) 05Open→03Stalled [16:35:30] (03Merged) 10jenkins-bot: setup.py: add with-dbctl extra to conftool dependency [software/spicerack] - 10https://gerrit.wikimedia.org/r/1121021 (owner: 10Scott French) [16:37:45] 06SRE, 06serviceops, 10Wikimedia-Mailing-lists: Set up memcached for mailman3 - https://phabricator.wikimedia.org/T282931#10564314 (10jijiki) p:05Medium→03Low [16:38:27] !log elukey@puppetserver1001 conftool action : set/pooled=yes; selector: name=wikikube-worker100.*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:39:33] (03CR) 10Jgiannelos: [C:03+1] "@hnowlan@wikimedia.org Looks good to me but can you also take a look? We can do the deployments." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118890 (https://phabricator.wikimedia.org/T386244) (owner: 10Arlolra) [16:44:41] (03PS1) 10Aklapper: Rename a variable to be clearer [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121050 [16:44:53] (03CR) 10Aklapper: [V:03+2 C:03+2] Rename a variable to be clearer [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121050 (owner: 10Aklapper) [16:49:44] (03CR) 10Scott French: [C:03+1] "Looks good. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1120700 (owner: 10RLazarus) [16:52:07] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=wikikube-worker100.*.eqiad.wmnet,dc=eqiad,cluster=maps,service=kartotherian-k8s-ssl [16:52:17] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=wikikube-worker200.*.codfw.wmnet,dc=codfw,cluster=maps,service=kartotherian-k8s-ssl [16:59:40] 06SRE, 06Traffic: Define an event stream and schema for haproxy_requestctl analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10564431 (10Fabfur) >>! 
In T383392#10560361, @Ottomata wrote: > @Fabfur {T383914} has been deployed, so it should be possible to remove the `meta.domain` field... [17:01:12] (03PS1) 10DCausse: Revert "cirrus: enable mlr-2025 for select wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120534 (owner: 10Gmodena) [17:02:06] (03CR) 10Vgutierrez: "lvs2013 (low-traffic LVS) hasn't any IPIP services till now, so we had IPIP support disabled there, enabling it deploys ipip-multiqueue-op" [puppet] - 10https://gerrit.wikimedia.org/r/1120496 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [17:05:13] (03CR) 10Jdlrobson: [C:04-1] Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120609 (https://phabricator.wikimedia.org/T386734) (owner: 10Bernard Wang) [17:09:26] (03PS1) 10Aklapper: Add some comments in editscore section [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121054 [17:10:05] (03CR) 10Aklapper: [V:03+2 C:03+2] Add some comments in editscore section [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121054 (owner: 10Aklapper) [17:13:55] (03PS4) 10Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) [17:15:21] (03CR) 10Krinkle: [C:03+1] "LGTM for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: 10Cwhite) [17:18:24] (03CR) 10Hnowlan: [C:04-1] "Change makes sense to me, but the chart will need a version bump for this to be rolled out successfully." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1118890 (https://phabricator.wikimedia.org/T386244) (owner: 10Arlolra) [17:28:47] (03CR) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [17:29:04] (03PS2) 10Andrew Bogott: vendordata.txt: include rudimentary clouds.yaml in initial VM [puppet] - 10https://gerrit.wikimedia.org/r/1120683 (https://phabricator.wikimedia.org/T379030) [17:29:04] (03PS7) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) [17:31:10] (03PS1) 10Gergő Tisza: NewUserMessage: Enable on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121055 [17:31:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121055 (owner: 10Gergő Tisza) [17:33:38] PROBLEM - Hadoop NodeManager on an-worker1105 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:44:38] RECOVERY - Hadoop NodeManager on an-worker1105 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:45:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:47:01] (03CR) 10MVernon: [C:03+1] "Thanks for the explanation :)" [puppet] - 10https://gerrit.wikimedia.org/r/1120496 
(https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [17:47:21] (03CR) 10MVernon: [C:03+1] hiera: Enable IPIP on ms-fe@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1120603 (https://phabricator.wikimedia.org/T385564) (owner: 10Vgutierrez) [17:47:56] (03CR) 10RLazarus: [C:03+2] deployment_server: Refactor some utility functions into a Job class [puppet] - 10https://gerrit.wikimedia.org/r/1120700 (owner: 10RLazarus) [17:54:26] (03CR) 10Dzahn: [C:03+2] logspam: Consolidate CurlFactory cURL errors [puppet] - 10https://gerrit.wikimedia.org/r/1056221 (https://phabricator.wikimedia.org/T371633) (owner: 10Ahmon Dancy) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1800) [18:11:45] Lucas_WMDE: we'd like to backport the CentralAuth patch today or tomorrow as it's needed for SUL3 rollout. Would you feel comfortable with the MediaWikiServices patch also being backported, or should we look for a workaround for now? [18:31:16] (03CR) 10Daimona Eaytoy: [C:03+1] "Thank you! LGTM now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [18:31:23] (03CR) 10CI reject: [V:04-1] frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [18:32:20] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for Melos - https://phabricator.wikimedia.org/T386581#10565042 (10KFrancis) Hello all, the NDA is out for signatures. I'll confirm when it's complete. 
[18:42:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [18:43:34] (03PS2) 10Jdlrobson: Footer: Wikimedia icon should collapse at lower resolutions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) [18:43:40] (03CR) 10Jdlrobson: Footer: Wikimedia icon should collapse at lower resolutions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [18:45:57] (03PS3) 10Bernard Wang: Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120609 (https://phabricator.wikimedia.org/T386734) [18:46:00] (03CR) 10Jdlrobson: [C:03+1] Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120609 (https://phabricator.wikimedia.org/T386734) (owner: 10Bernard Wang) [18:46:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120609 (https://phabricator.wikimedia.org/T386734) (owner: 10Bernard Wang) [18:51:06] (03PS1) 10Jdlrobson: Lazy image loading Grade C fallback is broken [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121077 (https://phabricator.wikimedia.org/T386400) [18:51:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, February 19 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/MobileFrontend] 
(wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121077 (https://phabricator.wikimedia.org/T386400) (owner: 10Jdlrobson) [18:54:45] !log fab@deploy2002 Started deploy [airflow-dags/research@95b14c7]: (no justification provided) [18:54:54] !log fab@deploy2002 Finished deploy [airflow-dags/research@95b14c7]: (no justification provided) (duration: 00m 11s) [18:57:40] (03PS3) 10Arlolra: Bust cache for recreated pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1118890 (https://phabricator.wikimedia.org/T386244) [19:00:05] dancy and andre: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1900). [19:00:07] (03CR) 10Ladsgroup: "LGTM, haven't tested it but I will do later. Want me to deploy it today?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [19:01:34] (03CR) 10Herron: "Thanks for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/1120602 (https://phabricator.wikimedia.org/T385727) (owner: 10Herron) [19:05:19] (03CR) 10Herron: [C:03+1] "makes sense to me 👍" [alerts] - 10https://gerrit.wikimedia.org/r/1120923 (owner: 10Filippo Giunchedi) [19:05:44] o/ [19:07:31] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121078 (https://phabricator.wikimedia.org/T382368) [19:07:33] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121078 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [19:08:39] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, and 3 others: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731#10565308 (10cmooney) >>! In T384731#10563685, @ayounsi wrote: > Is it possible to duplicate the metric, before the... 
[19:08:43] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121078 (https://phabricator.wikimedia.org/T382368) (owner: 10TrainBranchBot) [19:18:03] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.17 refs T382368 [19:18:07] T382368: 1.44.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T382368 [19:18:45] (03PS1) 10Dzahn: puppetserver: fix puppet dir dependency issue in cloudvps masters [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) [19:19:08] (03CR) 10CI reject: [V:04-1] puppetserver: fix puppet dir dependency issue in cloudvps masters [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [19:20:44] (03PS2) 10Dzahn: puppetserver: fix puppet dir dependency issue in cloudvps masters [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) [19:21:06] (03CR) 10CI reject: [V:04-1] puppetserver: fix puppet dir dependency issue in cloudvps masters [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [19:22:21] wow, kudos, CI detected that I typed "pupppet" instead of puppet :) [19:23:06] Nice work jenkinsbot [19:23:24] (03PS3) 10Dzahn: puppetserver: fix puppet dir dependency issue in cloudvps masters [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) [19:23:54] and whoever added that string to the typos file after doing it before [19:24:14] (03CR) 10Jforrester: [C:03+1] Footer: Wikimedia icon should collapse at lower resolutions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [19:27:15] !log fab@deploy2002 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [19:28:01] !log fab@deploy2002 Finished deploy 
[airflow-dags/research@b5ce354]: (no justification provided) (duration: 00m 46s) [19:28:21] jouncebot: nowandnext [19:28:21] For the next 1 hour(s) and 31 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T1900) [19:28:21] In 1 hour(s) and 31 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T2100) [19:29:57] dancy: are you done with train. i need to restart jenkins [19:30:09] ^ sorry, that's a question :) [19:30:20] Yep. Train looks good. [19:34:14] (03PS1) 10Daimona Eaytoy: Enable $wgCampaignEventsEnableEventInvitation on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121080 (https://phabricator.wikimedia.org/T383800) [19:35:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121080 (https://phabricator.wikimedia.org/T383800) (owner: 10Daimona Eaytoy) [19:35:13] !log restarting jenkins to fix git related issues following java update (T386755) [19:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:16] T386755: Multiple *-pipeline-test jobs failing to load pipelinelib with git error - https://phabricator.wikimedia.org/T386755 [19:36:33] doing a "safe" restart so this might be awhile. the build queue is going to fill up quite a bit as well [19:51:20] fyi a bad type hint made it into JsonConfig on the next and pretest-cut branches [19:51:36] wanna make sure that doesn't make it to production (it's live on beta, revert is merging) [19:53:14] Thanks bvibber! [19:53:50] The images created from those branches do not currently run anywhere. [19:54:51] yay [19:57:00] will the release branches be cut straight from master later? if so we should be good then :D [19:57:26] oh fun. 
the queued castor-save-workspace-cache builds are blocking the completion of all the other jobs. i will cancel them [19:57:34] !log fab@deploy2002 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [19:57:43] !log fab@deploy2002 Finished deploy [airflow-dags/research@b5ce354]: (no justification provided) (duration: 00m 10s) [19:58:05] !log cancelling queued castor builds to unblock completed builds and jenkins restart [19:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:42] !log fab@deploy2002 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [19:58:51] !log fab@deploy2002 Finished deploy [airflow-dags/research@b5ce354]: (no justification provided) (duration: 00m 10s) [19:59:17] (03PS1) 10BCornwall: provision: Adjust thermal profile for F4 [cookbooks] - 10https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) [19:59:34] !log fab@deploy2002 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [19:59:37] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, and 2 others: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10565447 (10BCornwall) I don't see any change in performance - The throttling notifications only come sparingly so I doubt we'd see much of a difference until resources be... 
[19:59:43] !log fab@deploy2002 Finished deploy [airflow-dags/research@b5ce354]: (no justification provided) (duration: 00m 10s) [20:01:26] !log fab@deploy2002 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [20:01:36] !log fab@deploy2002 Finished deploy [airflow-dags/research@b5ce354]: (no justification provided) (duration: 00m 11s) [20:03:03] !log fab@deploy2002 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [20:03:26] PROBLEM - jenkins_service_running on contint1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [20:03:42] !log fab@deploy2002 Finished deploy [airflow-dags/research@b5ce354]: (no justification provided) (duration: 00m 40s) [20:03:48] !log restarting jenkins via systemctl due to crash [20:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:26] RECOVERY - jenkins_service_running on contint1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [20:06:16] ah! thanks dduvall [20:06:17] !log jenkins successfully restarted via `systemctl restart jenkins` [20:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:26] mutante: np [20:06:49] seems the "safe" restart was not so safe :) [20:06:58] hah, ack :) [20:08:53] (03CR) 10Dzahn: "yep, sounds good. on hold." 
[puppet] - 10https://gerrit.wikimedia.org/r/1120592 (https://phabricator.wikimedia.org/T386472) (owner: 10Dzahn) [20:11:28] (03CR) 10Dzahn: [C:04-1] "Brandon said I should not return "normal" but nothing as it's a special value" [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [20:12:57] (03PS3) 10Dzahn: varnish: create new policy that allows websockets but also caches [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) [20:15:05] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10565491 (10Dzahn) Arthur confirmed via email that this is the correct key and it has not been used elsewhere / in cloud before. Checking that box off as well. [20:15:51] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10565492 (10Dzahn) [20:17:57] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754#10565495 (10Dzahn) [20:19:09] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754#10565497 (10Dzahn) Hello @Ahoelzl, this request will need your approval. Please comment her... [20:20:35] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754#10565499 (10Dzahn) Hello @MShilova_WMF, please take a look at L3 and sign it if you agree. 
[20:21:15] (03CR) 10Dzahn: varnish: create new policy that allows websockets but also caches [puppet] - 10https://gerrit.wikimedia.org/r/1117941 (https://phabricator.wikimedia.org/T274228) (owner: 10Dzahn) [20:23:21] (03PS1) 10Bking: cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) [20:23:43] (03CR) 10CI reject: [V:04-1] cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [20:24:52] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10565521 (10Dzahn) Noticed now Arthur already has other non-deployment but production shell access, using this key: ` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIL/OQjQqWzDvDCW9JNQxNAXEwlJ1BL2D... [20:28:00] (03PS1) 10Dzahn: admin: upgrade arthurtaylor from restricted to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1121088 (https://phabricator.wikimedia.org/T386349) [20:29:16] (03CR) 10Dzahn: "This assumes the existing prod access key stays the same. (The new access request lists a new key)." [puppet] - 10https://gerrit.wikimedia.org/r/1121088 (https://phabricator.wikimedia.org/T386349) (owner: 10Dzahn) [20:30:26] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment for arthurtaylor - https://phabricator.wikimedia.org/T386349#10565532 (10Dzahn) 05Open→03In progress [20:46:37] Hi, I am sorry for the inconvenience caused, I didn't realize that I cannot use the phab to test, I will be using https://phab.wmflabs.org/ to test from now on. 
[20:53:45] (03CR) 10JHathaway: [C:03+1] puppetserver: fix puppet dir dependency issue in cloudvps masters [puppet] - 10https://gerrit.wikimedia.org/r/1121079 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [20:58:16] (03PS2) 10Bking: cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T2100). [21:00:05] tgr and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:20] o/ [21:00:45] o/ [21:03:04] o/ [21:03:09] hi - i can deploy [21:03:39] (03PS2) 10Gergő Tisza: NewUserMessage: Enable on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121055 [21:03:46] (03PS1) 10Arlolra: Revert parsoid read views on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121092 (https://phabricator.wikimedia.org/T356718) [21:04:08] o/ [21:04:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121055 (owner: 10Gergő Tisza) [21:05:00] thanks cjming :) [21:05:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, February 20 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121092 (https://phabricator.wikimedia.org/T356718) (owner: 10Arlolra) [21:05:51] (03Merged) 10jenkins-bot: NewUserMessage: Enable on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121055 (owner: 10Gergő Tisza) [21:06:20] !log cjming@deploy2002 Started scap sync-world: Backport 
for [[gerrit:1121055|NewUserMessage: Enable on test2wiki]] [21:07:11] cjming: I just tacked on a config change, hopefully we can squeeze that in [21:08:36] tgr: on test servers if you want to check [21:08:45] arlolra: np! [21:09:24] !log cjming@deploy2002 tgr, cjming: Backport for [[gerrit:1121055|NewUserMessage: Enable on test2wiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:10:13] (03PS3) 10Jdlrobson: Footer: Wikimedia icon should collapse at lower resolutions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) [21:10:27] tgr: ok to sync? [21:10:36] thanks cjming! looks good [21:10:40] !log cjming@deploy2002 tgr, cjming: Continuing with sync [21:11:16] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:11:57] (03PS3) 10Bking: cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) [21:12:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:12:17] (03CR) 10CI reject: [V:04-1] cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [21:14:42] (03CR) 10Ahmon Dancy: [C:03+1] admin: upgrade arthurtaylor from restricted to deployment [puppet] - 10https://gerrit.wikimedia.org/r/1121088 (https://phabricator.wikimedia.org/T386349) (owner: 10Dzahn) [21:17:12] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121055|NewUserMessage: Enable on test2wiki]] (duration: 10m 52s) [21:17:58] Jdlrobson: ok if i do 
your 2 config patches together? i'm a little time-crunched [21:18:31] i can also do separately - np [21:19:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:20:40] cjming: yep [21:20:43] they can all go out together [21:20:53] cool - thx! [21:21:05] (03PS4) 10Bernard Wang: Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120609 (https://phabricator.wikimedia.org/T386734) [21:23:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10565691 (10phaultfinder) [21:24:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:24:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.261s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:26:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 
10Jdlrobson) [21:26:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120609 (https://phabricator.wikimedia.org/T386734) (owner: 10Bernard Wang) [21:27:35] (03Merged) 10jenkins-bot: Footer: Wikimedia icon should collapse at lower resolutions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [21:27:40] (03Merged) 10jenkins-bot: Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120609 (https://phabricator.wikimedia.org/T386734) (owner: 10Bernard Wang) [21:28:07] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1119579|Footer: Wikimedia icon should collapse at lower resolutions (T384619)]], [[gerrit:1120609|Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki (T386734)]] [21:28:12] T384619: Update skins to support different logos at different resolutions - https://phabricator.wikimedia.org/T384619 [21:28:12] T386734: Deploy updated Search A/B test to eu/ca/test wiki - https://phabricator.wikimedia.org/T386734 [21:28:47] (03CR) 10Clare Ming: [C:03+2] Lazy image loading Grade C fallback is broken [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121077 (https://phabricator.wikimedia.org/T386400) (owner: 10Jdlrobson) [21:29:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.261s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:30:26] Jdlrobson: your config patches are up on test servers if 
you'd like to check [21:30:41] cjming: on it [21:31:10] !log cjming@deploy2002 jdlrobson, cjming, bwang: Backport for [[gerrit:1119579|Footer: Wikimedia icon should collapse at lower resolutions (T384619)]], [[gerrit:1120609|Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki (T386734)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:31:49] cjming: unfortunately https://gerrit.wikimedia.org/r/1119579 doesn't look like it's working correctly :( I messed up the syntax [21:31:58] other one looks good though [21:32:04] sorry.. what's best in this situation? [21:32:31] shoot - i should have done it separately - i guess i can sync and revert 1119579? [21:32:43] i can also do a follow up if helpful [21:32:50] let's do a follow up [21:32:57] i'm assuming it will be quick [21:33:01] so i can sync for now? [21:33:14] yes [21:33:18] !log cjming@deploy2002 jdlrobson, cjming, bwang: Continuing with sync [21:33:31] (as long as we revert https://gerrit.wikimedia.org/r/1119579 quickly after) [21:34:04] (03PS4) 10Bking: cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) [21:34:22] sure thing - i guess it's just still broken?
[21:34:25] (03CR) 10CI reject: [V:04-1] cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [21:35:58] no no it's working now but https://gerrit.wikimedia.org/r/1119579 is very broken [21:36:19] oh whoops - ok - i'll revert as soon as it finishes syncing [21:36:48] and then do your other backport [21:37:57] (03PS5) 10Bking: cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) [21:38:19] (03CR) 10CI reject: [V:04-1] cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [21:39:41] (03Merged) 10jenkins-bot: Lazy image loading Grade C fallback is broken [extensions/MobileFrontend] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121077 (https://phabricator.wikimedia.org/T386400) (owner: 10Jdlrobson) [21:39:53] (03PS6) 10Bking: cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) [21:39:55] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1119579|Footer: Wikimedia icon should collapse at lower resolutions (T384619)]], [[gerrit:1120609|Update Search AB test config, increase bucketing/sampling rates for eu/ca, deploy to testwiki (T386734)]] (duration: 11m 47s) [21:40:00] T384619: Update skins to support different logos at different resolutions - https://phabricator.wikimedia.org/T384619 [21:40:00] T386734: Deploy updated Search A/B test to eu/ca/test wiki - https://phabricator.wikimedia.org/T386734 [21:40:14] (03PS1) 10TrainBranchBot: Revert "Footer: Wikimedia icon should collapse at lower resolutions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121094 [21:40:14] (03CR) 10TrainBranchBot: "cjming@deploy2002 created a 
revert of this change as I6e16295ded46abf6ad2f7245921315ffca20d8b5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1119579 (https://phabricator.wikimedia.org/T384619) (owner: 10Jdlrobson) [21:40:14] (03CR) 10CI reject: [V:04-1] cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [21:40:30] cjming: looking at it [21:40:46] looking at what? [21:40:51] the follow up [21:40:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121094 (owner: 10TrainBranchBot) [21:41:26] Jdlrobson: reverting 1119579 now [21:41:27] (03PS7) 10Bking: cirrus: add commands to configure opensearch keystore [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) [21:41:41] (03Merged) 10jenkins-bot: Revert "Footer: Wikimedia icon should collapse at lower resolutions" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121094 (owner: 10TrainBranchBot) [21:42:13] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1121094|Revert "Footer: Wikimedia icon should collapse at lower resolutions"]] [21:45:12] !log cjming@deploy2002 trainbranchbot, cjming: Backport for [[gerrit:1121094|Revert "Footer: Wikimedia icon should collapse at lower resolutions"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:45:16] !log cjming@deploy2002 trainbranchbot, cjming: Continuing with sync [21:45:38] cjming: ok looks like it is fixed in production now phew [21:46:11] ya - sorry about that - tried to cut corners and it ends up taking longer anyway [21:46:54] !log fab@deploy2002 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [21:47:34] !log fab@deploy2002 Finished deploy [airflow-dags/research@b5ce354]: (no justification provided) (duration: 01m 19s) [21:47:41] Jdlrobson: as soon as revert 
finishes, i'll move onto your backport - should be quick [21:48:31] cjming: sounds good [21:49:32] (03PS1) 10BryanDavis: toolhub: Add config for crawler jobs history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121095 (https://phabricator.wikimedia.org/T292861) [21:49:33] (03PS1) 10BryanDavis: toolhub: Reduce crawler history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121096 (https://phabricator.wikimedia.org/T292861) [21:49:36] (03PS1) 10BryanDavis: toolhub: Bump container to 2025-02-19-214003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121097 (https://phabricator.wikimedia.org/T292861) [21:51:37] (03PS1) 10Jdlrobson: Take 2: Footer: Wikimedia icon should collapse at lower resolutions"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121098 (https://phabricator.wikimedia.org/T384619) [21:52:01] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121094|Revert "Footer: Wikimedia icon should collapse at lower resolutions"]] (duration: 09m 48s) [21:52:37] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1121077|Lazy image loading Grade C fallback is broken (T386400)]] [21:52:41] T386400: [Regression] Lazy image loading Grade C fallback is broken - https://phabricator.wikimedia.org/T386400 [21:54:53] Jdlrobson: backport on mwdebug if you want to check [21:54:57] on it [21:55:39] !log cjming@deploy2002 cjming, jdlrobson: Backport for [[gerrit:1121077|Lazy image loading Grade C fallback is broken (T386400)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:56:02] (03PS2) 10Arlolra: Revert parsoid read views on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121092 (https://phabricator.wikimedia.org/T356718) [21:56:20] arlolra: still around? 
i can do your patch next [21:56:29] yup, thanks [21:57:40] cjming: please sync [21:57:44] !log cjming@deploy2002 cjming, jdlrobson: Continuing with sync [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T2200) [22:00:38] Jdlrobson: i need to run in a few minutes after i do arlolra's config patch -- sorry for the footer icon debacle - i thought there'd be more time to fix the revert [22:01:09] but the rest of your changes should be live -- backport will be live shortly [22:01:26] np [22:01:28] the footer icon can wait [22:01:31] not urgent! [22:01:47] cool - thx [22:02:18] if Abstract Wikipedia folks are around - is it ok to do one more config patch? [22:04:11] I think they would agree in the abstract [22:04:19] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121077|Lazy image loading Grade C fallback is broken (T386400)]] (duration: 11m 41s) [22:04:23] T386400: [Regression] Lazy image loading Grade C fallback is broken - https://phabricator.wikimedia.org/T386400 [22:04:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121092 (https://phabricator.wikimedia.org/T356718) (owner: 10Arlolra) [22:05:16] (03Merged) 10jenkins-bot: Revert parsoid read views on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121092 (https://phabricator.wikimedia.org/T356718) (owner: 10Arlolra) [22:05:42] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1121092|Revert parsoid read views on frwiktionary (T356718 T386272)]] [22:05:47] T356718: Support nested special page transclusion - https://phabricator.wikimedia.org/T356718 [22:05:48] T386272: Parsoid Read Views to Wiktionary deploy ~2025-02-13 - https://phabricator.wikimedia.org/T386272 [22:06:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.378s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:08:12] arlolra: on test servers if you want to verify - lmk if i can sync [22:08:15] (03CR) 10Bking: [V:03+2 C:03+2] "self-merging in the interest of time, as this will not affect any production hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1121087 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [22:08:35] cjming: looks good, please continue [22:08:43] !log cjming@deploy2002 arlolra, cjming: Backport for [[gerrit:1121092|Revert parsoid read views on frwiktionary (T356718 T386272)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:09:19] !log cjming@deploy2002 arlolra, cjming: Continuing with sync [22:10:16] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754#10565822 (10MShilova_WMF) Thank you, @Dzahn . I confirm that I signed the document. [22:11:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid/main (k8s) 1.378s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:15:53] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121092|Revert parsoid read views on frwiktionary (T356718 T386272)]] (duration: 10m 10s) [22:16:02] arlolra: should be live :) [22:16:06] cjming: thank you! 
[22:16:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:16:10] yw! [22:16:46] !log end of UTC late backport window [22:18:07] (03CR) 10BryanDavis: [C:03+2] toolhub: Add config for crawler jobs history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121095 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [22:19:06] (03CR) 10BryanDavis: [C:03+2] toolhub: Reduce crawler history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121096 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [22:19:20] (03Merged) 10jenkins-bot: toolhub: Add config for crawler jobs history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121095 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [22:20:15] (03Merged) 10jenkins-bot: toolhub: Reduce crawler history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121096 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [22:20:50] (03CR) 10BryanDavis: [C:03+2] toolhub: Bump container to 2025-02-19-214003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121097 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [22:22:05] (03Merged) 10jenkins-bot: toolhub: Bump container to 2025-02-19-214003-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121097 (https://phabricator.wikimedia.org/T292861) (owner: 10BryanDavis) [22:25:26] (03CR) 10Alexandros Kosiaris: [C:04-1] "Minor pedantic comment, plus waiting for Moritz's answer on the sysuser thing, but otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:30:02] !log fab@deploy2002 Started deploy [airflow-dags/research@b5ce354]: (no justification provided) [22:30:21] !log 
bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [22:30:37] !log fab@deploy2002 Finished deploy [airflow-dags/research@b5ce354]: (no justification provided) (duration: 00m 38s) [22:30:59] (03CR) 10Alexandros Kosiaris: [C:04-1] "I 've just noticed that what also be required is an include of" [puppet] - 10https://gerrit.wikimedia.org/r/1094531 (https://phabricator.wikimedia.org/T383945) (owner: 10Ahmon Dancy) [22:31:31] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [22:34:03] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [22:35:18] PROBLEM - Host ganeti1025 is DOWN: PING CRITICAL - Packet loss = 100% [22:35:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:35:40] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [22:35:48] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [22:36:16] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:36:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:36:44] RECOVERY - Host ganeti1025 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [22:36:58] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [22:37:06] PROBLEM - Host ml-serve-ctrl1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:37:24] PROBLEM - Host centrallog1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:37:36] PROBLEM - Etcd cluster health on kubestagemaster1004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [22:38:24] RECOVERY - Host centrallog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [22:38:30] PROBLEM - Host 
netboxdb1003 is DOWN: PING CRITICAL - Packet loss = 100% [22:38:36] RECOVERY - Etcd cluster health on kubestagemaster1004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [22:39:22] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:39:32] FIRING: [3x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:39:34] (03PS1) 10Bking: cirrus: rename s3 resources [puppet] - 10https://gerrit.wikimedia.org/r/1121101 (https://phabricator.wikimedia.org/T380752) [22:39:36] RECOVERY - Host ml-serve-ctrl1001 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [22:40:00] RECOVERY - Host netboxdb1003 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [22:40:20] RECOVERY - BFD status on cr1-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:40:30] (03PS2) 10Bking: cirrus: rename s3 resources [puppet] - 10https://gerrit.wikimedia.org/r/1121101 (https://phabricator.wikimedia.org/T380752) [22:40:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121101 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [22:41:18] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.92 ms [22:41:20] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:42:04] PROBLEM - Host lvs1017 is DOWN: PING CRITICAL - Packet loss = 100% [22:42:21] RESOLVED: [3x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:42:44] RECOVERY - Host lvs1017 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [22:44:14] (03CR) 10Bking: [C:03+2] cirrus: rename s3 resources [puppet] - 10https://gerrit.wikimedia.org/r/1121101 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [22:44:36] (03CR) 10Bking: [V:03+2 C:03+2] "self-merging, as this does not affect production hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1121101 (https://phabricator.wikimedia.org/T380752) (owner: 10Bking) [22:49:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:49:00] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:49:50] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:49:50] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.180 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:52:04] (03PS1) 10Eevans: cassandra: setup 'dev' target for Cassandra 4.1.8 [puppet] - 10https://gerrit.wikimedia.org/r/1121102 (https://phabricator.wikimedia.org/T385819) [22:52:20] ok confirmed our temp revert of the broken type hints has hit beta, and all is well in JsonConfig-land <3 [22:52:20] after more thorough testing the fixed version patch will be restored [22:56:34] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to Dashboards in Superset / Hive interfaces (like Hue) that do access private data for Mariya Shilova - https://phabricator.wikimedia.org/T386754#10565955 (10Dzahn) [22:56:38] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [22:57:20] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - 
Packet loss = 100% [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250219T2300) [23:01:40] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 123.14 ms [23:02:22] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 94.29 ms [23:04:43] (03PS1) 10Bking: Revert "cirrus: rename s3 resources" [puppet] - 10https://gerrit.wikimedia.org/r/1121104 [23:06:22] (03CR) 10Ryan Kemper: [C:03+1] "we fixed the puppetserver hiera secret path which made this code unnecessary" [puppet] - 10https://gerrit.wikimedia.org/r/1121104 (owner: 10Bking) [23:06:35] (03CR) 10Bking: [V:03+2 C:03+2] Revert "cirrus: rename s3 resources" [puppet] - 10https://gerrit.wikimedia.org/r/1121104 (owner: 10Bking) [23:07:19] (03PS1) 10Bking: Revert "cirrus: add commands to configure opensearch keystore" [puppet] - 10https://gerrit.wikimedia.org/r/1121106 [23:07:24] (03CR) 10Ryan Kemper: [C:03+1] "we fixed the puppetserver hiera secret path which made this code unnecessary" [puppet] - 10https://gerrit.wikimedia.org/r/1121106 (owner: 10Bking) [23:07:29] (03CR) 10Ryan Kemper: [C:03+2] Revert "cirrus: add commands to configure opensearch keystore" [puppet] - 10https://gerrit.wikimedia.org/r/1121106 (owner: 10Bking) [23:07:31] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] Revert "cirrus: add commands to configure opensearch keystore" [puppet] - 10https://gerrit.wikimedia.org/r/1121106 (owner: 10Bking) [23:18:43] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391312 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:23:43] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to codfw RIPE Atlas anchor: failures over threshold for measurement 32391312 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:42:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10566082 (10Papaul) @VRiley-WMF any updates on those 2 hosts?