[00:02:31] !log tgr@deploy2002 tgr: Continuing with sync [00:08:55] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124866|Roll out SUL3 signup to 1% of users on most group 1 wikis (T384007)]] (duration: 29m 13s) [00:08:58] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [00:09:26] !log UTC late deploys done [00:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:44] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 636.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:20:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1049:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1049 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:21:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [00:25:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1049:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1049 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:25:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [00:27:22] !incidents [00:27:22] 5712 (UNACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [00:27:28] !ack 5712 [00:27:28] 5712 (ACKED) HaproxyUnavailable cache_text global sre (thanos-rule) [00:27:45] looking [00:29:18] ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10608276 (Neobeta61) i would recommend updating to the specs on drivers in my screenshot. I do not see the same issue on the same kern... [00:29:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10608277 (phaultfinder) [00:29:56] afk but I can be back home soon if needed [00:30:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [00:30:47] at a glance this looks self-resolved though so I'm not going to rush [00:31:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1049:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1049 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:33:47] !log zabe@mwmaint2002:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php commonswiki --delete /home/zabe/text_table_cleanup/commonswiki --sleep 0.5 # T183490 [00:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:50] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [00:36:40] RESOLVED: KubernetesRsyslogDown: rsyslog on 
wikikube-worker1049:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1049 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:38:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124898 [00:38:37] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124898 (owner: TrainBranchBot) [00:40:09] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10608285 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1181.eqiad.wmnet with OS bullseye executed with errors: - an-worker1181 (**FAIL... [00:40:21] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye [00:40:30] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10608286 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1181.eqiad.wmnet with OS bullseye [00:47:55] (PS3) Scott French: mw-(api-ext|web): serve 5% of residual traffic on 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845) [00:50:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1124898 (owner: TrainBranchBot) [00:55:00] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1181.eqiad.wmnet with reason: host reimage [00:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:58:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1181.eqiad.wmnet with reason: host reimage [01:09:10] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124900 [01:09:10] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124900 (owner: TrainBranchBot) [01:18:54] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye [01:19:36] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:19:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [01:19:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1181.eqiad.wmnet with OS bullseye [01:20:05] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10608333 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1181.eqiad.wmnet with OS bullseye completed: - an-worker1181 (**WARN**) - Rem... 
[01:22:07] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10608334 (Jclark-ctr) [01:27:35] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1124900 (owner: TrainBranchBot) [01:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:55:55] (PS1) Daimona Eaytoy: Use namespaced Title class [mediawiki-config] - https://gerrit.wikimedia.org/r/1124911 (https://phabricator.wikimedia.org/T388085) [01:57:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1124911 (https://phabricator.wikimedia.org/T388085) (owner: Daimona Eaytoy) [02:04:51] (CR) Reedy: [C:+1] Use namespaced Title class [mediawiki-config] - https://gerrit.wikimedia.org/r/1124911 (https://phabricator.wikimedia.org/T388085) (owner: Daimona Eaytoy) [02:21:20] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [02:22:14] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [02:23:41] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:26:00] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [02:26:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:27:15] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [02:27:25] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:32:02] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2045 to codfw - jhancock@cumin2002" [02:32:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2045 to codfw - jhancock@cumin2002" [02:32:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:36:42] FIRING: JobUnavailable: Reduced availability 
for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:14] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [02:37:20] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [02:37:44] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 40.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:39:15] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1008.eqiad.wmnet with OS bullseye [02:41:35] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2049 to codfw - jhancock@cumin2002" [02:41:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2049 to codfw - jhancock@cumin2002" [02:41:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [02:42:21] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2045 [02:42:22] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2046 [02:42:23] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2047 [02:42:24] !log jhancock@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host ganeti2048 [02:42:25] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2049 [02:42:26] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti2050 [02:42:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2045 [02:42:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2046 [02:42:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2047 [02:42:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2048 [02:43:02] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (backup1013, ...), Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [02:43:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2049 [02:43:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti2050 [03:03:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:03:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:03:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:03:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:03:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2049.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:03:59] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:04:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:05:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:05:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:06:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:06:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:08:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:08:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:09:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2050.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:09:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2048.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:09:40] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2049.mgmt.codfw.wmnet with 
chassis set policy FORCE_RESTART [03:09:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:10:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:10:03] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:10:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2047.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:10:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:10:55] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:14:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:25:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:26:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:36:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2046.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:48:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set 
policy FORCE_RESTART [03:48:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:49:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:54:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [03:58:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [04:04:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [04:20:41] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [04:24:32] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1178 - vriley@cumin1002" [04:24:38] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1178 - vriley@cumin1002" [04:24:38] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [04:25:29] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1178 [04:26:46] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1178 [04:28:32] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [04:31:40] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [04:34:47] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10608528 (phaultfinder) 
[04:36:06] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1179] - vriley@cumin1002" [04:36:12] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1179] - vriley@cumin1002" [04:36:12] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [04:37:08] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1179 [04:38:15] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [04:38:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1179 [04:39:42] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [04:42:39] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1182] - vriley@cumin1002" [04:42:44] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1182] - vriley@cumin1002" [04:42:44] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [04:44:21] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1182.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [04:45:10] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [04:45:38] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host 
an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [04:46:36] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10608540 (VRiley-WMF) [04:47:07] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [04:52:31] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1179.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [04:56:27] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [04:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:00:22] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1182.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [05:01:06] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1180] - vriley@cumin1002" [05:01:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1180] - vriley@cumin1002" [05:01:11] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:01:27] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [05:02:20] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1180 [05:04:24] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1180 [05:06:27] !log 
vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1184] - vriley@cumin1002" [05:06:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1184] - vriley@cumin1002" [05:06:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:07:18] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1184 [05:07:30] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1184 [05:08:11] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [05:08:22] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1180.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [05:09:05] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [05:13:20] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1185] - vriley@cumin1002" [05:13:25] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt [an-worker1185] - vriley@cumin1002" [05:13:26] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:14:23] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [05:15:36] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10608547 (VRiley-WMF) 
[05:19:21] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [05:20:34] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [05:25:09] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [05:28:01] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1180.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [05:42:06] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1184.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [05:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:08:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2152', diff saved to https://phabricator.wikimedia.org/P74108 and previous config saved to /var/cache/conftool/dbconfig/20250306-060842-marostegui.json [06:08:52] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2152.codfw.wmnet [06:09:37] (PS1) Gerrit maintenance bot: mariadb: Promote db1193 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1124936 (https://phabricator.wikimedia.org/T388093) [06:10:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1193 with weight 0 T388093', 
diff saved to https://phabricator.wikimedia.org/P74109 and previous config saved to /var/cache/conftool/dbconfig/20250306-061052-marostegui.json [06:10:56] T388093: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T388093 [06:10:58] PROBLEM - Etcd cluster health on dse-k8s-etcd1001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [06:11:15] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s8 T388093 [06:11:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db1193 from API/vslow/dump T388093', diff saved to https://phabricator.wikimedia.org/P74110 and previous config saved to /var/cache/conftool/dbconfig/20250306-061133-marostegui.json [06:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:12:12] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-etcd1001.eqiad.wmnet with OS bookworm [06:12:46] (CR) Marostegui: [C:+2] mariadb: Promote db1193 to s8 master [puppet] - https://gerrit.wikimedia.org/r/1124936 (https://phabricator.wikimedia.org/T388093) (owner: Gerrit maintenance bot) [06:15:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2152.codfw.wmnet [06:16:22] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2152.codfw.wmnet with reason: Index rebuild [06:16:31] !log Starting s8 eqiad failover from db1209 to db1193 - T388093 [06:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:34] T388093: Switchover s8 master (db1209 -> db1193) - https://phabricator.wikimedia.org/T388093 [06:16:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1193 to s8 primary T388093', diff saved to https://phabricator.wikimedia.org/P74111 and 
previous config saved to /var/cache/conftool/dbconfig/20250306-061650-marostegui.json [06:17:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1209 T388093', diff saved to https://phabricator.wikimedia.org/P74112 and previous config saved to /var/cache/conftool/dbconfig/20250306-061736-marostegui.json [06:19:51] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db1209.eqiad.wmnet [06:22:48] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-etcd1001.eqiad.wmnet with reason: host reimage [06:25:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1209.eqiad.wmnet [06:25:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1209.eqiad.wmnet with reason: Index rebuild [06:25:55] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-etcd1001.eqiad.wmnet with reason: host reimage [06:34:26] (PS1) Marostegui: installserver: Do not reimage db1255 [puppet] - https://gerrit.wikimedia.org/r/1125029 (https://phabricator.wikimedia.org/T381475) [06:37:35] (CR) Marostegui: [C:+2] installserver: Do not reimage db1255 [puppet] - https://gerrit.wikimedia.org/r/1125029 (https://phabricator.wikimedia.org/T381475) (owner: Marostegui) [06:41:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:44:10] (CR) Muehlenhoff: [C:+1] "Looks good" [debs/benthos] - https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) (owner: Fabfur) [06:47:43] (CR) Muehlenhoff: [C:+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1124877 (https://phabricator.wikimedia.org/T388042) (owner: 10Dzahn) [06:50:14] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:50:16] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:50:16] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T0700) [07:00:05] marostegui, Amir1, and federico3: It is that lovely time of the day again! You are hereby commanded to deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T0700). [07:12:28] 06SRE, 10LDAP-Access-Requests: Grant Access to logstash-access for jhuneidi - https://phabricator.wikimedia.org/T388044#10608673 (10MoritzMuehlenhoff) Requests to the logstash-access LDAP group are handled within Wikimedia IDM: @jeena please log into https://idm.wikimedia.org and request the group by following... 
[07:13:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:19:58] RECOVERY - Etcd cluster health on dse-k8s-etcd1001 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [07:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10608687 (10phaultfinder) [07:24:48] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-etcd1001.eqiad.wmnet with OS bookworm [07:25:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1032.eqiad.wmnet with OS bookworm [07:25:06] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10608690 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1032.eqiad.wmnet with OS bookworm [07:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:38:38] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host dse-k8s-etcd1003.eqiad.wmnet with OS bookworm [07:39:36] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [07:40:48] (03PS1) 10Hashar: Remove obsolete CirrusSearch config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 [07:41:47] (03CR) 10Hashar: "Those settings no more exist in CirrusSearch, see linked changes for reference and:" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar) [07:42:38] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1032.eqiad.wmnet with reason: host reimage [07:43:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:43:36] (03CR) 10Jelto: [C:03+2] miscweb: add support for external-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123738 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [07:45:18] (03Merged) 10jenkins-bot: miscweb: add support for external-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1123738 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [07:46:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1032.eqiad.wmnet with reason: host reimage [07:46:39] (03CR) 10Elukey: [C:03+1] sre.hosts.provision: disable HostHeaderCheck [cookbooks] - 10https://gerrit.wikimedia.org/r/1124845 (https://phabricator.wikimedia.org/T382416) (owner: 10Volans) [07:46:55] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. The current settings continue to be absolutely fine." [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [07:48:42] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10608738 (10Ben.buchenau) Hi @KFrancis , yes this is my correct name. Many thanks! 
[07:48:58] (03CR) 10Slyngshede: [C:03+2] C:apereo_cas Specify encryption algorithms for CAS 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1115801 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [07:51:04] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1233.eqiad.wmnet onto db1254.eqiad.wmnet [07:52:11] (03PS1) 10Hashar: Fix wgCirrusSearchSimilarityProfiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 [07:52:33] !log stevemunene@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-etcd1003.eqiad.wmnet with reason: host reimage [07:55:53] (03PS1) 10KartikMistry: MinT: Increase rediness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125093 (https://phabricator.wikimedia.org/T386889) [07:56:33] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-etcd1003.eqiad.wmnet with reason: host reimage [07:57:23] (03PS1) 10Muehlenhoff: apereo_cas: Remove some obsolete version checks [puppet] - 10https://gerrit.wikimedia.org/r/1125094 [08:00:06] Amir1, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T0800). [08:00:06] No Gerrit patches in the queue for this window AFAICS. [08:01:32] (03PS1) 10Hashar: Drop CodeEditorEnableCore flag: always true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 [08:04:42] (03CR) 10Hashar: "`$wgCodeEditorEnableCore` was found to no more exist. 
For context see:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125095 (owner: 10Hashar) [08:04:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1032.eqiad.wmnet with OS bookworm [08:05:03] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10608765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1032.eqiad.wmnet with OS bookworm completed: - ganeti103... [08:09:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [08:09:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125094 (owner: 10Muehlenhoff) [08:11:06] (03PS1) 10Hashar: Remove Cognate legacy settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) [08:11:48] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:12:16] (03CR) 10Vgutierrez: Fix previous commit (031 comment) [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [08:13:01] (03CR) 10Hashar: "`$wgCognateDb` and `$wgCognateCluster` were found to no more exist. 
For reference:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar) [08:25:37] (03CR) 10Michael Große: [C:03+1] [Growth] Set default api lookahead size to 10 (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) (owner: 10Sergio Gimeno) [08:26:39] (03PS1) 10Jelto: deployment_server: add puppetdb rsync to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) [08:29:52] PROBLEM - Host ganeti1032 is DOWN: PING CRITICAL - Packet loss = 100% [08:35:37] ^ inactive server, being debugged [08:37:56] (03CR) 10DCausse: "thanks! seems like you remove only one out of the 3 you identified, is this expected?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar) [08:39:30] jouncebot: nowandnext [08:39:30] For the next 0 hour(s) and 20 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T0800) [08:39:30] In 0 hour(s) and 20 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T0900) [08:40:29] o/ I'd like to deploy a config patch I forgot to add to the back window, please let me know if you have objections to this [08:41:16] !log adding https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1121666 to the "UTC morning backport window" [08:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121666 (https://phabricator.wikimedia.org/T271776) (owner: 10DCausse) [08:43:31] (03Merged) 10jenkins-bot: cirrus: configure wgCirrusSearchLanguageKeywordExtraFields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121666 
(https://phabricator.wikimedia.org/T271776) (owner: 10DCausse) [08:44:30] !log dcausse@deploy2002 Started scap sync-world: Backport for [[gerrit:1121666|cirrus: configure wgCirrusSearchLanguageKeywordExtraFields (T271776)]] [08:44:36] T271776: Allow limiting lexeme searches by language - https://phabricator.wikimedia.org/T271776 [08:47:14] (03CR) 10David Caro: [C:03+1] "LGTM thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1124800 (https://phabricator.wikimedia.org/T354762) (owner: 10Filippo Giunchedi) [08:47:41] !log dcausse@deploy2002 dcausse: Backport for [[gerrit:1121666|cirrus: configure wgCirrusSearchLanguageKeywordExtraFields (T271776)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:50:07] !log dcausse@deploy2002 dcausse: Continuing with sync [08:51:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [08:51:52] RECOVERY - Host ganeti1032 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [08:52:42] (03CR) 10Volans: [C:03+2] sre.hosts.provision: disable HostHeaderCheck [cookbooks] - 10https://gerrit.wikimedia.org/r/1124845 (https://phabricator.wikimedia.org/T382416) (owner: 10Volans) [08:54:03] (03PS2) 10Jelto: deployment_server: add puppetdb rsync to external_services [puppet] - 10https://gerrit.wikimedia.org/r/1125098 (https://phabricator.wikimedia.org/T350794) [08:55:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [08:56:23] !log dcausse@deploy2002 Finished scap sync-world: Backport for [[gerrit:1121666|cirrus: configure wgCirrusSearchLanguageKeywordExtraFields (T271776)]] (duration: 11m 53s) [08:56:26] !log volans@cumin1002 START - Cookbook 
sre.dns.netbox [08:56:26] T271776: Allow limiting lexeme searches by language - https://phabricator.wikimedia.org/T271776 [08:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:59:51] (03Merged) 10jenkins-bot: sre.hosts.provision: disable HostHeaderCheck [cookbooks] - 10https://gerrit.wikimedia.org/r/1124845 (https://phabricator.wikimedia.org/T382416) (owner: 10Volans) [09:00:05] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T0900) [09:00:07] trainnnn [09:03:36] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125100 (https://phabricator.wikimedia.org/T386214) [09:03:37] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125100 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [09:04:27] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125100 (https://phabricator.wikimedia.org/T386214) (owner: 10TrainBranchBot) [09:09:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1032.eqiad.wmnet to cluster eqiad and group A [09:10:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1032.eqiad.wmnet to cluster eqiad and group A [09:12:27] (03PS2) 10Fabfur: Fix previous commit [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) [09:12:49] (03CR) 10Fabfur: [C:03+1] Fix previous commit (031 comment) [debs/benthos] - 10https://gerrit.wikimedia.org/r/1124894 (https://phabricator.wikimedia.org/T256098) (owner: 10Fabfur) [09:13:22] !log hashar@deploy2002 rebuilt and synchronized 
wikiversions files: group2 to 1.44.0-wmf.19 refs T386214 [09:13:25] T386214: 1.44.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T386214 [09:17:43] fabfur: I can surely add CI on operations/debs/benthos :) [09:19:11] hashar thanks! It's planned to be migrated to Gitlab? [09:20:47] nop [09:21:10] and the migration is pretty much paused after we have determined Gitlab does not fit some repositories/workflows/teams requirements ( https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/3Y2RZKZGYXZWCHY7OEZJMXCFLOZC5G3J/ ) [09:22:52] !log installing openssh security updates [09:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:33] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: add sync-data script [puppet] - 10https://gerrit.wikimedia.org/r/1124749 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [09:24:37] fabfur: I am adding CI with https://gerrit.wikimedia.org/r/c/integration/config/+/1125102 it will not vote verified-1 for now though so that is non blocking :) [09:24:45] (03PS2) 10Filippo Giunchedi: prometheus: add sync-data script [puppet] - 10https://gerrit.wikimedia.org/r/1124749 (https://phabricator.wikimedia.org/T383232) [09:24:52] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] prometheus: add sync-data script [puppet] - 10https://gerrit.wikimedia.org/r/1124749 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [09:24:55] hashar do you need a +1 from me?
[09:25:08] if you want :) [09:25:11] but it is not necessary [09:25:20] (03PS2) 10Filippo Giunchedi: prometheus: replace prometheus::migration with prometheus-sync-data [puppet] - 10https://gerrit.wikimedia.org/r/1124751 (https://phabricator.wikimedia.org/T383232) [09:25:37] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] prometheus: replace prometheus::migration with prometheus-sync-data [puppet] - 10https://gerrit.wikimedia.org/r/1124751 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [09:25:43] that should catch compilation failures hopefully [09:26:00] (03PS17) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [09:27:30] (03PS18) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [09:28:35] !log deploy additional grants to m1 T387892 [09:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:38] T387892: Decommission backup1001, backup1002, backup2001, backup2002 (and their arrays) - https://phabricator.wikimedia.org/T387892 [09:29:25] (03CR) 10Jcrespo: [C:03+2] dbbackups: Add additional m1 grants for backup[12]013 stats user [puppet] - 10https://gerrit.wikimedia.org/r/1124834 (https://phabricator.wikimedia.org/T387892) (owner: 10Jcrespo) [09:29:33] (03PS1) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/benthos] - 10https://gerrit.wikimedia.org/r/1125103 [09:29:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [09:32:16] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1124821 (https://phabricator.wikimedia.org/T387179) (owner: 10Arturo Borrero Gonzalez) [09:32:43] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" [puppet] - 10https://gerrit.wikimedia.org/r/1124843 (https://phabricator.wikimedia.org/T386808) (owner: 10Kamila Součková) [09:33:17] (03CR) 10David Caro: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1124821 (https://phabricator.wikimedia.org/T387179) (owner: 10Arturo Borrero Gonzalez) [09:33:29] (03CR) 10CI reject: [V:04-1] sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [09:33:41] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: cloudvirt: increase conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/1124821 (https://phabricator.wikimedia.org/T387179) (owner: 10Arturo Borrero Gonzalez) [09:34:49] (03PS1) 10Isabelle Hurbain-Palatin: Fix nested refs with the same name but a different group [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125104 (https://phabricator.wikimedia.org/T387800) [09:35:11] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-etcd1003.eqiad.wmnet with OS bookworm [09:35:47] (03CR) 10Filippo Giunchedi: [C:03+2] "Thank you for the reviews!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1124790 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:36:20] (03CR) 10Filippo Giunchedi: [C:03+2] sre: route AlertLintProblem to the alert file team [alerts] - 10https://gerrit.wikimedia.org/r/1124800 (https://phabricator.wikimedia.org/T354762) (owner: 10Filippo Giunchedi) [09:38:39] (03CR) 10Hashar: "recheck with backports ( https://gerrit.wikimedia.org/r/c/integration/config/+/1125105 )" [debs/benthos] - 10https://gerrit.wikimedia.org/r/1125103 (owner: 10Hashar) [09:38:55] (03PS1) 10Filippo Giunchedi: prometheus: ship sync-data in bin/ not sbin/ [puppet] - 10https://gerrit.wikimedia.org/r/1125107 (https://phabricator.wikimedia.org/T383232) [09:41:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125104 (https://phabricator.wikimedia.org/T387800) (owner: 10Isabelle Hurbain-Palatin) [09:41:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:42:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125104 (https://phabricator.wikimedia.org/T387800) (owner: 10Isabelle Hurbain-Palatin) [09:42:36] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:44:28] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53658 bytes in 1.326 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:44:46] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: 
HTTP/1.1 200 OK - 8923 bytes in 8.601 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:45:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10608980 (10MoritzMuehlenhoff) [09:46:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10608981 (10MoritzMuehlenhoff) [09:46:40] !log disabling iDrac's WebServer.HostHeaderCheck on the remaining hosts that have it - T382416 [09:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:44] T382416: Globally disable IDRAC.WebServer.HostHeaderCheck - https://phabricator.wikimedia.org/T382416 [09:47:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet [09:47:24] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10608990 (10ops-monitoring-bot) Draining ganeti1035.eqiad.wmnet of running VMs [09:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:48:24] (03PS1) 10Muehlenhoff: Switch ganeti1035 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1125109 [09:49:15] (03PS1) 10Ilias Sarantopoulos: ml-services: update reference-quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125110 (https://phabricator.wikimedia.org/T387019) [09:49:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1035.eqiad.wmnet [09:51:19] 
06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10609003 (10MoritzMuehlenhoff) [09:51:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1035.eqiad.wmnet [09:51:56] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10609004 (10ops-monitoring-bot) Draining ganeti1035.eqiad.wmnet of running VMs [09:52:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10609005 (10MatthewVernon) It's not the same kernel, though - you've got `5.14.0-503.11.1.el9_5` from RHEL, and we have `5.10.234-1` fro... [09:55:16] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/benthos] - 10https://gerrit.wikimedia.org/r/1125103 (owner: 10Hashar) [09:58:01] (03PS2) 10Volans: sre.ganeti: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124394 [09:58:08] (03CR) 10Volans: [C:03+2] sre.ganeti: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124394 (owner: 10Volans) [09:58:50] (03PS2) 10Volans: sre.network: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124396 [09:58:53] (03CR) 10Volans: [C:03+2] sre.network: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124396 (owner: 10Volans) [09:59:42] (03PS2) 10Volans: sre.swift: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124398 [09:59:46] (03CR) 10Volans: [C:03+2] sre.swift: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124398 (owner: 10Volans) [10:00:34] (03CR) 10Marostegui: "This looks good to me, but test it with db2230 (test-s4 host) first." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [10:01:03] (03CR) 10Volans: sre.hosts: use new run_cookbook features (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1124395 (owner: 10Volans) [10:01:45] (03CR) 10Btullis: [C:03+1] hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez) [10:03:31] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10609022 (10MatthewVernon) @Neobeta61 could you be clearer as to which drivers you think should be updated to which version(s), please?... [10:04:25] (03Merged) 10jenkins-bot: sre.ganeti: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124394 (owner: 10Volans) [10:05:45] (03Merged) 10jenkins-bot: sre.network: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124396 (owner: 10Volans) [10:06:15] (03Merged) 10jenkins-bot: sre.swift: use new run_cookbook features [cookbooks] - 10https://gerrit.wikimedia.org/r/1124398 (owner: 10Volans) [10:10:40] !log volans@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:10:42] !log volans@cumin1002 START - Cookbook sre.dns.netbox [10:10:56] (03PS1) 10Jcrespo: dbbackups: Prepare backup1002, backup2002 for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) [10:14:02] (03PS2) 10Jcrespo: dbbackups: Prepare backup1002, backup2002 for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/1125114 (https://phabricator.wikimedia.org/T387892) [10:14:40] !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Unblock others adds an-worker1186 - volans@cumin1002" [10:14:46] 
!log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Unblock others adds an-worker1186 - volans@cumin1002" [10:14:46] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:16:10] !log Drop phabricator_search.search_documentfield_BKUP T387174 [10:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:13] T387174: Drop phabricator_search.search_documentfield_BKUP table - https://phabricator.wikimedia.org/T387174 [10:18:20] (03PS1) 10Filippo Giunchedi: sre: non-greedy match for AlertLintProblem [alerts] - 10https://gerrit.wikimedia.org/r/1125116 (https://phabricator.wikimedia.org/T354762) [10:19:35] (03CR) 10Jgiannelos: Fix nested refs with the same name but a different group (031 comment) [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125104 (https://phabricator.wikimedia.org/T387800) (owner: 10Isabelle Hurbain-Palatin) [10:20:25] (03CR) 10Filippo Giunchedi: [C:03+2] sre: non-greedy match for AlertLintProblem [alerts] - 10https://gerrit.wikimedia.org/r/1125116 (https://phabricator.wikimedia.org/T354762) (owner: 10Filippo Giunchedi) [10:22:27] (03CR) 10Volans: "Nice to see the conversion of hardcoded commands to their related Instance functionalities. 
General approach LGTM, left few minor comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [10:23:02] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) (owner: 10Clément Goubert) [10:23:35] (03CR) 10Isabelle Hurbain-Palatin: Fix nested refs with the same name but a different group (031 comment) [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125104 (https://phabricator.wikimedia.org/T387800) (owner: 10Isabelle Hurbain-Palatin) [10:25:32] (03CR) 10Jgiannelos: Fix nested refs with the same name but a different group (031 comment) [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125104 (https://phabricator.wikimedia.org/T387800) (owner: 10Isabelle Hurbain-Palatin) [10:28:21] (03PS19) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [10:28:46] (03CR) 10Jgiannelos: [C:03+1] Fix nested refs with the same name but a different group [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125104 (https://phabricator.wikimedia.org/T387800) (owner: 10Isabelle Hurbain-Palatin) [10:30:55] (03PS20) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [10:32:36] (03CR) 10Gkyziridis: [C:03+1] "Thnx for working on that." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1125110 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [10:32:40] (03PS1) 10Hnowlan: trafficserver: use mobileapps/pcs directly on more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1125118 (https://phabricator.wikimedia.org/T387277) [10:37:03] (03PS1) 10Jelto: Remove profile::kubernetes::* from role::ci [puppet] - 10https://gerrit.wikimedia.org/r/1125119 (https://phabricator.wikimedia.org/T288629) [10:37:54] (03CR) 10Jelto: "check-experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125119 (https://phabricator.wikimedia.org/T288629) (owner: 10Jelto) [10:40:38] (03CR) 10Jelto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125119 (https://phabricator.wikimedia.org/T288629) (owner: 10Jelto) [10:40:50] (03PS21) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [10:41:18] (03PS1) 10Btullis: Update site.pp and preseed.yaml for new Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125120 (https://phabricator.wikimedia.org/T386390) [10:41:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [10:43:08] (03CR) 10Hashar: Remove obsolete CirrusSearch config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar) [10:43:14] (03PS2) 10Hashar: Remove obsolete CirrusSearch config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 [10:44:26] (03CR) 10Stevemunene: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1125120 (https://phabricator.wikimedia.org/T386390) (owner: 10Btullis) [10:45:47] (03CR) 10DCausse: [C:03+1] "thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125091 (owner: 10Hashar) [10:49:56] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:50:42] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:51:45] federico3: ^ [10:52:18] oops, sorry, fixed [10:53:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74116 and previous config saved to /var/cache/conftool/dbconfig/20250306-105335-root.json [10:54:58] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:55:42] FIRING: AlertLintProblem: Linting problems found for CirrusSearchJobQueueLagTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [10:55:44] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:57:36] (03CR) 10Btullis: [C:03+2] Update site.pp and preseed.yaml for new Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125120 (https://phabricator.wikimedia.org/T386390) (owner: 10Btullis) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1100) [11:01:36] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:02:18] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host, interfaces up: 114, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:02:18] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:05:29] (03CR) 10Jgiannelos: [C:03+1] trafficserver: use mobileapps/pcs directly on more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1125118 (https://phabricator.wikimedia.org/T387277) (owner: 10Hnowlan) [11:06:03] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10609265 (10BTullis) >>! In T377878#10598755, @VRiley-WMF wrote: > @BTullis is there a specific RAID that is supposed to be placed onto these servers? Hi @VRiley-WMF - Apologies for the delay i... 
[11:08:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74118 and previous config saved to /var/cache/conftool/dbconfig/20250306-110841-root.json [11:09:26] (03CR) 10Fabfur: [C:03+1] trafficserver: use mobileapps/pcs directly on more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1125118 (https://phabricator.wikimedia.org/T387277) (owner: 10Hnowlan) [11:09:36] ACKNOWLEDGEMENT - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis These are old machines with expired RAID batteries, which will not be replaced. https://wikitech.wikimedia.org/wiki/Mega [11:09:37] nitoring [11:09:37] ACKNOWLEDGEMENT - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis These are old machines with expired RAID batteries, which will not be replaced. 
https://wikitech.wikimedia.org/wiki/Mega [11:09:37] nitoring [11:13:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10609302 (10BTullis) [11:14:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10609304 (10BTullis) [11:16:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21), 13Patch-For-Review: Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10609310 (10BTullis) a:05BTullis→03Jclark-ctr [11:18:47] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: druid::public::worker@eqiad [11:18:54] (03CR) 10Vgutierrez: [C:03+2] hiera,druid: Enable IPIP on druid-public-broker@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1124113 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez) [11:18:58] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2230.codfw.wmnet [11:18:58] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2230.codfw.wmnet [11:20:16] (03PS22) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:23:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74121 and previous config saved to /var/cache/conftool/dbconfig/20250306-112346-root.json [11:23:48] 06SRE, 10Observability-Alerting, 10SRE Observability (FY2024/2025-Q3): Ops-monitoring-bot creating duplicate tasks for the same RAID failure - https://phabricator.wikimedia.org/T387754#10609388 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I've reenabled the event 
handler since {T382984} is resol... [11:24:24] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=99) for role: druid::public::worker@eqiad [11:25:20] (03PS23) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:26:23] (03CR) 10Filippo Giunchedi: [C:03+2] alertmanager: remove 'default' receiver when duplicated [puppet] - 10https://gerrit.wikimedia.org/r/1124733 (https://phabricator.wikimedia.org/T353457) (owner: 10Filippo Giunchedi) [11:26:47] (03PS1) 10Hashar: Remove obsolete $wgFlowMaintenanceMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125124 [11:27:13] (03CR) 10Hnowlan: [C:03+2] trafficserver: use mobileapps/pcs directly on more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1125118 (https://phabricator.wikimedia.org/T387277) (owner: 10Hnowlan) [11:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:28:50] (03PS24) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:29:01] (03PS1) 10Vgutierrez: hiera: Enable IPIP on druid-public-broker@eqiad take two [puppet] - 10https://gerrit.wikimedia.org/r/1125125 (https://phabricator.wikimedia.org/T387307) [11:29:28] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125125 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez) [11:29:32] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2230.codfw.wmnet [11:29:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10609430 (10phaultfinder) [11:29:56] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 10%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74124 and previous config saved to /var/cache/conftool/dbconfig/20250306-112955-root.json [11:33:16] (03CR) 10Stevemunene: [C:03+1] hiera: Enable IPIP on druid-public-broker@eqiad take two [puppet] - 10https://gerrit.wikimedia.org/r/1125125 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez) [11:33:25] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable IPIP on druid-public-broker@eqiad take two [puppet] - 10https://gerrit.wikimedia.org/r/1125125 (https://phabricator.wikimedia.org/T387307) (owner: 10Vgutierrez) [11:34:08] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: druid::public::worker@eqiad [11:34:43] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2230.codfw.wmnet [11:34:47] (03CR) 10CI reject: [V:04-1] sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [11:36:34] !log Migrating 12 wikis to use mobileapps/pcs without restbase [11:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:57] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:38:50] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:38:50] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:38:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74125 and previous config 
saved to /var/cache/conftool/dbconfig/20250306-113852-root.json [11:39:03] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [11:39:03] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: druid::public::worker@eqiad [11:42:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10609489 (10BTullis) >>! In T385485#10550375, @BTullis wrote: > There might be efficiency gains on the DC Ops side if we were to sche... [11:44:15] !log applying interface-specific arp policer on cr2-magru to IX.BR sub-interface ae0.3347 (T384774) [11:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:18] T384774: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774 [11:45:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 25%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74127 and previous config saved to /var/cache/conftool/dbconfig/20250306-114501-root.json [11:53:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1209 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74129 and previous config saved to /var/cache/conftool/dbconfig/20250306-115357-root.json [11:55:22] (03CR) 10Vgutierrez: [C:04-2] "this functionality shouldn't be implemented inside acme_chief::cert" [puppet] - 10https://gerrit.wikimedia.org/r/1112773 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [11:57:06] (03CR) 10AikoChou: [C:03+1] "I have a small question about autoscaling.knative.dev/target" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125110 
(https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [11:57:54] (03CR) 10Ladsgroup: [C:03+1] "Thanks! Do you want me to deploy it?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar) [12:00:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 50%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74130 and previous config saved to /var/cache/conftool/dbconfig/20250306-120007-root.json [12:07:57] (03PS25) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [12:08:39] (03CR) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [12:13:32] (03PS1) 10Gergő Tisza: Enable SUL3 signup for 10% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125130 (https://phabricator.wikimedia.org/T384007) [12:13:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125130 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [12:15:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 75%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74131 and previous config saved to /var/cache/conftool/dbconfig/20250306-121512-root.json [12:15:50] !log imported lshw 02.19.git.2021.06.19.996aaad9c7-2~bpo11+1 to component/lshw T383557 [12:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:53] T383557: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557 [12:16:22] 
(03PS1) 10Gergő Tisza: Enable SUL3 signup for all group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125134 (https://phabricator.wikimedia.org/T384007) [12:16:35] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, March 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125134 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [12:18:26] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125137 [12:19:04] (03PS26) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [12:20:59] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db1197.eqiad.wmnet [12:21:01] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db1197 - Upgrading db1197 [12:21:18] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1197 - Upgrading db1197 [12:24:56] !log installing krb5 security updates [12:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:10] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db1197 gradually with 4 steps - Upgrading db1197 [12:30:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2152 (re)pooling @ 100%: Repooling after rebuild index', diff saved to https://phabricator.wikimedia.org/P74134 and previous config saved to /var/cache/conftool/dbconfig/20250306-123017-root.json [12:43:02] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 139 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:44:04] 10ops-magru, 06Infrastructure-Foundations, 10netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10609670 (10cmooney) After a good deal of back and forth with JTAC they were able to 
point us in the right direction. By default the MX platform has a built-in "ar... [12:56:48] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@fa4513d]: say hello to image suggestions v1.0.0 [12:57:17] (03CR) 10Sergio Gimeno: [C:03+1] Growth: remove unused config wgGENewcomerTasksOresTopicConfigTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124836 (owner: 10Michael Große) [12:57:28] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@fa4513d]: say hello to image suggestions v1.0.0 (duration: 01m 09s) [12:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:58:08] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:59:08] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1300) [13:01:34] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:02:33] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:07:45] (03PS1) 10Marostegui: dbproxy102[2,4]: Test db1250 [puppet] - 10https://gerrit.wikimedia.org/r/1125145 (https://phabricator.wikimedia.org/T388024) [13:08:26] (03CR) 10Marostegui: [C:03+2] dbproxy102[2,4]: Test db1250 [puppet] - 10https://gerrit.wikimedia.org/r/1125145 (https://phabricator.wikimedia.org/T388024) (owner: 10Marostegui) [13:11:04] (03PS1) 10Marostegui: Revert "dbproxy102[2,4]: Test db1250" [puppet] - 10https://gerrit.wikimedia.org/r/1125146 [13:11:16] !log installing gst-plugins-base1.0 security updates [13:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:40] (03CR) 10Marostegui: [C:03+2] Revert "dbproxy102[2,4]: Test db1250" [puppet] - 10https://gerrit.wikimedia.org/r/1125146 (owner: 10Marostegui) [13:13:40] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1197 gradually with 4 steps - Upgrading db1197 [13:13:40] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1197.eqiad.wmnet [13:14:52] RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [13:16:35] (03PS1) 10Marostegui: db1250: Make it master in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1125150 (https://phabricator.wikimedia.org/T388024) [13:16:58] (03CR) 10Marostegui: [C:03+2] db1250: Make it master in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1125150 (https://phabricator.wikimedia.org/T388024) (owner: 10Marostegui) [13:17:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1124736 (owner: 10Slyngshede) [13:24:02] (03PS3) 10Sergio Gimeno: [Growth] Set default api lookahead size to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) [13:24:36] (03CR) 10Sergio Gimeno: [Growth] Set default api lookahead size to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) (owner: 10Sergio Gimeno) [13:28:01] (03CR) 10Federico Ceratto: "I tested the CR at the current version doing a real update on db1197 and it worked without manual intervention. See phabricator for detail" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [13:29:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [13:29:49] (03CR) 10Michael Große: [Growth] Set default api lookahead size to 10 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) (owner: 10Sergio Gimeno) [13:35:06] (03CR) 10Marostegui: "Thanks for running this. @rcoccioli@wikimedia.org you ok with the latest changes done after your comments?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [13:36:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10609829 (10MoritzMuehlenhoff) Hmmh, I'm not sure why these have eight drives? 
These are config C, so they should simply have 4x960G SSDs, right? Did Super... [13:41:55] (03CR) 10Volans: "@marostegui@wikimedia.org there are (-49, +99) lines difference between the PS I reviewed earlier and the current one. It's basically a t" [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [13:42:42] (03CR) 10Marostegui: "Yeah, I wasn't implying it has to be today." [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [13:46:26] (03CR) 10Muehlenhoff: [C:03+2] Install lshw backport from component/lshw [puppet] - 10https://gerrit.wikimedia.org/r/1124764 (https://phabricator.wikimedia.org/T380295) (owner: 10Muehlenhoff) [13:47:21] jouncebot: nowandnext [13:47:22] For the next 0 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1300) [13:47:22] In 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1400) [13:47:35] (03CR) 10Volans: "If changes were done in smaller, incremental CRs, with a clear and limited scope it would be much easier and quicker to review them IMHO." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [13:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:47:58] (03CR) 10Hashar: [C:03+2] Fix nested refs with the same name but a different group [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125104 (https://phabricator.wikimedia.org/T387800) (owner: 10Isabelle Hurbain-Palatin) [13:48:11] I have +2ed the Cite patch in the interest of time [13:48:24] why thank you very much [13:48:43] (i must admit i was not optimistic in getting this in the afternoon backport :) ) [13:49:30] (03Merged) 10jenkins-bot: Fix nested refs with the same name but a different group [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125104 (https://phabricator.wikimedia.org/T387800) (owner: 10Isabelle Hurbain-Palatin) [13:49:32] I am contemplating dropping the windows :b [13:49:44] in favor of self deploying! 
[13:50:11] then we have a bunch of configuration patches sent by volunteers and they need someone to drive the deploy [13:50:37] + I guess it is good for SREs to know when the cluster is going to explode (only 3 times per day instead of at any arbitrary point of time) [13:50:44] so hmm I don't know [13:50:44] ^ this :P [13:51:01] and/or to not explode the cluster with an ill-timed deployment that happens at a critical time [13:51:05] of course if devs actually managed the cluster instead of SRE, they could self fix the cluster [13:51:14] but then devs would not have time to do software development [13:51:21] but we could get SRE to do the development, then [13:51:22] any way [13:51:23] ... [13:51:25] are you trying to dev SREs out of a job [13:52:16] developers developers developers! :b [13:52:23] :D [13:52:29] * ihurbain throws a chair [13:53:02] * hashar waits for CI to process the chair trajectory and wipe it if it affects prod [13:55:46] (03CR) 10Slyngshede: [C:03+2] Show existing approvals on permission approval pages [software/bitu] - 10https://gerrit.wikimedia.org/r/1124736 (owner: 10Slyngshede) [13:58:39] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1125094 (owner: 10Muehlenhoff) [13:59:09] (03CR) 10Slyngshede: [C:03+1] "cloudinfra is running CAS 6.6.12+wmf11u2 so we should be fine." [puppet] - 10https://gerrit.wikimedia.org/r/1125094 (owner: 10Muehlenhoff) [13:59:15] ihurbain: ah that Cite patch is already merged [13:59:36] meep meep [13:59:45] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1125104|Fix nested refs with the same name but a different group (T387800)]] [13:59:48] T387800: Nested ref support is broken in Cite-Parsoid - https://phabricator.wikimedia.org/T387800 [13:59:58] \o/ thank you! [13:59:59] (03CR) 10Hashar: "I will deploy it together with another related change. Thank you for the review!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125097 (https://phabricator.wikimedia.org/T348526) (owner: 10Hashar) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1400). [14:00:05] ollie_wmde, sergi0, MichaelG_WMF, Daimona, ihurbain, and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] (03PS2) 10Volans: cli: log an eventual exception to stderr [software/cumin] - 10https://gerrit.wikimedia.org/r/1114456 (https://phabricator.wikimedia.org/T384539) (owner: 10TheAnarcat) [14:00:08] (03PS1) 10Volans: docs: removed deprecated call to sphinx_rtd_theme [software/cumin] - 10https://gerrit.wikimedia.org/r/1125157 [14:00:08] (03PS1) 10Volans: query: do not error on no match in first subquery [software/cumin] - 10https://gerrit.wikimedia.org/r/1125158 [14:00:10] o/ [14:00:13] o/ [14:00:15] I can probably deploy in a minute or two [14:00:24] I am deploying isabelle patch to Cite [14:00:30] o/ [14:00:32] for the rest of patches I don't know [14:00:52] mine should be purely a no-op: removing unused config [14:01:06] (03Merged) 10jenkins-bot: Show existing approvals on permission approval pages [software/bitu] - 10https://gerrit.wikimedia.org/r/1124736 (owner: 10Slyngshede) [14:01:09] (03CR) 10Ilias Sarantopoulos: ml-services: update reference-quality models (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125110 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [14:01:17] ah I have a few patches for unused config as well [14:01:19] o/ [14:01:32] Me too! 
[14:01:34] based on matmarex's message and https://wikitech.wikimedia.org/wiki/Technical_debt/Unused_config#Results [14:01:34] Hello I'm here for 1123007 [14:01:45] so I guess those unused-config patches can go last [14:02:36] I am wondering whether maybe we should have another window late in the morning [14:02:47] !log hashar@deploy2002 hashar, ihurbain: Backport for [[gerrit:1125104|Fix nested refs with the same name but a different group (T387800)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:02:56] ihurbain: your patch to Cite is live on debug servers [14:02:57] (03PS1) 10Filippo Giunchedi: pontoon: note pipx requirements [puppet] - 10https://gerrit.wikimedia.org/r/1125159 [14:03:07] let me have a look [14:03:32] ollie_wmde: for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123007 that can be merged any time since that only affects the beta cluster :) [14:04:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10609902 (10BTullis) As we have not yet completed the installation of {T377878} - then I think it makes sense to swap the 4TB drives... [14:04:29] hashar: Oh, okay - when would you like to do it, I don't have +2 access? [14:04:38] yeah I will do it [14:04:40] ok, I can confirm that I can deploy today :) [14:04:57] then I am wondering how to deploy all of those [14:05:05] cause one after the other is going to take a while [14:05:09] then some are trivial [14:05:10] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: note pipx requirements [puppet] - 10https://gerrit.wikimedia.org/r/1125159 (owner: 10Filippo Giunchedi) [14:05:20] 👋 is it too late to schedule a config deploy? 
I accidentally scheduled mine for next week 🤦‍♂️ [14:05:46] well join the fun https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1400 :) [14:05:47] (03CR) 10Ilias Sarantopoulos: ml-services: update reference-quality models (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125110 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [14:05:50] itamarWMDE: +1 [14:07:31] ihurbain: is the Cite patch ok? [14:07:42] hashar: still looking, sorry [14:07:47] no worries :) [14:07:51] I am preparing the next batch [14:07:54] hashar: thank you! [14:08:16] hashar: there's no other window in the next two hours so should be fine [14:08:33] (I can take over deploys at the end of the hour if needed) [14:08:36] I should look at how busy this window is. If it tends to be popular, maybe we need another one on Thursday [14:08:45] or make it a two-hour one [14:09:20] in general IMO we have way too few backport windows [14:09:25] (or too short) [14:09:46] yeah I think we should revisit that concept eventually [14:09:48] hashar: mmmh. the issue i think should be fixed is not fixed. [14:10:01] ihurbain: that's a shame [14:10:23] i do not believe it broke anything, but.... but then i don't know what to do >_< [14:10:39] we can always abort and rollback [14:10:52] did i put it on the right branch [14:10:55] (probably) [14:11:09] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1233.eqiad.wmnet onto db1254.eqiad.wmnet [14:11:09] wmf/1.44.0-wmf.19 [14:11:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10609948 (10BTullis) >>! In T385485#10550375, @BTullis wrote: > There are two main options around shuffling the data around, based on... 
[14:11:16] (03PS1) 10Vgutierrez: cumin: Remove lvs-eqsin alias [puppet] - 10https://gerrit.wikimedia.org/r/1125162 (https://phabricator.wikimedia.org/T384477) [14:11:19] that looks reasonable [14:11:20] which is deployed on all wikis [14:12:16] or maybe the parse is cached? :b [14:12:48] i've tried to slap that and to create a new page. i'm having doubts on my mwdebug setup (and my sanity) [14:13:01] gimme another 3 minutes before rollback plz? [14:13:08] (03PS2) 10Kamila Součková: prometheus: charmuseum relabel config [puppet] - 10https://gerrit.wikimedia.org/r/1124843 (https://phabricator.wikimedia.org/T386808) [14:13:13] yeah no worries [14:13:34] if the requests are served by mwdebug, the HTTP response should have a `Server: mwdebugXXXX...` header [14:15:20] (03PS3) 10Kamila Součková: prometheus: charmuseum relabel config [puppet] - 10https://gerrit.wikimedia.org/r/1124843 (https://phabricator.wikimedia.org/T386808) [14:16:00] (03CR) 10Kamila Součková: prometheus: charmuseum relabel config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124843 (https://phabricator.wikimedia.org/T386808) (owner: 10Kamila Součková) [14:17:22] hashar: okay, rollback, it should work, it doesn't [14:17:29] !log hashar@deploy2002 Sync cancelled. [14:17:39] does it really need a rollback? 
[14:17:40] done [14:18:07] (03PS1) 10Hashar: Revert "Fix nested refs with the same name but a different group" [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125164 [14:18:12] if it doesn’t break anything I would’ve thought we can skip the revert [14:18:18] but ok I’m too late with that comment anyway [14:18:22] PROBLEM - Disk space on deploy2002 is CRITICAL: DISK CRITICAL - free space: /srv 15981 MB (5% inode=71%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [14:18:40] hmm deploy2002 is full [14:18:43] :/ [14:18:49] pff [14:18:53] (03CR) 10Kamila Součková: [C:03+2] prometheus: charmuseum relabel config [puppet] - 10https://gerrit.wikimedia.org/r/1124843 (https://phabricator.wikimedia.org/T386808) (owner: 10Kamila Součková) [14:19:27] Filesystem Size Used Avail Use% Mounted on [14:19:27] /dev/mapper/vg0-srv 277G 247G 16G 95% /srv [14:19:49] that’s a lot of disk space in /srv/deployment/analytics (28.5 GiB) [14:20:13] but as a short-term fix it might be easier to prune wmf.17 if we don’t need it anymore? [14:21:04] (03CR) 10Btullis: [C:03+1] hiera,wdqs: Enable IPIP for wdqs(-ssl|-heavy-queries)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123667 (https://phabricator.wikimedia.org/T387314) (owner: 10Vgutierrez) [14:21:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10610006 (10Papaul) @VRiley-WMF I checked on the packing slip, it said the each server has 4 drives but when i login to 1053 in the BIOS i see only 2 drive... 
[14:21:28] (03PS1) 10Bking: cloudelastic: include lvs profile for opensearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125165 (https://phabricator.wikimedia.org/T387904) [14:21:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125165 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [14:21:57] looking at https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy2002&var-datasource=thanos&var-cluster=misc&viewPanel=12&from=now-7d&to=now [14:22:15] deploy2002 /srv has been at 90% usage for a while [14:22:33] /srv/deployment/analytics/refinery is mostly due to git fat [14:23:37] (03PS2) 10Bking: cloudelastic: include lvs profile for opensearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125165 (https://phabricator.wikimedia.org/T387904) [14:23:37] I've dropped a few large files I had on deploy2002 (specifically logs from long maintenance script runs that were verbose) [14:23:57] I am not even sure why the git fat objects are on the deployment server. Supposedly they were only fetched from the targets [14:24:09] Dreamy_Jazz: thanks, but the warning is about /srv so I doubt that helped :/ [14:24:26] Ah, assumed the warning was more generic [14:24:41] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125165 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [14:24:47] Anyway, still good to clean up.
[14:25:12] T328472 [14:25:17] ah we got refinery moved to git-lfs [14:25:56] (03CR) 10DCausse: [C:03+1] cloudelastic: include lvs profile for opensearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125165 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [14:26:01] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): openstack galera no recent writes 2025-03-04, suspected network hardware problem - https://phabricator.wikimedia.org/T387828#10610025 (10fnegri) [14:26:41] !log deploy2002: cleaned obsolete git-fat objects for analytics/refinery , that moved to git-lfs - T328472 [14:26:59] went from 16G to 38G available [14:27:01] * hashar flexes [14:27:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125164 (owner: 10Hashar) [14:27:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) (owner: 10Ollie Shotton) [14:27:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124836 (owner: 10Michael Große) [14:27:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124893 (https://phabricator.wikimedia.org/T387025) (owner: 10Daimona Eaytoy) [14:27:20] !log bking@cumin2002 conftool action : set/pooled=no; selector: name=cloudelastic1007.eqiad.wmnet [14:27:28] Nice! 
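A rough Python sketch of a free-space check like the one that fired for /srv. The 6% threshold and all function names are assumptions made for this example, not the values used by the real monitoring check. Note that `df` computes its Use% column differently, so the numbers will not match it exactly.

```python
import shutil


def free_percent(free_bytes, total_bytes):
    """Free space as a percentage of the filesystem's total size."""
    return 100.0 * free_bytes / total_bytes


def is_low_on_space(free_bytes, total_bytes, min_free_percent=6.0):
    """True when free space drops below the (hypothetical) threshold."""
    return free_percent(free_bytes, total_bytes) < min_free_percent


def check_path(path, min_free_percent=6.0):
    """Apply the same test to a mounted path such as /srv."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return is_low_on_space(usage.free, usage.total, min_free_percent)
```

With the figures quoted above (a 277G filesystem), 16G available counts as low under this assumed threshold, while 38G available does not.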
[14:27:44] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - HTTPS on cloudelastic.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 717 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Search%23Administration [14:27:44] RECOVERY - WMF Cloud -Omega Cluster- - Public Internet Port - SSL Expiry on cloudelastic.wikimedia.org is OK: OK - Certificate cloudelastic.wikimedia.org will expire on Sun 11 May 2025 11:48:24 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Search%23Administration [14:27:44] that is the Cite rollback for ihurbain + 3 trivial patches (beta, config removals) [14:27:59] (03Merged) 10jenkins-bot: Test new term store config in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123007 (https://phabricator.wikimedia.org/T385592) (owner: 10Ollie Shotton) [14:28:04] (03Merged) 10jenkins-bot: Growth: remove unused config wgGENewcomerTasksOresTopicConfigTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124836 (owner: 10Michael Große) [14:28:05] (03Merged) 10jenkins-bot: Drop $wmgCampaignEventsProgramsAndEventsDashboardEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124893 (https://phabricator.wikimedia.org/T387025) (owner: 10Daimona Eaytoy) [14:28:09] (03CR) 10Bking: [C:03+2] cloudelastic: include lvs profile for opensearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125165 (https://phabricator.wikimedia.org/T387904) (owner: 10Bking) [14:28:33] hashar: nice, want to do that on deploy1003 too? 
^^ [14:28:48] (doesn’t have a disk warning but has the same amount of git-fat files AFAICT) [14:29:52] * Lucas_WMDE in another meeting, can’t deploy anymore [14:30:27] I will file a task about removing git fat objects [14:31:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-cloudelastic is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [14:33:02] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1007.eqiad.wmnet [14:34:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUns [14:35:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1187.eqiad.wmnet with OS bullseye [14:35:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1187.eqiad.wmnet with OS... 
[14:35:51] pff what a long window [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:22] RECOVERY - Disk space on deploy2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy2002&var-datasource=codfw+prometheus/ops [14:39:31] it feels like the job is stuck at https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php81/686/consoleFull [14:41:24] (03Merged) 10jenkins-bot: Revert "Fix nested refs with the same name but a different group" [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125164 (owner: 10Hashar) [14:41:44] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1125164|Revert "Fix nested refs with the same name but a different group"]], [[gerrit:1123007|Test new term store config in beta (T385592)]], [[gerrit:1124836|Growth: remove unused config wgGENewcomerTasksOresTopicConfigTitle]], [[gerrit:1124893|Drop $wmgCampaignEventsProgramsAndEventsDashboardEnabled (T387025)]] [14:42:08] Well it got unstuck. 
But I'm seeing a warning in the test output that doesn't look good, and also does not make any tests fail ("Total size of styles modules is 20.3kB") [14:43:25] sergi0: I will do your patch once that batch has finished [14:43:59] Ah I see, it is intentionally only printing the message: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/50d550ad2082ac23db2ae01ae0117169aed3af03 [14:44:14] great, should be quick to test [14:44:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10610113 (10BTullis) >>! In T385485#10609489, @BTullis wrote: >>>! In T385485#10550375, @BTullis wrote: >> There might be efficiency... [14:45:45] I will push Daimona patch removing permissions 1124879 and itamarWMDE patch removing some route (1122990) [14:46:04] ty! [14:46:11] so a batch of three [14:46:42] which would leave " [config] 1124911 (deploy commands) Use namespaced Title class - task T388085" and " [config] 1125130 (deploy commands) Enable SUL3 signup for 10% of group 1 users - task T384007" [14:46:43] T388085: Upcoming production error: Error: Class "Title" not found - https://phabricator.wikimedia.org/T388085 [14:46:44] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [14:46:51] which I think should be done independently [14:47:22] (03CR) 10Jelto: [C:03+1] "lgtm, one comment in-line but probably just worth trying." 
[puppet] - 10https://gerrit.wikimedia.org/r/1124857 (https://phabricator.wikimedia.org/T388041) (owner: 10Dzahn) [14:47:23] !log hashar@deploy2002 ollieshotton, migr, daimona, hashar: Backport for [[gerrit:1125164|Revert "Fix nested refs with the same name but a different group"]], [[gerrit:1123007|Test new term store config in beta (T385592)]], [[gerrit:1124836|Growth: remove unused config wgGENewcomerTasksOresTopicConfigTitle]], [[gerrit:1124893|Drop $wmgCampaignEventsProgramsAndEventsDashboardEnabled (T387025)]] synced to the testservers (h [14:47:23] ttps://wikitech.wikimedia.org/wiki/Mwdebug) [14:47:27] T385592: Test the new term store config in beta cluster - https://phabricator.wikimedia.org/T385592 [14:47:28] T387025: MediaWiki\Extension\CampaignEvents\TrackingTool\ToolNotFoundException: No tool with DB ID 1 - https://phabricator.wikimedia.org/T387025 [14:47:41] !log hashar@deploy2002 ollieshotton, migr, daimona, hashar: Continuing with sync [14:50:22] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Relabel Relforge hosts to Elastic hosts - https://phabricator.wikimedia.org/T388133 (10bking) 03NEW [14:52:42] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: ship sync-data in bin/ not sbin/ [puppet] - 10https://gerrit.wikimedia.org/r/1125107 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [14:53:49] I really wonder what is the bottleneck with that k8s deployment [14:53:55] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125164|Revert "Fix nested refs with the same name but a different group"]], [[gerrit:1123007|Test new term store config in beta (T385592)]], [[gerrit:1124836|Growth: remove unused config wgGENewcomerTasksOresTopicConfigTitle]], [[gerrit:1124893|Drop $wmgCampaignEventsProgramsAndEventsDashboardEnabled (T387025)]] (duration: 12m 10s) [14:53:59] T385592: Test the new term store config in beta cluster - https://phabricator.wikimedia.org/T385592 [14:53:59] T387025: 
MediaWiki\Extension\CampaignEvents\TrackingTool\ToolNotFoundException: No tool with DB ID 1 - https://phabricator.wikimedia.org/T387025 [14:54:21] I am doing the next batch [14:54:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) (owner: 10Sergio Gimeno) [14:54:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738) (owner: 10Daimona Eaytoy) [14:54:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: 10Itamar Givon) [14:54:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:40] (03Merged) 10jenkins-bot: [Growth] Set default api lookahead size to 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120180 (https://phabricator.wikimedia.org/T325990) (owner: 10Sergio Gimeno) [14:55:42] FIRING: AlertLintProblem: Linting problems found for CirrusSearchJobQueueLagTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [14:55:42] (03Merged) 10jenkins-bot: Revert "Let sysops add/remove the event-organizer group by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124879 (https://phabricator.wikimedia.org/T386738) (owner: 10Daimona Eaytoy) [14:55:49] (03Merged) 10jenkins-bot: Remove unused route file from Wikibase REST API configuration 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122990 (https://phabricator.wikimedia.org/T383774) (owner: 10Itamar Givon) [14:56:06] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1120180|[Growth] Set default api lookahead size to 10 (T325990)]], [[gerrit:1124879|Revert "Let sysops add/remove the event-organizer group by default" (T386738)]], [[gerrit:1122990|Remove unused route file from Wikibase REST API configuration (T383774)]] [14:56:12] T325990: Incorrect paging in GrowthExperiments suggested edits module - https://phabricator.wikimedia.org/T325990 [14:56:12] T386738: Consider removal of $wgAddGroups and $wgRemoveGroups added for the event-organizer group in WMF config - https://phabricator.wikimedia.org/T386738 [14:56:12] T383774: Remove v0 routes and the corresponding test - https://phabricator.wikimedia.org/T383774 [14:57:53] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::public@codfw [14:57:55] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1187.eqiad.wmnet with reason: host reimage [14:57:59] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP for wdqs(-ssl|-heavy-queries)@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1123667 (https://phabricator.wikimedia.org/T387314) (owner: 10Vgutierrez) [14:58:48] so hmm [14:58:49] !log hashar@deploy2002 hashar, sgimeno, itamar, daimona: Backport for [[gerrit:1120180|[Growth] Set default api lookahead size to 10 (T325990)]], [[gerrit:1124879|Revert "Let sysops add/remove the event-organizer group by default" (T386738)]], [[gerrit:1122990|Remove unused route file from Wikibase REST API configuration (T383774)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:58:54] I have a power outage [14:59:08] the patches were/are being synced on the testservers [14:59:12] Another ordinary day at the deployment office :P [14:59:29] and since I did not run screen/tmux [14:59:40] oh no 
[14:59:42] I guess scap is waiting for me to confirm [14:59:56] unless well I find a way to attach to pts/16 somehow :b [15:00:16] sergi0: Daimona: itamarWMDE: I think your patches are now on the debug servers [15:00:22] checking [15:00:35] someone with root should be able to take over your terminal, right? [15:00:40] checking too [15:01:11] well [15:01:12] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): serve 5% of residual traffic on 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [15:01:19] or I can send 'y\n' to stdin of the process :b [15:01:51] my change, lgtm [15:02:14] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [15:02:25] looks alright [15:02:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1187.eqiad.wmnet with reason: host reimage [15:02:51] "Another ordinary day at the deployment office :P" pretty much Daimona ! :b [15:03:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-codfw or A:lvs-secondary-codfw) and A:bullseye and A:lvs [15:03:27] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::public@codfw [15:03:32] Oh sorry, testing my patch now. [15:03:39] no worries [15:03:59] tgr_: with stock tools I don't think so [15:04:01] once you are done I will try sending `y\n' to /proc/$(pidof scap)/fd/0 [15:04:17] it is still open [15:04:27] and I can watch scap progress from logstash [15:05:00] I can maybe gdb hook the scap process [15:05:07] buuuuut [15:05:14] claime: apt install reptyr on deploy2002? (/hj) [15:05:25] * claime bonks Lucas_WMDE [15:05:30] :P [15:06:02] Looks good to me! 
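A small Python sketch of why writing to a process's terminal device is not the same as typing into it. It assumes a Linux-like system with pseudo-terminal support. The function name `pty_directions` is made up for the example, and it works on a fresh pty pair rather than anything on deploy2002. On a pseudo-terminal, bytes written to the slave side (the `/dev/pts/N` device, which is also where `/proc/PID/fd/0` points for an interactive process) travel to the master side as terminal output. Only bytes written on the master side arrive as input to the process.

```python
import os
import pty
import tty


def pty_directions(payload=b"y\n"):
    """Show which side of a pty receives data written to the other side."""
    master_fd, slave_fd = pty.openpty()
    try:
        # Raw mode: no echo, no line buffering, no newline translation,
        # so the bytes pass through unchanged in both directions.
        tty.setraw(slave_fd)

        # Equivalent of `echo y > /dev/pts/N`: a write to the slave side.
        os.write(slave_fd, payload)
        # It comes out on the master side, i.e. it is shown as output.
        seen_by_terminal = os.read(master_fd, len(payload))

        # Equivalent of a keystroke: a write to the master side.
        os.write(master_fd, payload)
        # This is what a process reading its stdin would receive.
        seen_by_process = os.read(slave_fd, len(payload))

        return seen_by_terminal, seen_by_process
    finally:
        os.close(master_fd)
        os.close(slave_fd)
```

The master side of a real session is held by the terminal emulator or the SSH daemon, which is why another shell on the same host cannot simply write there.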
[15:06:04] (03CR) 10Filippo Giunchedi: [C:03+1] zuul: remove gearman wait queue monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1124857 (https://phabricator.wikimedia.org/T388041) (owner: 10Dzahn) [15:07:42] well my echo y > /dev/pts did not work [15:07:49] I guess that prints to the terminal itself [15:08:00] ok let me try to hook scap via gdb [15:08:17] :/ [15:09:56] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.migrate-service-ipip for role: wdqs::public@eqiad [15:10:01] if it is too complicated, I'd kill scap and retry (but with screen this time) [15:10:13] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP for wdqs(-ssl|-heavy-queries)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123668 (https://phabricator.wikimedia.org/T387314) (owner: 10Vgutierrez) [15:10:18] (03PS2) 10Vgutierrez: hiera,wdqs: Enable IPIP for wdqs(-ssl|-heavy-queries)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123668 (https://phabricator.wikimedia.org/T387314) [15:10:51] (03CR) 10Vgutierrez: [C:03+2] hiera,wdqs: Enable IPIP for wdqs(-ssl|-heavy-queries)@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1123668 (https://phabricator.wikimedia.org/T387314) (owner: 10Vgutierrez) [15:11:15] yeah that's not working [15:12:33] (03PS1) 10Effie Mouzeli: trafficserver: respect PHP_ENGINE_STICKY cookie value [puppet] - 10https://gerrit.wikimedia.org/r/1125176 (https://phabricator.wikimedia.org/T383845) [15:13:44] hashar: guess you'll have to rerun it [15:14:22] I really recommend setting up your ssh config to start/reattach a tmux or a screen automatically [15:14:40] (03CR) 10CI reject: [V:04-1] trafficserver: respect PHP_ENGINE_STICKY cookie value [puppet] - 10https://gerrit.wikimedia.org/r/1125176 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [15:15:12] RemoteCommand tmux new -A -s [15:15:18] (03CR) 10Bking: [C:03+2] cirrus: drop cirrus_saneitize_jobs periodic job (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/1113461 
(owner: 10DCausse) [15:18:17] I am back [15:18:24] 4G died as well for some reason [15:18:51] so there is no longer any scap process belonging to me [15:19:57] and I don't think they got deployed [15:20:04] so I am retrying [15:20:13] this time with `screen` [15:20:34] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1120180|[Growth] Set default api lookahead size to 10 (T325990)]], [[gerrit:1124879|Revert "Let sysops add/remove the event-organizer group by default" (T386738)]], [[gerrit:1122990|Remove unused route file from Wikibase REST API configuration (T383774)]] [15:20:39] T325990: Incorrect paging in GrowthExperiments suggested edits module - https://phabricator.wikimedia.org/T325990 [15:20:39] T386738: Consider removal of $wgAddGroups and $wgRemoveGroups added for the event-organizer group in WMF config - https://phabricator.wikimedia.org/T386738 [15:20:40] T383774: Remove v0 routes and the corresponding test - https://phabricator.wikimedia.org/T383774 [15:20:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:00] claime: thank you claime [15:21:18] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:22:08] after that the patches left to deploy are 1124911 (use namespace Title class, which I guess would solve the "Class Title" not found in beta) [15:22:23] which sounds super scary given Title has been around for 22 years :) [15:22:26] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on (A:lvs-low-traffic-eqiad or A:lvs-secondary-eqiad) and A:bullseye and A:lvs [15:22:26]
!log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.migrate-service-ipip (exit_code=0) for role: wdqs::public@eqiad [15:22:50] and 1125130 for tgr which I guess needs close monitoring [15:24:10] Status code: expected 200, got 503. [15:24:23] httpbb failed on mwdebug1002 [15:24:26] !log hashar@deploy2002 itamar, sgimeno, daimona, hashar: Backport for [[gerrit:1120180|[Growth] Set default api lookahead size to 10 (T325990)]], [[gerrit:1124879|Revert "Let sysops add/remove the event-organizer group by default" (T386738)]], [[gerrit:1122990|Remove unused route file from Wikibase REST API configuration (T383774)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:24:41] I think it is known that the debug servers are slow / do not respond immediately [15:24:42] !log hashar@deploy2002 itamar, sgimeno, daimona, hashar: Continuing with sync [15:25:01] the other hosts worked so I am happily ignoring it [15:25:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:27:35] (03Abandoned) 10Fabfur: acmecerts: new param to use tmpfs storage for certificates [puppet] - 10https://gerrit.wikimedia.org/r/1112773 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [15:27:38] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:28:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes
(http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:29:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:29:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1187.eqiad.wmnet with OS bullseye [15:29:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1187.eqiad.wmnet with OS bull... [15:30:58] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1120180|[Growth] Set default api lookahead size to 10 (T325990)]], [[gerrit:1124879|Revert "Let sysops add/remove the event-organizer group by default" (T386738)]], [[gerrit:1122990|Remove unused route file from Wikibase REST API configuration (T383774)]] (duration: 10m 23s) [15:31:03] T325990: Incorrect paging in GrowthExperiments suggested edits module - https://phabricator.wikimedia.org/T325990 [15:31:04] T386738: Consider removal of $wgAddGroups and $wgRemoveGroups added for the event-organizer group in WMF config - https://phabricator.wikimedia.org/T386738 [15:31:04] T383774: Remove v0 routes and the corresponding test - https://phabricator.wikimedia.org/T383774 [15:32:11] ok done [15:32:21] (03PS1) 10DCausse: team-search-platform: drop CirrusSearchJobQueueLagTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1125178 [15:32:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124911 
(https://phabricator.wikimedia.org/T388085) (owner: 10Daimona Eaytoy) [15:32:41] hurrah, ty hashar [15:32:58] hashar: thanks! was following the suspense up close :) [15:34:00] (03Merged) 10jenkins-bot: Use namespaced Title class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124911 (https://phabricator.wikimedia.org/T388085) (owner: 10Daimona Eaytoy) [15:34:18] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1124911|Use namespaced Title class (T388085)]] [15:34:21] T388085: Upcoming production error: Error: Class "Title" not found - https://phabricator.wikimedia.org/T388085 [15:34:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10610406 (10phaultfinder) [15:35:54] hashar: removing the Title alias is scary (this patch is trying to fix an issue caused by that). Switching to the new namespace is not scary, it has been done in core/extension code already. [15:35:56] (03CR) 10Jforrester: "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124911 (https://phabricator.wikimedia.org/T388085) (owner: 10Daimona Eaytoy) [15:35:56] oh that Title patch is just for robots.txt [15:36:11] tgr_: ah cool thank you !! [15:36:52] Well, dropping the actual Title class might be fun too. No more tech debt! [15:37:04] but [15:37:11] IT HAS BEEN WORKING FOR TWENTY TWO YEARS!!! [15:37:20] leave the stable software alone! [15:37:41] robots.php generates https://en.wikipedia.org/robots.txt on the fly, doesn't it? [15:38:21] Is there such a thing as stable software?
:D [15:38:39] * hashar looks at `ed` [15:38:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:39:32] !log hashar@deploy2002 hashar, daimona: Backport for [[gerrit:1124911|Use namespaced Title class (T388085)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:39:35] T388085: Upcoming production error: Error: Class "Title" not found - https://phabricator.wikimedia.org/T388085 [15:40:00] Ah BTW, since the error is in beta, it's not testable in production. Guess I'll trigger a beta update. [15:40:24] so it seems: https://gerrit.wikimedia.org/g/operations/puppet/+/234d9b6c1560b915a924537c78e99483b2bd3ab5/modules/mediawiki/templates/apache/mediawiki-vhost.conf.erb#54 [15:40:26] hashar: "ed" had a new release in 2025 :P [15:40:32] https://download.savannah.gnu.org/releases/ed/?C=M&O=D [15:41:02] Daimona: because production still has Title, doesn't it? [15:41:11] bblack: good to know it is still maintained! [15:41:38] Aye [15:41:43] it is being rolled at https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/537989/console [15:42:03] * hashar writes a note about Jenkins not requiring one to wrap scap with `screen` [15:42:22] Yep I triggered that.
[15:42:36] Didn't wanna wait 2 minutes [15:43:19] I should check SpiderPig ( https://phabricator.wikimedia.org/F57689745 ) [15:43:30] a web frontend to scap [15:45:01] of course the scap sync world is waiting for the automatically triggered beta-code-update-eqiad to run :) [15:45:11] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:22] https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/196664/console [15:45:26] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:46:03] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1118080 (https://phabricator.wikimedia.org/T385869) (owner: 10Clément Goubert) [15:46:16] (03PS26) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [15:48:52] Alright, it works! 
[15:49:24] (03PS1) 10Bking: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) [15:49:55] coool [15:50:00] !log hashar@deploy2002 hashar, daimona: Continuing with sync [15:50:36] (03CR) 10CI reject: [V:04-1] cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [15:51:01] (03CR) 10Bking: [C:03+2] team-search-platform: drop CirrusSearchJobQueueLagTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1125178 (owner: 10DCausse) [15:51:11] (03PS1) 10Hnowlan: conftool: empty jobrunner and videoscaler pools [puppet] - 10https://gerrit.wikimedia.org/r/1125181 (https://phabricator.wikimedia.org/T354791) [15:51:11] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:26] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [15:55:26] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:56:18] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124911|Use namespaced Title class (T388085)]] (duration: 22m 00s) [15:56:21] T388085: Upcoming production error: Error: Class "Title" not found - https://phabricator.wikimedia.org/T388085 [15:57:35] Daimona: done! 
[15:58:00] And there you go, thanks hashar \o/ [15:58:01] (03CR) 10Clément Goubert: [C:03+1] conftool: empty jobrunner and videoscaler pools [puppet] - 10https://gerrit.wikimedia.org/r/1125181 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:58:05] tgr_: the last one is your https://gerrit.wikimedia.org/r/c/1125130/ [15:58:25] but I confess my brain is fried after the last two hours of up & down and I need a break [15:58:37] Quite an intense deployment window today. [15:59:31] hashar: thanks for deploying the rest! I can take over [15:59:34] jouncebot: next [15:59:34] In 0 hour(s) and 0 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1600) [15:59:38] awesome thank you! [15:59:59] maybe I will generate some logs that can be triaged :) [15:59:59] that triage is usually cancelled due to the WMF staff meeting occurring at the same time [16:00:05] hashar and dduvall: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1600) [16:01:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10610552 (10VRiley-WMF) Hey @Papaul You're correct, I do apologize about that. The drive blanks (fillers for empty slots) made it seem like it was differen...
[16:01:42] (03CR) 10Clément Goubert: conftool: empty jobrunner and videoscaler pools [puppet] - 10https://gerrit.wikimedia.org/r/1125181 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:02:08] (03PS2) 10Ilias Sarantopoulos: ml-services: update reference-quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125110 (https://phabricator.wikimedia.org/T387019) [16:02:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125130 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [16:02:56] (03CR) 10Ilias Sarantopoulos: ml-services: update reference-quality models (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125110 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [16:03:13] (03Merged) 10jenkins-bot: Enable SUL3 signup for 10% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125130 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [16:03:32] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1125130|Enable SUL3 signup for 10% of group 1 users (T384007)]] [16:03:37] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [16:04:33] 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10610561 (10Jhancock.wm) we got these in this week. working on getting everything racked. @Clement_Goubert what's the numerical range for wikikube-ctrl and wikikube-worker for this... 
[16:05:52] (03Abandoned) 10Hnowlan: conftool: empty jobrunner and videoscaler pools [puppet] - 10https://gerrit.wikimedia.org/r/1125181 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:06:15] !log tgr@deploy2002 tgr: Backport for [[gerrit:1125130|Enable SUL3 signup for 10% of group 1 users (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:08:12] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:08:19] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:09:11] (03PS27) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [16:09:33] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:10:00] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:11:31] !log tgr@deploy2002 tgr: Continuing with sync [16:12:24] (03PS28) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [16:12:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1188.eqiad.wmnet with OS bullseye [16:12:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1189.eqiad.wmnet with OS bullseye [16:12:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610587 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1188.eqiad.wmnet with OS... 
[16:12:44] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1190.eqiad.wmnet with OS bullseye [16:12:45] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610588 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1189.eqiad.wmnet with OS... [16:12:49] !log fceratto@cumin1002 START - Cookbook sre.mysql.clone of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [16:12:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610589 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1190.eqiad.wmnet with OS... [16:12:54] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1191.eqiad.wmnet with OS bullseye [16:13:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610591 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1191.eqiad.wmnet with OS... [16:13:03] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1192.eqiad.wmnet with OS bullseye [16:13:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1192.eqiad.wmnet with OS... 
[16:13:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1194.eqiad.wmnet with OS bullseye [16:13:34] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610594 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1194.eqiad.wmnet with OS... [16:13:42] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1195.eqiad.wmnet with OS bullseye [16:13:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610595 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1195.eqiad.wmnet with OS... [16:14:18] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1193.eqiad.wmnet with OS bullseye [16:14:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610602 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1193.eqiad.wmnet with OS... 
[16:15:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1035.eqiad.wmnet [16:15:23] (03PS1) 10Hnowlan: jobrnuner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) [16:17:30] (03PS2) 10Ebernhardson: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [16:17:42] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125130|Enable SUL3 signup for 10% of group 1 users (T384007)]] (duration: 14m 10s) [16:17:47] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [16:18:36] (03PS3) 10Ebernhardson: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [16:19:17] !log UTC afternoon deploys done [16:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:22] 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - https://phabricator.wikimedia.org/T384970#10610651 (10Clement_Goubert) For wikikube-worker: wikikube-worker2244-2329 For wikikube-ctrl: wikikube-ctrl2004-2005 [16:19:23] (03CR) 10Cathal Mooney: [C:03+1] Also exclude Private-Peer from remote_instance:gnmi_bgp_neighbor_session_state [puppet] - 10https://gerrit.wikimedia.org/r/1124795 (https://phabricator.wikimedia.org/T387287) (owner: 10Ayounsi) [16:19:27] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [16:19:39] 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2242-2329 - 
https://phabricator.wikimedia.org/T384970#10610655 (10Clement_Goubert) [16:19:47] (03CR) 10CI reject: [V:04-1] cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [16:19:48] ebernhardson, inflatador: apologies for the rogue rebase on your patch :| [16:20:00] 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2244-2329, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10610658 (10Clement_Goubert) [16:21:04] (03PS2) 10Hnowlan: jobrnuner: reimage the three remaining eqiad in-warranty jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/1125185 (https://phabricator.wikimedia.org/T354791) [16:22:26] hnowlan forever shall you be known as the Rogue Rebaser! [16:24:50] (03CR) 10Ebernhardson: [C:03+1] Remove obsolete $wgFlowMaintenanceMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125124 (owner: 10Hashar) [16:25:01] (03CR) 10Reedy: [C:03+2] CommonSettings-labs.php: Fix $wgSFSValidateIPListLocationMD5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124496 (owner: 10Reedy) [16:25:13] (03CR) 10Reedy: [V:03+2 C:03+2] CommonSettings-labs.php: Fix $wgSFSValidateIPListLocationMD5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124496 (owner: 10Reedy) [16:25:20] (03CR) 10Reedy: [C:03+2] CommonSettings-labs.php: Fix $wgSFSValidateIPListLocationMD5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124496 (owner: 10Reedy) [16:25:21] (03CR) 10Reedy: [C:03+2] Remove $wgExternalLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124515 (owner: 10Reedy) [16:25:22] (03CR) 10Reedy: [C:03+2] Remove $wgTemplateLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124517 (owner: 10Reedy) [16:25:23] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Rename $wgTranslateServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124519 (owner: 10Reedy) 
[16:25:25] 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2244-2329, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10610699 (10Clement_Goubert) [16:25:31] 10ops-codfw, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2333, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10610700 (10Clement_Goubert) [16:25:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusSearchJobQueueLagTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [16:25:47] (03Merged) 10jenkins-bot: CommonSettings-labs.php: Fix $wgSFSValidateIPListLocationMD5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124496 (owner: 10Reedy) [16:26:06] (03PS3) 10Reedy: Remove $wgReadingListsCluster/$wgReadingListsDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124520 [16:26:08] (03Merged) 10jenkins-bot: Remove $wgExternalLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124515 (owner: 10Reedy) [16:26:10] (03Abandoned) 10Reedy: Remove $wgReadingListsCluster/$wgReadingListsDatabase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124520 (owner: 10Reedy) [16:26:14] (03Merged) 10jenkins-bot: Remove $wgTemplateLinksSchemaMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124517 (owner: 10Reedy) [16:26:16] (03Merged) 10jenkins-bot: CommonSettings.php: Rename $wgTranslateServices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124519 (owner: 10Reedy) [16:26:17] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Remove $wgCodeEditorEnableCore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124524 (owner: 10Reedy) [16:26:18] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Remove $wgSecurePollGPGCommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124514 (owner: 10Reedy) [16:26:34] 10ops-codfw, 06DC-Ops, 06serviceops: 
Q3:rack/setup/install wikikube-worker2248-2333, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10610704 (10Clement_Goubert) Sorry for all the in-place changes, I forgot we still had some servers to reinstall/rename in codfw. This range should be good. [16:27:06] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1188.eqiad.wmnet with reason: host reimage [16:27:15] (03Merged) 10jenkins-bot: CommonSettings.php: Remove $wgCodeEditorEnableCore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124524 (owner: 10Reedy) [16:27:18] (03Merged) 10jenkins-bot: CommonSettings.php: Remove $wgSecurePollGPGCommand [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124514 (owner: 10Reedy) [16:27:24] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Remove duplicate load for CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124525 (owner: 10Reedy) [16:27:25] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1190.eqiad.wmnet with reason: host reimage [16:27:33] (03CR) 10CI reject: [V:04-1] CommonSettings.php: Remove duplicate load for CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124525 (owner: 10Reedy) [16:27:46] (03PS4) 10Reedy: CommonSettings.php: Remove duplicate load for CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124525 [16:27:48] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1191.eqiad.wmnet with reason: host reimage [16:27:59] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1192.eqiad.wmnet with reason: host reimage [16:28:02] (03CR) 10Reedy: CommonSettings.php: Remove duplicate load for CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124525 (owner: 10Reedy) [16:28:06] (03CR) 10Reedy: [C:03+2] CommonSettings.php: Remove duplicate load for CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124525 (owner: 10Reedy) [16:28:20] !log 
jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1195.eqiad.wmnet with reason: host reimage [16:28:33] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1194.eqiad.wmnet with reason: host reimage [16:28:56] (03Merged) 10jenkins-bot: CommonSettings.php: Remove duplicate load for CodeEditor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124525 (owner: 10Reedy) [16:29:02] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1193.eqiad.wmnet with reason: host reimage [16:30:37] (03CR) 10Ahmon Dancy: Remove profile::kubernetes::* from role::ci (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1125119 (https://phabricator.wikimedia.org/T288629) (owner: 10Jelto) [16:30:42] (03PS4) 10Bking: cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) [16:31:54] (03CR) 10CI reject: [V:04-1] cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) (owner: 10Bking) [16:32:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1188.eqiad.wmnet with reason: host reimage [16:32:22] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: update reference-quality models (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125110 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [16:32:34] (03CR) 10Ahmon Dancy: [C:03+1] Remove profile::kubernetes::* from role::ci [puppet] - 10https://gerrit.wikimedia.org/r/1125119 (https://phabricator.wikimedia.org/T288629) (owner: 10Jelto) [16:33:46] (03Merged) 10jenkins-bot: ml-services: update reference-quality models [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125110 (https://phabricator.wikimedia.org/T387019) (owner: 10Ilias Sarantopoulos) [16:34:05] (03PS5) 10Bking: 
cirrussearch: Add alerts for thread pool rejections [alerts] - 10https://gerrit.wikimedia.org/r/1125180 (https://phabricator.wikimedia.org/T387745) [16:35:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1190.eqiad.wmnet with reason: host reimage [16:38:51] !log reedy@deploy2002 Synchronized wmf-config/: Various config cleanup (duration: 08m 31s) [16:39:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1195.eqiad.wmnet with reason: host reimage [16:39:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10610775 (10Papaul) @VRiley-WMF no problem. Can you send an email to our Rep and attach the packing slip to the email to let him know that we supposed to... [16:41:46] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti1035.eqiad.wmnet with reason: remove from cluster for reimage [16:41:51] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10610780 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=25ef85c8-8d74-4903-a4fb-449180b148f4) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[16:42:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1194.eqiad.wmnet with reason: host reimage [16:44:04] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti1035 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1125109 (owner: 10Muehlenhoff) [16:46:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1193.eqiad.wmnet with reason: host reimage [16:48:40] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1035.eqiad.wmnet [16:50:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1191.eqiad.wmnet with reason: host reimage [16:52:43] (03CR) 10Alexandros Kosiaris: [C:03+2] ldap::management: Remove absent resource [puppet] - 10https://gerrit.wikimedia.org/r/1123281 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [16:52:58] (03CR) 10Alexandros Kosiaris: [C:03+2] ldap::management: File ownerships to root [puppet] - 10https://gerrit.wikimedia.org/r/1123283 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [16:53:23] (03CR) 10Alexandros Kosiaris: [C:03+2] ldap-admins: Empty group and remove privileges [puppet] - 10https://gerrit.wikimedia.org/r/1123282 (https://phabricator.wikimedia.org/T386472) (owner: 10Alexandros Kosiaris) [16:53:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1192.eqiad.wmnet with reason: host reimage [16:55:16] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:55:28] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1189.eqiad.wmnet with reason: host reimage [16:55:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:55:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1188.eqiad.wmnet with OS bullseye [16:55:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610830 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1188.eqiad.wmnet with OS bull... [16:56:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1196.eqiad.wmnet with OS bullseye [16:57:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610832 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1196.eqiad.wmnet with OS... [16:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:58:13] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:58:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1189.eqiad.wmnet with reason: host reimage [16:58:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:58:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1190.eqiad.wmnet with OS bullseye [16:58:39] 06SRE, 07LDAP, 13Patch-For-Review: ldap-admins POSIX group does not actually give any 
permissions to its members - https://phabricator.wikimedia.org/T386472#10610835 (10akosiaris) 05In progress→03Resolved a:03akosiaris >>! In T386472#10586054, @akosiaris wrote: > Reading the discussion above, I got... [16:58:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1190.eqiad.wmnet with OS bull... [16:58:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1197.eqiad.wmnet with OS bullseye [16:58:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610839 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1197.eqiad.wmnet with OS... [17:00:05] jhathaway and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1700). [17:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:01:18] o/ [17:02:29] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:02:40] o/ [17:02:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:02:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1195.eqiad.wmnet with OS bullseye [17:02:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1195.eqiad.wmnet with OS bull... [17:03:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1198.eqiad.wmnet with OS bullseye [17:03:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610852 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1198.eqiad.wmnet with OS... 
[17:05:55] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:06:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:06:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1194.eqiad.wmnet with OS bullseye [17:06:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610855 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1194.eqiad.wmnet with OS bull... [17:07:07] tgr_: hey! we can roll this out today, but in future lua changes, you can get a normal review with s.ukhe or vgutierrez or anyone from the traffic team and coordinate deployment with them, this is generally too complex for the puppet window [17:07:36] I know I keep saying that to you, sorry -- the window is really just meant for simpler stuff than the kind of work you're doing [17:07:36] happy to proceed :D [17:07:38] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[17:09:09] I guess I'm just not sure how to proceed after getting a +1 from the relevant team, short of scheduling it somewhere [17:09:49] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:09:50] but yeah, sorry, could have just asked [17:10:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:10:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1193.eqiad.wmnet with OS bullseye [17:10:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1193.eqiad.wmnet with OS bull... [17:10:27] the same person can just +2 and merge it whenever you're ready -- if you want to make sure to be around you can schedule a time with them, even use the window if you want, just arrange with that person to be around to deploy it [17:10:38] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [17:10:45] noted [17:11:09] the difference is just, the window is for patches that any SRE can glance at and merge without subject matter expertise in the particular system -- that works for lots of things, but not ATS lua [17:11:35] anyway vgutierrez are you pushing the button or am I? happy either way [17:11:50] merging it :D [17:11:53] thanks! 
[17:11:56] (03CR) 10Vgutierrez: [C:03+2] Update CentralAuth multi-DC rules for SUL3, attempt 2 [puppet] - 10https://gerrit.wikimedia.org/r/1123029 (https://phabricator.wikimedia.org/T363695) (owner: 10Gergő Tisza) [17:12:03] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1196.eqiad.wmnet with reason: host reimage [17:12:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610865 (10Jclark-ctr) [17:14:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1197.eqiad.wmnet with reason: host reimage [17:14:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:15:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:15:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1191.eqiad.wmnet with OS bullseye [17:15:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610872 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1191.eqiad.wmnet with OS bull... 
[17:15:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610873 (10Jclark-ctr) [17:15:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1196.eqiad.wmnet with reason: host reimage [17:15:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1199.eqiad.wmnet with OS bullseye [17:15:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610874 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1199.eqiad.wmnet with OS... [17:15:39] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150 (10bking) 03NEW [17:15:44] (03PS1) 10Scott French: shellbox-media: clean up extra debug logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125194 (https://phabricator.wikimedia.org/T377038) [17:16:06] tgr_: if you tell me which cp server are you hitting at the moment for text I can run puppet there manually so you can validate :) [17:16:30] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:16:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:16:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1192.eqiad.wmnet with OS bullseye [17:16:47] (03CR) 10Effie Mouzeli: [C:03+1] shellbox-media: clean up extra debug logging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1125194 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [17:16:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610889 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1192.eqiad.wmnet with OS bull... [17:16:58] !log installing avahi security updates [17:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610891 (10Jclark-ctr) [17:17:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1200.eqiad.wmnet with OS bullseye [17:17:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610892 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker1200.eqiad.wmnet with OS... 
[17:18:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1197.eqiad.wmnet with reason: host reimage
[17:20:34] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10610899 (Jhancock.wm)
[17:21:41] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ganeti1035.eqiad.wmnet
[17:21:42] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:22:01] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti1035.eqiad.wmnet
[17:22:07] vgutierrez: cp3070
[17:22:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:22:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1189.eqiad.wmnet with OS bullseye
[17:22:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1035.eqiad.wmnet
[17:22:55] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610901 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1189.eqiad.wmnet with OS bull...
[17:22:58] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610902 (Jclark-ctr)
[17:24:19] tgr_: cp3070 has your change live already
[17:25:22] vgutierrez: thanks, can confirm it's working
[17:25:27] cool :D
[17:25:39] it should take around ~30 minutes till it's live globally
[17:25:53] awesome, thanks!
[17:26:31] (CR) Dzahn: [C:+2] zuul: remove gearman wait queue monitoring [puppet] - https://gerrit.wikimedia.org/r/1124857 (https://phabricator.wikimedia.org/T388041) (owner: Dzahn)
[17:26:54] SRE, Infrastructure-Foundations: Integrate Bookworm 12.9 point update - https://phabricator.wikimedia.org/T383537#10610911 (MoritzMuehlenhoff)
[17:27:02] (CR) Dzahn: [C:+2] aptrepo: replace http with https in downloads.linux.hpe.com URLs [puppet] - https://gerrit.wikimedia.org/r/1124877 (https://phabricator.wikimedia.org/T388042) (owner: Dzahn)
[17:30:22] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1199.eqiad.wmnet with reason: host reimage
[17:33:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1199.eqiad.wmnet with reason: host reimage
[17:34:26] (CR) Dzahn: [C:+2] zuul: remove gearman wait queue monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1124857 (https://phabricator.wikimedia.org/T388041) (owner: Dzahn)
[17:36:07] (CR) Dzahn: [C:+1] "simplification is nice" [puppet] - https://gerrit.wikimedia.org/r/1125119 (https://phabricator.wikimedia.org/T288629) (owner: Jelto)
[17:36:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1035.eqiad.wmnet
[17:36:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti1035.eqiad.wmnet
[17:38:37] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:38:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:38:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1196.eqiad.wmnet with OS bullseye
[17:39:04] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610949 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1196.eqiad.wmnet with OS bull...
[17:39:06] (CR) Dzahn: "@Αλέξανδρος is it ok to just merge this and walk away or is there another step to get these deployed? Or would you mind just merging it?" [puppet] - https://gerrit.wikimedia.org/r/1123475 (https://phabricator.wikimedia.org/T385777) (owner: Dzahn)
[17:39:27] (CR) Scott French: "Thanks for the review!" [deployment-charts] - https://gerrit.wikimedia.org/r/1125194 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[17:39:28] (CR) Scott French: [C:+2] shellbox-media: clean up extra debug logging [deployment-charts] - https://gerrit.wikimedia.org/r/1125194 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[17:39:41] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610950 (Jclark-ctr)
[17:40:33] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1178.eqiad.wmnet with OS bullseye
[17:40:40] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10610951 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS b...
[17:40:44] (CR) Dzahn: [C:+2] "ran puppet on alert1002 and contint1002 but did not exactly see puppet removing something, not entirely sure if it's gone or not" [puppet] - https://gerrit.wikimedia.org/r/1124857 (https://phabricator.wikimedia.org/T388041) (owner: Dzahn)
[17:40:52] (Merged) jenkins-bot: shellbox-media: clean up extra debug logging [deployment-charts] - https://gerrit.wikimedia.org/r/1125194 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[17:42:06] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:42:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[17:42:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1197.eqiad.wmnet with OS bullseye
[17:42:32] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610954 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1197.eqiad.wmnet with OS bull...
[17:42:49] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10610955 (Jclark-ctr)
[17:43:33] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[17:44:04] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[17:44:50] (PS2) Ebernhardson: flink-app chart: Use ECS logging configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1124887
[17:44:50] (PS1) Ebernhardson: cirrus: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/1125200
[17:46:14] (CR) Dzahn: [C:+2] "graphite1004: "Failed to open TCP connection to puppetserver1003.eqiad.wmnet:8140" ?! but then puppet continues anyways, did not see it re" [puppet] - https://gerrit.wikimedia.org/r/1124857 (https://phabricator.wikimedia.org/T388041) (owner: Dzahn)
[17:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[17:48:09] (CR) JMeybohm: services: refactor helmfiles for helmfile 0.171.0 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1124431 (https://phabricator.wikimedia.org/T387836) (owner: Jelto)
[17:50:04] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-presto1014.eqiad.wmnet
[17:50:48] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[17:51:23] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[17:53:09] (CR) JMeybohm: [C:+1] Remove profile::kubernetes::* from role::ci [puppet] - https://gerrit.wikimedia.org/r/1125119 (https://phabricator.wikimedia.org/T288629) (owner: Jelto)
[17:54:04] PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:55:03] (CR) Ebernhardson: [C:+2] flink-app chart: Use ECS logging configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1124887 (owner: Ebernhardson)
[17:55:07] (CR) Ebernhardson: [C:+2] cirrus: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/1125200 (owner: Ebernhardson)
[17:55:37] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1178.eqiad.wmnet with reason: host reimage
[17:56:30] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[18:00:05] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1800).
[18:00:05] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1800).
[18:00:12] o/
[18:00:37] (Merged) jenkins-bot: flink-app chart: Use ECS logging configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1124887 (owner: Ebernhardson)
[18:00:39] (Merged) jenkins-bot: cirrus: Update container version [deployment-charts] - https://gerrit.wikimedia.org/r/1125200 (owner: Ebernhardson)
[18:01:06] (CR) Scott French: [C:+2] mw-(api-ext|web): serve 5% of residual traffic on 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:02:13] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1178.eqiad.wmnet with reason: host reimage
[18:02:30] (Merged) jenkins-bot: mw-(api-ext|web): serve 5% of residual traffic on 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1124849 (https://phabricator.wikimedia.org/T383845) (owner: Scott French)
[18:06:14] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:06:32] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:06:44] (PS1) Andrew Bogott: Add cname for keystone.openstack..wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/1125202 (https://phabricator.wikimedia.org/T388137)
[18:08:21] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:08:29] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:10:13] SRE, MediaWiki-extensions-OAuth, The-Wikipedia-Library, Datacenter-Switchover, User-notice-archive: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650#10611022 (JEumerus) Resolved→Open
[18:10:13] PROBLEM - Host an-presto1014 is DOWN: PING CRITICAL - Packet loss = 100%
[18:11:36] SRE, MediaWiki-extensions-OAuth, The-Wikipedia-Library, Datacenter-Switchover, User-notice-archive: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650#10611029 (JEumerus) Reopened this mostly beca...
[18:11:46] nothing for my deploy window this week
[18:13:47] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1200.eqiad.wmnet with OS bullseye
[18:13:53] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10611031 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1200.eqiad.wmnet with OS bull...
[18:14:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[18:14:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1199.eqiad.wmnet with OS bullseye
[18:14:09] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10611032 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1199.eqiad.wmnet with OS bull...
[18:16:18] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:16:29] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:17:27] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:17:44] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:19:00] (PS1) Reedy: wmf-config: Remove orphaned Vector config [mediawiki-config] - https://gerrit.wikimedia.org/r/1125204
[18:19:03] RECOVERY - Hadoop NodeManager on analytics1072 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:20:34] (PS1) Reedy: CommonSettings.php: Rename $wgStatsHost to not look like a $wg [mediawiki-config] - https://gerrit.wikimedia.org/r/1125205
[18:21:08] (PS1) Jgiannelos: pcs: Invalidate summaries on resource change [deployment-charts] - https://gerrit.wikimedia.org/r/1125207 (https://phabricator.wikimedia.org/T387277)
[18:21:16] (CR) Andrew Bogott: [C:+2] Add cname for keystone.openstack..wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/1125202 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[18:21:39] (PS1) Reedy: CommonSettings.php: Remove $wgTranslateDelayedMessageIndexRebuild [mediawiki-config] - https://gerrit.wikimedia.org/r/1125208
[18:22:54] (PS1) JMeybohm: Don't warn if this and the needed release set installed: false [debs/helmfile] - https://gerrit.wikimedia.org/r/1125209 (https://phabricator.wikimedia.org/T387837)
[18:23:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:23:23] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:23:30] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:23:37] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:23:52] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:23:58] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:24:04] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1198.eqiad.wmnet with OS bullseye
[18:24:16] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q3:rack/setup/install an-worker1[187-208] - https://phabricator.wikimedia.org/T386390#10611043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1198.eqiad.wmnet with OS bull...
[18:25:01] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[18:25:01] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10611044 (KFrancis) The agreement is out for signatures. I'll confirm when it's complete. Thanks!
[18:25:11] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1180.eqiad.wmnet with OS bullseye
[18:25:17] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10611045 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1180.eqiad.wmnet with OS b...
[18:26:19] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[18:26:23] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:26:27] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10611048 (VRiley-WMF)
[18:26:33] !log mw-api-ext: migrated 5% of residual PHP 7.4 traffic to 8.1 - T383845
[18:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:36] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:26:56] (CR) Cathal Mooney: [C:+1] Add cname for keystone.openstack..wikimediacloud.org [dns] - https://gerrit.wikimedia.org/r/1125202 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[18:27:08] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[18:27:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1178.eqiad.wmnet with OS bullseye
[18:27:14] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10611051 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS bulls...
[18:28:27] !log andrew@dns1004 START - running authdns-update
[18:28:31] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:28:49] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:29:50] (PS1) Reedy: InitialiseSettings.php: Remove unused NavigationTiming config [mediawiki-config] - https://gerrit.wikimedia.org/r/1125210
[18:30:36] !log andrew@dns1004 END - running authdns-update
[18:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:37:18] (CR) Subramanya Sastry: [C:+1] pcs: Invalidate summaries on resource change [deployment-charts] - https://gerrit.wikimedia.org/r/1125207 (https://phabricator.wikimedia.org/T387277) (owner: Jgiannelos)
[18:37:42] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:37:54] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:38:58] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:39:12] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:40:22] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1180.eqiad.wmnet with reason: host reimage
[18:40:31] (CR) Cwhite: [C:+1] prometheus: trial moving k8s-mlstaging to prometheus2007 [puppet] - https://gerrit.wikimedia.org/r/1124747 (https://phabricator.wikimedia.org/T383232) (owner: Filippo Giunchedi)
[18:43:04] SRE, MediaWiki-extensions-OAuth, The-Wikipedia-Library, Datacenter-Switchover, User-notice-archive: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650#10611073 (matmarex) Can you share the error m...
[18:43:32] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1180.eqiad.wmnet with reason: host reimage
[18:45:21] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:45:29] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:45:41] !log mw-web: migrated 5% of residual PHP 7.4 traffic to 8.1 - T383845
[18:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:44] T383845: MediaWiki on PHP 8.1 production traffic ramp-up - https://phabricator.wikimedia.org/T383845
[18:53:20] (PS1) Andrew Bogott: acme_chief: add SNI for keystone.openstack..wikimediacloud.org [puppet] - https://gerrit.wikimedia.org/r/1125217 (https://phabricator.wikimedia.org/T388137)
[18:53:56] (CR) Andrew Bogott: [C:+2] acme_chief: add SNI for keystone.openstack..wikimediacloud.org [puppet] - https://gerrit.wikimedia.org/r/1125217 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[18:57:38] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10611121 (VRiley-WMF)
[18:58:16] !log ebernhardson@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[18:58:22] !log ebernhardson@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:00:05] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T1900)
[19:04:32] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[19:04:46] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:05:54] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[19:06:18] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[19:06:18] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1180.eqiad.wmnet with OS bullseye
[19:06:29] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.01 - 2025.03.21): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10611134 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1180.eqiad.wmnet with OS bulls...
[19:09:02] !log T379002 start reindex of cirrus cebwiki_content index in eqiad
[19:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:06] T379002: Consider resharding cebwiki_content - https://phabricator.wikimedia.org/T379002
[19:10:16] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-presto1014.eqiad.wmnet
[19:11:28] !log T379002 start reindex of cirrus cebwiki_content index in codfw
[19:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:24:57] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[19:25:45] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[19:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:38:36] (CR) Kamila Součková: [C:+1] pcs: Invalidate summaries on resource change [deployment-charts] - https://gerrit.wikimedia.org/r/1125207 (https://phabricator.wikimedia.org/T387277) (owner: Jgiannelos)
[19:38:40] (PS2) Aaron Schulz: Update Docker images of change-prop services to ones using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1124191 (https://phabricator.wikimedia.org/T381588)
[19:39:47] (CR) Jgiannelos: [C:+2] pcs: Invalidate summaries on resource change [deployment-charts] - https://gerrit.wikimedia.org/r/1125207 (https://phabricator.wikimedia.org/T387277) (owner: Jgiannelos)
[19:41:20] (Merged) jenkins-bot: pcs: Invalidate summaries on resource change [deployment-charts] - https://gerrit.wikimedia.org/r/1125207 (https://phabricator.wikimedia.org/T387277) (owner: Jgiannelos)
[19:42:35] (PS1) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[19:42:57] (CR) CI reject: [V:-1] Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[19:43:18] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1124880 (owner: Cwhite)
[19:43:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:44:57] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[19:45:45] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs
[19:47:08] (PS2) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[19:47:28] (CR) Cwhite: [C:+2] grafana: add quotes around interpolated log variables [puppet] - https://gerrit.wikimedia.org/r/1124880 (owner: Cwhite)
[19:47:30] (CR) CI reject: [V:-1] Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[19:48:31] (PS3) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[19:48:52] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[19:49:07] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[19:49:19] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[19:51:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[19:53:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[19:58:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[19:59:59] (PS4) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[20:02:49] (CR) BCornwall: geo-maps: update South America DCs (part 1/2) (1 comment) [dns] - https://gerrit.wikimedia.org/r/1124192 (https://phabricator.wikimedia.org/T387774) (owner: CDobbins)
[20:02:53] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[20:08:34] (PS5) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[20:08:43] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[20:13:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:14:33] (PS6) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[20:14:45] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[20:20:56] (PS7) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[20:21:32] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[20:21:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[20:24:14] (PS8) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[20:24:18] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[20:26:52] (PS1) Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715)
[20:27:14] (CR) CI reject: [V:-1] community_civicrm: Add profile::community_civicrm::mail [puppet] - https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[20:28:49] (PS9) Cwhite: Profiler: emit both statsd and dogstatsd [mediawiki-config] - https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385)
[20:30:13] (PS2) Dwisehaupt: community_civicrm: Add profile::community_civicrm::mail [puppet] - https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715)
[20:31:23] (PS9) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[20:31:45] (CR) CI reject: [V:-1] Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[20:31:48] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[20:34:15] (PS10) Andrew Bogott: Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137)
[20:34:25] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[20:43:03] (PS1) Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T387277)
[20:43:16] (CR) Dwisehaupt: "This one should be ready for review. I have tested the install and very basic functions in the cloud VPS realm. It will need exim sorted (" [puppet] - https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[20:44:16] (CR) CI reject: [V:-1] pcs: Add missing rules for content pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T387277) (owner: Jgiannelos)
[20:46:24] (PS2) Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T387277)
[20:47:24] (CR) Andrew Bogott: [C:+2] Add a new type of haproxy config template, http-by-host [puppet] - https://gerrit.wikimedia.org/r/1125221 (https://phabricator.wikimedia.org/T388137) (owner: Andrew Bogott)
[20:47:30] (CR) CI reject: [V:-1] pcs: Add missing rules for content pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T387277) (owner: Jgiannelos)
[20:48:39] (CR) Cwhite: Profiler: emit both statsd and dogstatsd (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1081461 (https://phabricator.wikimedia.org/T359385) (owner: Cwhite)
[20:49:30] (PS3) Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T387277)
[20:50:48] (CR) CI reject: [V:-1] pcs: Add missing rules for content pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T387277) (owner: Jgiannelos)
[20:51:50] (CR) Dwisehaupt: "This has started but has some issues so getting it in a changeset so we can sort it out through review."
[puppet] - 10https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [20:55:55] (03PS4) 10Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T387277) [20:57:41] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T2100) [21:00:05] MatmaRex, tgr, and subbu: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] hi. 
my patches are all nops [21:00:16] no-ops* [21:01:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:02:14] (03PS3) 10Cwhite: move statsd config to statsd-global, bump statsd chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1117638 (https://phabricator.wikimedia.org/T359497) [21:02:43] (03PS5) 10Jgiannelos: pcs: Add missing rules for content pregeneration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125225 (https://phabricator.wikimedia.org/T387277) [21:03:33] o/ [21:04:03] (03PS1) 10Bking: cloudelastic: migrate cloudelastic1009 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125226 [21:04:16] (i'm away for a sec) [21:04:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125226 (owner: 10Bking) [21:05:06] sorry, little late for the backport window, but here now for a patch I scheduled [21:06:46] I can deploy [21:09:15] (03PS1) 10Bking: cloudelastic: migrate cloudelastic1010 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125227 [21:10:05] subbu: that patch is merged and reverted [21:10:23] do you want to re-revert it? [21:11:36] (03PS1) 10Andrew Bogott: haproxy hiera: correct new hostname-mapped endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1125228 [21:11:45] (03PS1) 10Bking: cloudelastic: migrate cloudelastic1011 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125229 [21:12:09] I'll do the PrivateSettings change first [21:12:17] (03CR) 10Andrew Bogott: [C:03+2] haproxy hiera: correct new hostname-mapped endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1125228 (owner: 10Andrew Bogott) [21:15:11] yes -- isabelle tried it this morning, and we want to retry it, because it worked everywhere else (locally, beta, patchdemo) and if it still fails, i want to look a bit closer what is going on. 
[21:15:17] i'm back, if anyone is able to deploy my config cleanups. they're not urgent though [21:15:36] (03PS1) 10Andrew Bogott: haproxy hiera: correct new hostname-mapped endpoints again [puppet] - 10https://gerrit.wikimedia.org/r/1125231 [21:16:23] (03CR) 10Andrew Bogott: [C:03+2] haproxy hiera: correct new hostname-mapped endpoints again [puppet] - 10https://gerrit.wikimedia.org/r/1125231 (owner: 10Andrew Bogott) [21:16:46] oh, i need to submit a revert of the revert. i didn't notice that part. [21:17:29] or I can just have scap do it [21:17:37] but the commit message will be uglier [21:17:59] (03PS1) 10Subramanya Sastry: Revert^2 "Fix nested refs with the same name but a different group" [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125232 [21:18:14] ^ [21:19:59] Updated deployment calendar with that commit. [21:20:01] does scap sync-file still work these days? [21:20:32] the PrivateSettings instructions use it, but I don't think I have used it in a couple years [21:20:32] Possibly? But it won't be any faster. [21:20:44] Just sync-world like normal. 
[21:27:35] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host, interfaces up: 113, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:27:37] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:27:57] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:28:13] (03PS1) 10Bking: cloudelastic: migrate cloudelastic1012 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125234 [21:28:33] (03CR) 10Ebernhardson: [C:03+1] "seems reasonable to keep moving forward, per IRC" [puppet] - 10https://gerrit.wikimedia.org/r/1125226 (owner: 10Bking) [21:31:34] (03CR) 10Bking: [C:03+2] cloudelastic: migrate cloudelastic1009 to opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1125226 (owner: 10Bking) [21:31:47] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:32:34] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1009* for ban host prior to reimage - bking@cumin2002 - T387904 [21:32:37] T387904: Migrate Cloudelastic to Opensearch - https://phabricator.wikimedia.org/T387904 [21:32:38] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1009* for ban host prior to reimage - bking@cumin2002 - T387904 [21:35:34] (03PS1) 10Andrew Bogott: Add 
keystone.openstack.eqiad1.wikimediacloud.org endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125235 (https://phabricator.wikimedia.org/T388137) [21:35:35] (03PS1) 10Andrew Bogott: haproxy: make the port 443 host mappings open to outside internet [puppet] - 10https://gerrit.wikimedia.org/r/1125236 [21:35:35] (03PS1) 10Andrew Bogott: cloudweb2002-dev: Change keystone fqdn for horizon in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1125237 [21:36:39] (03PS2) 10Andrew Bogott: haproxy: make the port 443 host mappings open to outside internet [puppet] - 10https://gerrit.wikimedia.org/r/1125236 [21:36:39] (03PS2) 10Andrew Bogott: Add keystone.openstack.eqiad1.wikimediacloud.org endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125235 (https://phabricator.wikimedia.org/T388137) [21:38:11] (03CR) 10Andrew Bogott: [C:03+2] haproxy: make the port 443 host mappings open to outside internet [puppet] - 10https://gerrit.wikimedia.org/r/1125236 (owner: 10Andrew Bogott) [21:38:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124895 (owner: 10Bartosz Dziewoński) [21:38:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124896 (owner: 10Bartosz Dziewoński) [21:38:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711 (owner: 10Bartosz Dziewoński) [21:39:58] (03Merged) 10jenkins-bot: Remove unused $wgDiscussionToolsABTest [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124895 (owner: 10Bartosz Dziewoński) [21:40:03] (03Merged) 10jenkins-bot: Remove unused $wgOATHAuthMultipleDevicesMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124896 (owner: 10Bartosz Dziewoński) [21:40:05] (03Merged) 10jenkins-bot: Deduplicate JsonConfig config [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1122711 (owner: 10Bartosz Dziewoński) [21:40:23] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1124895|Remove unused $wgDiscussionToolsABTest]], [[gerrit:1124896|Remove unused $wgOATHAuthMultipleDevicesMigrationStage]], [[gerrit:1122711|Deduplicate JsonConfig config]] [21:43:12] !log tgr@deploy2002 matmarex, tgr: Backport for [[gerrit:1124895|Remove unused $wgDiscussionToolsABTest]], [[gerrit:1124896|Remove unused $wgOATHAuthMultipleDevicesMigrationStage]], [[gerrit:1122711|Deduplicate JsonConfig config]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:43:39] (03CR) 10Dzahn: "Could I ask you to get reviews from someone in infra foundations? I think it makes the most sense since they operate mail servers, know ab" [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [21:45:53] (03CR) 10Dzahn: "same here as on the related change. I think you are supposed to use postfix instead of exim as that is an ongoing effort by infra foundati" [puppet] - 10https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [21:47:42] FIRING: HelmReleaseBadStatus: Helm release article-descriptions/main on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=article-descriptions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:48:43] not sure how to test the JsonConfig patch but at least Data pages on Commons still seem functional [21:49:08] !log tgr@deploy2002 matmarex, tgr: Continuing with sync [21:53:10] !log otto@deploy2002 Started deploy [analytics/refinery@ec4c468]: 'emergency deploy for gobblin event_default recenchange memory issue' [21:54:46] !log otto@deploy2002 Finished deploy 
[analytics/refinery@ec4c468]: 'emergency deploy for gobblin event_default recenchange memory issue' (duration: 01m 55s) [21:55:24] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124895|Remove unused $wgDiscussionToolsABTest]], [[gerrit:1124896|Remove unused $wgOATHAuthMultipleDevicesMigrationStage]], [[gerrit:1122711|Deduplicate JsonConfig config]] (duration: 15m 00s) [21:56:17] actually that patch doesn't seem noop: https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/3742/console but let's hope for the best [21:57:24] subbu: you are next [21:58:39] ok [21:58:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125232 (owner: 10Subramanya Sastry) [21:59:58] (03Merged) 10jenkins-bot: Revert^2 "Fix nested refs with the same name but a different group" [extensions/Cite] (wmf/1.44.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1125232 (owner: 10Subramanya Sastry) [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250306T2200) [22:00:28] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1125232|Revert^2 "Fix nested refs with the same name but a different group"]] [22:03:17] !log tgr@deploy2002 tgr, ssastry: Backport for [[gerrit:1125232|Revert^2 "Fix nested refs with the same name but a different group"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:03:20] Web team is gonna be using this window today [22:03:35] How much longer will the current deploy be? [22:03:42] tgr_, testing now. [22:03:57] 10-15 min [22:04:15] Sounds good, thank you! [22:05:07] tgr_, ship it! It works! Not sure what happened in the morning. 
[22:06:27] (03PS2) 10Andrew Bogott: cloudweb2002-dev: Change keystone fqdn for horizon in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1125237 [22:06:51] !log tgr@deploy2002 tgr, ssastry: Continuing with sync [22:13:13] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125232|Revert^2 "Fix nested refs with the same name but a different group"]] (duration: 12m 44s) [22:13:21] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:13:37] toyofuku: done [22:13:50] Thank you! [22:14:07] can you ping me when you are finished? I'll deploy one more patch then [22:15:22] tgr_: will do! [22:16:15] bwang: double triple checking the patch I'm deploying is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124510? [22:18:34] got confirmation in slack, we're proceeding [22:19:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by toyofuku@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [22:20:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:20:52] (03Merged) 10jenkins-bot: Enable Search AB test for en wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1124510 (owner: 10Bernard Wang) [22:21:09] !log toyofuku@deploy2002 Started scap sync-world: Backport for [[gerrit:1124510|Enable Search AB test for en wiki]] [22:21:17] and we're off [22:23:52] !log toyofuku@deploy2002 toyofuku, bwang: Backport for [[gerrit:1124510|Enable Search AB test for en wiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:24:03] coordinating testing via slack - brb [22:26:43] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.clone 
(exit_code=0) of db1202.eqiad.wmnet onto db1253.eqiad.wmnet [22:27:45] (03CR) 10Dwisehaupt: "Sure thing, I'll reach out to them." [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [22:28:19] (03CR) 10Dwisehaupt: "Ah interesting. I was thinking of using postfix since that's what we use internally but figured sticking with what was already in place wa" [puppet] - 10https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [22:33:16] we're proceeding [22:33:18] !log toyofuku@deploy2002 toyofuku, bwang: Continuing with sync [22:33:20] (03PS1) 10Ryan Kemper: elastic: add 6 codfw refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125242 (https://phabricator.wikimedia.org/T380529) [22:33:35] (03CR) 10Dwisehaupt: "Jesse, would you be able to take a look at this in conjunction with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125223 around ac" [puppet] - 10https://gerrit.wikimedia.org/r/1124205 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [22:34:08] (03CR) 10Dwisehaupt: "Jesse, would you be able to take a look at this in conjunction with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1124205 around ac" [puppet] - 10https://gerrit.wikimedia.org/r/1125223 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [22:34:14] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125242 (https://phabricator.wikimedia.org/T380529) (owner: 10Ryan Kemper) [22:35:38] (03PS2) 10Ryan Kemper: elastic: add 6 codfw refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125242 (https://phabricator.wikimedia.org/T380529) [22:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:37:44] (03PS3) 10Ryan Kemper: elastic: add 6 codfw refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125242 (https://phabricator.wikimedia.org/T380529) [22:38:48] (03PS2) 10Gergő Tisza: Enable SUL3 signup for 50% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125134 (https://phabricator.wikimedia.org/T384007) [22:39:36] !log toyofuku@deploy2002 Finished scap sync-world: Backport for [[gerrit:1124510|Enable Search AB test for en wiki]] (duration: 18m 27s) [22:40:02] woo that was fast [22:40:06] thank you all! [22:40:10] tgr_: back to you [22:40:19] thanks! [22:41:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125134 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [22:41:47] RECOVERY - MegaRAID on an-worker1066 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:42:09] (03Merged) 10jenkins-bot: Enable SUL3 signup for 50% of group 1 users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125134 (https://phabricator.wikimedia.org/T384007) (owner: 10Gergő Tisza) [22:42:21] (03PS1) 10Ryan Kemper: elastic: decom 6 codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125243 (https://phabricator.wikimedia.org/T380529) [22:42:26] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1125134|Enable SUL3 signup for 50% of group 1 users (T384007)]] [22:42:29] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [22:42:55] (03CR) 10Bking: [C:03+2] elastic: add 6 codfw refresh hosts [puppet] - 10https://gerrit.wikimedia.org/r/1125242 (https://phabricator.wikimedia.org/T380529) (owner: 10Ryan Kemper) [22:43:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy 
WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [22:45:07] !log tgr@deploy2002 tgr: Backport for [[gerrit:1125134|Enable SUL3 signup for 50% of group 1 users (T384007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:55:14] 06SRE, 06Fundraising-Backlog, 10fundraising-tech-ops, 10LDAP-Access-Requests: Grant Access to astein for fr-tech icinga acknowledgements - https://phabricator.wikimedia.org/T388186 (10Dwisehaupt) 03NEW [22:56:58] !log tgr@deploy2002 tgr: Continuing with sync [23:03:21] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1125134|Enable SUL3 signup for 50% of group 1 users (T384007)]] (duration: 20m 55s) [23:03:24] T384007: SUL3 Phase 1: All new account creation on group 0 and group 1 wikis - https://phabricator.wikimedia.org/T384007 [23:10:46] (03PS1) 10JHathaway: puppet: add an ACL puppet module [puppet] - 10https://gerrit.wikimedia.org/r/1125245 (https://phabricator.wikimedia.org/T385995) [23:10:46] (03PS1) 10JHathaway: puppetserver: fix gitpuppet group on puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/1125246 (https://phabricator.wikimedia.org/T385995) [23:10:47] (03PS1) 10JHathaway: puppetserver: add option to manage git permissions with an acl [puppet] - 10https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) [23:11:48] PROBLEM - MegaRAID on an-worker1066 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:13:32] (03CR) 10CI reject: [V:04-1] puppetserver: fix gitpuppet 
group on puppetservers [puppet] - 10https://gerrit.wikimedia.org/r/1125246 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [23:13:46] (03CR) 10CI reject: [V:04-1] puppetserver: add option to manage git permissions with an acl [puppet] - 10https://gerrit.wikimedia.org/r/1125247 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [23:18:33] !log joal@deploy2002 Started deploy [analytics/refinery@64b629d]: emergency deploy for gobblin event_default recenchange memory issue - 2 [23:19:46] !log joal@deploy2002 Finished deploy [analytics/refinery@64b629d]: emergency deploy for gobblin event_default recenchange memory issue - 2 (duration: 01m 13s) [23:22:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2112-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:23:31] (03PS3) 10Andrew Bogott: Add keystone.openstack.eqiad1.wikimediacloud.org endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125235 (https://phabricator.wikimedia.org/T388137) [23:27:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2110-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:27:40] FIRING: [4x] KubernetesRsyslogDown: rsyslog on aux-k8s-worker2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown 
[23:30:18] (03Abandoned) 10Andrew Bogott: cloudweb2002-dev: Change keystone fqdn for horizon in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/1125237 (owner: 10Andrew Bogott) [23:31:16] (03PS1) 10Andrew Bogott: Horizon/idp: access keystone on port 443 [puppet] - 10https://gerrit.wikimedia.org/r/1125249 (https://phabricator.wikimedia.org/T388137) [23:31:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125235 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [23:31:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125249 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [23:33:55] (03CR) 10Andrew Bogott: [C:03+2] Add keystone.openstack.eqiad1.wikimediacloud.org endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1125235 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [23:37:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2110-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:38:09] (03PS2) 10Andrew Bogott: Horizon/idp: access keystone on port 443 [puppet] - 10https://gerrit.wikimedia.org/r/1125249 (https://phabricator.wikimedia.org/T388137) [23:38:09] (03PS1) 10Andrew Bogott: haproxy eqiad1: fix name of keystone public api [puppet] - 10https://gerrit.wikimedia.org/r/1125252 [23:38:44] (03CR) 10Andrew Bogott: [C:03+2] haproxy eqiad1: fix name of keystone public api [puppet] - 10https://gerrit.wikimedia.org/r/1125252 (owner: 10Andrew Bogott) [23:42:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2110-production-search-codfw is not indexing - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:44:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10612051 (10phaultfinder) [23:46:50] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125249 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [23:49:50] (03PS3) 10Andrew Bogott: Horizon/idp: access keystone on port 443 [puppet] - 10https://gerrit.wikimedia.org/r/1125249 (https://phabricator.wikimedia.org/T388137) [23:50:08] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1125249 (https://phabricator.wikimedia.org/T388137) (owner: 10Andrew Bogott) [23:52:39] RESOLVED: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2110-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:56:13] (03PS4) 10Andrew Bogott: Horizon/idp: access keystone on port 443 [puppet] - 10https://gerrit.wikimedia.org/r/1125249 (https://phabricator.wikimedia.org/T388137)