[00:03:54] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) @cmooney i am having issues with Racks e4 and f4 these are cloud public vlan in new cage Wmcs... [00:12:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) [00:16:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1008-dev [00:16:37] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [00:37:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol1008-dev.mgmt.eqiad.wmnet with reboot policy FORCED [00:38:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:38:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/960685 [00:38:23] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:38:27] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/960685 (owner: 10TrainBranchBot) [00:38:35] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:38:40] !log jclark@cumin1001 END (FAIL) 
- Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:38:52] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:45:56] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:50:08] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:50:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:50:49] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:50:56] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudcontrol1008-dev'] [00:51:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) [00:53:47] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/960685 (owner: 10TrainBranchBot) [01:05:10] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:45:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10AKhatun_WMF) Thanks @colewhite. I'm all set! 
[01:45:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10AKhatun_WMF) >>! In T346796#9192790, @AKhatun_WMF wrote: > I am getting this error when I kinit > `kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database whil... [01:59:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1242.eqiad.wmnet with OS bullseye [01:59:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1242.eqiad.wmnet with OS bullseye [02:00:42] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1243.eqiad.wmnet with OS bullseye [02:00:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1243.eqiad.wmnet with OS bullseye [02:01:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1244.eqiad.wmnet with OS bullseye [02:01:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1244.eqiad.wmnet with OS bullseye [02:03:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1245.eqiad.wmnet with OS bullseye [02:03:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1245.eqiad.wmnet with OS bullseye [02:04:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1246.eqiad.wmnet 
with OS bullseye [02:04:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1246.eqiad.wmnet with OS bullseye [02:05:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1247.eqiad.wmnet with OS bullseye [02:05:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1247.eqiad.wmnet with OS bullseye [02:06:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1248.eqiad.wmnet with OS bullseye [02:06:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1248.eqiad.wmnet with OS bullseye [02:07:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1249.eqiad.wmnet with OS bullseye [02:07:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1249.eqiad.wmnet with OS bullseye [02:12:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1242.eqiad.wmnet with reason: host reimage [02:13:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1243.eqiad.wmnet with reason: host reimage [02:14:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1244.eqiad.wmnet with reason: host reimage [02:14:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1243.eqiad.wmnet with 
reason: host reimage [02:15:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1242.eqiad.wmnet with reason: host reimage [02:15:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1245.eqiad.wmnet with reason: host reimage [02:17:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [02:17:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1244.eqiad.wmnet with reason: host reimage [02:18:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1247.eqiad.wmnet with reason: host reimage [02:18:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:19:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1248.eqiad.wmnet with reason: host reimage [02:19:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1248.eqiad.wmnet with reason: host reimage [02:20:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1249.eqiad.wmnet with reason: host reimage [02:20:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1246.eqiad.wmnet with reason: host reimage [02:23:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1247.eqiad.wmnet with reason: host reimage [02:25:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1249.eqiad.wmnet with reason: host reimage [02:25:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db1245.eqiad.wmnet with 
reason: host reimage [02:27:49] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:31:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:31:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:31:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1243.eqiad.wmnet with OS bullseye [02:31:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1243.eqiad.wmnet with OS bullseye completed: - db1243 (**WARN**) - Removed f... [02:33:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:33:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1242.eqiad.wmnet with OS bullseye [02:33:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1242.eqiad.wmnet with OS bullseye completed: - db1242 (**PASS**) - Removed f... 
[02:33:32] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:34:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:35:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:35:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1244.eqiad.wmnet with OS bullseye [02:35:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1244.eqiad.wmnet with OS bullseye completed: - db1244 (**PASS**) - Removed f... 
[02:35:20] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:37:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:38:46] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:39:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:39:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1246.eqiad.wmnet with OS bullseye [02:39:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1246.eqiad.wmnet with OS bullseye completed: - db1246 (**PASS**) - Removed f... 
[02:39:31] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:39:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:39:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1248.eqiad.wmnet with OS bullseye [02:40:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1248.eqiad.wmnet with OS bullseye completed: - db1248 (**WARN**) - Removed f... [02:40:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:40:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1247.eqiad.wmnet with OS bullseye [02:40:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1247.eqiad.wmnet with OS bullseye completed: - db1247 (**PASS**) - Removed f... 
[02:41:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:41:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:42:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:42:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1249.eqiad.wmnet with OS bullseye [02:42:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [02:42:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1245.eqiad.wmnet with OS bullseye [02:42:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1249.eqiad.wmnet with OS bullseye completed: - db1249 (**PASS**) - Removed f... [02:42:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1245.eqiad.wmnet with OS bullseye completed: - db1245 (**WARN**) - Removed f... 
[02:48:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Jhancock.wm) [02:54:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @Marostegui this is completed [03:08:46] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:56:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T343198)', diff saved to https://phabricator.wikimedia.org/P52706 and previous config saved to /var/cache/conftool/dbconfig/20230928-035657-arnaudb.json [03:57:04] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [04:12:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P52707 and previous config saved to /var/cache/conftool/dbconfig/20230928-041204-arnaudb.json [04:21:14] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:27:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P52708 and previous config saved to /var/cache/conftool/dbconfig/20230928-042710-arnaudb.json [04:36:08] PROBLEM - OSPF status on cr1-drmrs is 
CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:36:18] PROBLEM - BFD status on cr1-drmrs is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:42:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T343198)', diff saved to https://phabricator.wikimedia.org/P52709 and previous config saved to /var/cache/conftool/dbconfig/20230928-044216-arnaudb.json [04:42:19] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [04:42:23] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [04:42:32] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [04:42:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T343198)', diff saved to https://phabricator.wikimedia.org/P52710 and previous config saved to /var/cache/conftool/dbconfig/20230928-044238-arnaudb.json [04:53:59] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:58:59] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [04:59:40] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:59:48] RECOVERY - BFD status on 
cr1-drmrs is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:02:10] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:14:06] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:16:08] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:16:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.228 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:17:26] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:18:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Marostegui) Thank you!! 
[05:21:22] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:22:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:54:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (10Marostegui) [05:59:14] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T0600) [06:00:04] kormat, marostegui, and Amir1: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T0600) [06:13:38] 10SRE, 10MW-on-K8s, 10Traffic, 10Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (10Joe) @Jdforrester-WMF no, this task is actually about that patch not having the effect we expected. [06:15:28] 10SRE, 10MW-on-K8s, 10Traffic, 10Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (10Joe) Interestingly, I do get correct results for m.wikidata.org, but somehow not for www.wikidata.org (also, please grep for `mw-web` as we've repooled eqiad in the... 
[06:40:02] 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347532 (10JJMC89) [06:40:06] 10SRE, 10Cloud-VPS, 10Toolforge: At least one commons tool timing out - https://phabricator.wikimedia.org/T347533 (10JJMC89) [06:53:19] (03PS1) 10Marostegui: install_server: Do not reimage pc1016 [puppet] - 10https://gerrit.wikimedia.org/r/961682 [06:53:41] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for AKhatun - https://phabricator.wikimedia.org/T347546 (10MGerlach) [06:54:03] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage pc1016 [puppet] - 10https://gerrit.wikimedia.org/r/961682 (owner: 10Marostegui) [07:00:05] Amir1, apergos, and jnuche: #bothumor I � Unicode. All rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T0700). [07:00:42] morning! there are no trainees signed up to learn about deployment today, thank goodness, because we have no patches scheduled for deployment! 
[07:05:12] (03CR) 10Tim Starling: Update README file (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum) [07:08:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:36] (03CR) 10Muehlenhoff: [C: 03+2] firewall: Also move the sysctl under the manage_nf_conntrack conditional [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:09:44] (03PS2) 10Muehlenhoff: firewall: Also move the sysctl under the manage_nf_conntrack conditional [puppet] - 10https://gerrit.wikimedia.org/r/961400 (https://phabricator.wikimedia.org/T336497) [07:16:16] (03PS2) 10Muehlenhoff: cloudgw: Don't override conntrack settings from firewall profile [puppet] - 10https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) [07:17:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [07:23:33] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, 10serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10JMeybohm) [07:25:38] <_joe_> !log restarting trafficserver on cp1081 T347493 [07:26:13] _joe_: Failed to log message to wiki. Somebody should check the error logs. 
[07:26:13] T347493: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 [07:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:26:51] <_joe_> uhm stashbot not working? [07:26:59] i'll poke it [07:27:13] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [07:27:13] <_joe_> is the problem stashbot or wikitech? :) [07:27:26] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [07:27:44] requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='wikitech.wikimedia.org', port=443): Read timed out. (read timeout=30) [07:27:45] that's odd [07:27:52] <_joe_> indeed [07:28:08] <_joe_> taavi: it did connect correctly to phabricator OTOH [07:28:22] !log test [07:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:03] <_joe_> ofc a test would work [07:30:07] <_joe_> :) [07:33:00] 10SRE, 10MW-on-K8s, 10Traffic, 10Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (10Joe) I tried restarting ATS on a backend, cp1081, then made requests for wikidata's special:random to trafficserver directly: still all going to appservers on bare m... [07:36:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10Ladsgroup) Thank you!! 
[07:37:44] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/961442 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah) [07:38:33] (03CR) 10Muehlenhoff: [C: 03+2] Make the dbconfig settings conditional on the hdb backend [puppet] - 10https://gerrit.wikimedia.org/r/961352 (https://phabricator.wikimedia.org/T292942) (owner: 10Muehlenhoff) [07:40:42] (03PS1) 10Giuseppe Lavagetto: wikidata: add mw-on-k8s routing [puppet] - 10https://gerrit.wikimedia.org/r/961684 (https://phabricator.wikimedia.org/T347493) [07:42:48] (03PS1) 10Ilias Sarantopoulos: ores-legacy: fix 5xx errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/961686 (https://phabricator.wikimedia.org/T347480) [07:43:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43680/console" [puppet] - 10https://gerrit.wikimedia.org/r/961442 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah) [07:43:17] 10SRE, 10MW-on-K8s, 10Traffic, 10Wikidata, and 3 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (10Joe) Well turns out the issue was simpler: we even had a TODO in the code: ` # TODO: add mw-on-k8s once we think of moving wikidata or partial traffic. ` Sigh. Tha... 
[07:43:48] (03CR) 10Majavah: [V: 03+1 C: 03+2] Take cloudcontrol1006 out of service [puppet] - 10https://gerrit.wikimedia.org/r/961442 (https://phabricator.wikimedia.org/T346891) (owner: 10Majavah) [07:44:59] !log taavi@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudcontrol1006.wikimedia.org [07:45:02] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ores-legacy: fix 5xx errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/961686 (https://phabricator.wikimedia.org/T347480) (owner: 10Ilias Sarantopoulos) [07:45:52] (03Merged) 10jenkins-bot: ores-legacy: fix 5xx errors [deployment-charts] - 10https://gerrit.wikimedia.org/r/961686 (https://phabricator.wikimedia.org/T347480) (owner: 10Ilias Sarantopoulos) [07:46:47] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [07:47:29] !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [07:48:00] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[07:50:40] 10SRE, 10Abstract Wikipedia team, 10Traffic, 10Wikifunctions, 10serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10JMeybohm) [07:51:33] !log taavi@cumin1001 START - Cookbook sre.dns.netbox [07:52:58] (03PS39) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [07:53:00] (03PS1) 10AOkoth: clamav: disable ConcurrentDatabase Reloads [puppet] - 10https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) [07:53:55] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1006.wikimedia.org decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [07:55:40] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudcontrol1006.wikimedia.org decommissioned, removing all IPs except the asset tag one - taavi@cumin1001" [07:55:40] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:55:41] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudcontrol1006.wikimedia.org [07:55:53] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by taavi@cumin1001 for hosts: `cloudcontrol1006.wikimed... 
[07:56:27] (PS2) AOkoth: clamav: disable ConcurrentDatabaseReload [puppet] - https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) [07:56:47] SRE, ops-eqiad, Patch-For-Review, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (taavi) [07:57:28] (CR) CI reject: [V: -1] Render an environment file for kafka-kit to reduce manual toil [puppet] - https://gerrit.wikimedia.org/r/961685 (https://phabricator.wikimedia.org/T346764) (owner: Brouberol) [07:58:30] SRE, ops-eqiad, Patch-For-Review, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (taavi) a:taavi→Jclark-ctr Hi, this host is ready to be moved. Thanks! [08:07:03] (CR) Muehlenhoff: clamav: disable ConcurrentDatabaseReload (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) (owner: AOkoth) [08:07:59] (CR) JMeybohm: [C: +2] Update chromium-render to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [08:08:25] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [08:08:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [08:08:48] (Merged) jenkins-bot: Update chromium-render to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [08:13:16] (PS2) Muehlenhoff: ssh: Disable ChallengeResponseAuthentication [puppet] - https://gerrit.wikimedia.org/r/959895 [08:14:05] (CR) Brouberol: [V: +1] "Ready for review" [puppet] - 
https://gerrit.wikimedia.org/r/961685 (https://phabricator.wikimedia.org/T346764) (owner: Brouberol) [08:14:11] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [08:14:39] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [08:16:04] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM after previous patch was merged. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [08:23:19] (CR) Muehlenhoff: [C: +2] cloudgw: Don't override conntrack settings from firewall profile [puppet] - https://gerrit.wikimedia.org/r/961377 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [08:25:28] (PS4) Mhorsey: Enable Campaigns email on test wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) [08:30:01] (PS1) Jelto: gitlab: use static name in failover backup, increase concurrency [puppet] - https://gerrit.wikimedia.org/r/961694 (https://phabricator.wikimedia.org/T345590) [08:31:16] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/961685 (https://phabricator.wikimedia.org/T346764) (owner: Brouberol) [08:32:03] (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43690/console" [puppet] - https://gerrit.wikimedia.org/r/961694 (https://phabricator.wikimedia.org/T345590) (owner: Jelto) [08:34:25] (CR) AOkoth: clamav: disable ConcurrentDatabaseReload (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) (owner: AOkoth) [08:35:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T343198)', diff saved to https://phabricator.wikimedia.org/P52711 and previous config saved to /var/cache/conftool/dbconfig/20230928-083513-arnaudb.json 
[08:35:19] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [08:40:45] (CR) Stevemunene: [C: +1] Add OIDC/datahub stub secret [labs/private] - https://gerrit.wikimedia.org/r/961094 (https://phabricator.wikimedia.org/T305874) (owner: Muehlenhoff) [08:41:50] (CR) Pcoombe: Allow FundraiseUp scripts in Donatewiki CSP (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: Ejegg) [08:43:05] (CR) Muehlenhoff: [V: +2 C: +2] Add OIDC/datahub stub secret [labs/private] - https://gerrit.wikimedia.org/r/961094 (https://phabricator.wikimedia.org/T305874) (owner: Muehlenhoff) [08:45:52] (PS1) Jgiannelos: wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - https://gerrit.wikimedia.org/r/961696 [08:46:50] (PS2) Jgiannelos: wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - https://gerrit.wikimedia.org/r/961696 [08:49:23] (CR) Clément Goubert: [C: +1] wikidata: add mw-on-k8s routing [puppet] - https://gerrit.wikimedia.org/r/961684 (https://phabricator.wikimedia.org/T347493) (owner: Giuseppe Lavagetto) [08:50:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P52712 and previous config saved to /var/cache/conftool/dbconfig/20230928-085019-arnaudb.json [08:51:43] (PS1) Btullis: Bump the namenode heap by 4GB on the Hadoop masters [puppet] - https://gerrit.wikimedia.org/r/961698 (https://phabricator.wikimedia.org/T342587) [08:53:38] (PS1) Stevemunene: Disable WMDE misc jobs on stat1007 [puppet] - https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) [08:53:57] (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43691/console" [puppet] - 
https://gerrit.wikimedia.org/r/961698 (https://phabricator.wikimedia.org/T342587) (owner: Btullis) [08:57:01] (CR) Muehlenhoff: [C: +1] clamav: disable ConcurrentDatabaseReload (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) (owner: AOkoth) [08:57:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudgw2002-dev.codfw.wmnet [08:59:38] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [08:59:48] (CR) Stevemunene: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43692/console" [puppet] - https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene) [08:59:48] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [09:00:43] (CR) Giuseppe Lavagetto: [C: +2] wikidata: add mw-on-k8s routing [puppet] - https://gerrit.wikimedia.org/r/961684 (https://phabricator.wikimedia.org/T347493) (owner: Giuseppe Lavagetto) [09:00:48] (PS1) Ilias Sarantopoulos: ml-services: fix more 5xx errors [deployment-charts] - https://gerrit.wikimedia.org/r/961700 (https://phabricator.wikimedia.org/T347480) [09:02:03] (CR) Ilias Sarantopoulos: [C: +2] ml-services: fix more 5xx errors [deployment-charts] - https://gerrit.wikimedia.org/r/961700 (https://phabricator.wikimedia.org/T347480) (owner: Ilias Sarantopoulos) [09:02:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2002-dev.codfw.wmnet [09:02:55] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: Jelto) [09:03:18] (Merged) jenkins-bot: ml-services: fix more 5xx errors [deployment-charts] - https://gerrit.wikimedia.org/r/961700 (https://phabricator.wikimedia.org/T347480) (owner: Ilias Sarantopoulos) [09:04:07] 
!log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1010.eqiad.wmnet with reason: rebooting backup1010 [09:04:20] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1010.eqiad.wmnet with reason: rebooting backup1010 [09:04:22] (PS1) Slyngshede: Limit global account linking to LDAP properties page. [software/bitu] - https://gerrit.wikimedia.org/r/961702 [09:04:42] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [09:04:43] (CR) Brouberol: [V: +1 C: +2] Render an environment file for kafka-kit to reduce manual toil [puppet] - https://gerrit.wikimedia.org/r/961685 (https://phabricator.wikimedia.org/T346764) (owner: Brouberol) [09:04:58] !log isaranto@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:05:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P52713 and previous config saved to /var/cache/conftool/dbconfig/20230928-090526-arnaudb.json [09:05:28] !log isaranto@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [09:05:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudgw2003-dev.codfw.wmnet [09:05:32] (PS2) Slyngshede: Limit global account linking to LDAP properties page. [software/bitu] - https://gerrit.wikimedia.org/r/961702 [09:05:37] (CR) Jbond: [V: +1 C: +2] puppet::expose_certs: automatically include the ca chain on puppet7 [puppet] - https://gerrit.wikimedia.org/r/961406 (https://phabricator.wikimedia.org/T340741) (owner: Jbond) [09:05:47] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[09:06:07] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [09:09:33] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [09:09:51] (CR) Muehlenhoff: [C: +2] Add a cookbook to restart/reboot maps masters [cookbooks] - https://gerrit.wikimedia.org/r/961074 (owner: Muehlenhoff) [09:10:52] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [09:11:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reboot-single for host backup1010.eqiad.wmnet [09:12:40] (CR) Klausman: "This change is ready for review." [grafana-grizzly] - https://gerrit.wikimedia.org/r/961701 (https://phabricator.wikimedia.org/T334182) (owner: Klausman) [09:13:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2003-dev.codfw.wmnet [09:13:12] SRE, Abstract Wikipedia team, Traffic, Wikifunctions, serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (JMeybohm) a:JMeybohm [09:16:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host backup1010.eqiad.wmnet [09:19:51] (CR) Elukey: "What does grr preview says?" 
[grafana-grizzly] - https://gerrit.wikimedia.org/r/961701 (https://phabricator.wikimedia.org/T334182) (owner: Klausman) [09:20:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T343198)', diff saved to https://phabricator.wikimedia.org/P52714 and previous config saved to /var/cache/conftool/dbconfig/20230928-092032-arnaudb.json [09:20:35] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [09:20:37] (CR) Klausman: SLOs: Add SLO for Liftwing Readability isvc (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/961701 (https://phabricator.wikimedia.org/T334182) (owner: Klausman) [09:20:38] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [09:20:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [09:20:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [09:21:04] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [09:21:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T343198)', diff saved to https://phabricator.wikimedia.org/P52715 and previous config saved to /var/cache/conftool/dbconfig/20230928-092109-arnaudb.json [09:21:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:57] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/959895 (owner: Muehlenhoff) [09:25:10] (PS3) Elukey: modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - https://gerrit.wikimedia.org/r/961379 [09:25:21] (PS1) Jbond: 
syslog::centralserver: use mTLS for blackbox check [puppet] - https://gerrit.wikimedia.org/r/961703 [09:25:58] (CR) CI reject: [V: -1] modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - https://gerrit.wikimedia.org/r/961379 (owner: Elukey) [09:27:04] (CR) Elukey: [C: +1] SLOs: Add SLO for Liftwing Readability isvc [grafana-grizzly] - https://gerrit.wikimedia.org/r/961701 (https://phabricator.wikimedia.org/T334182) (owner: Klausman) [09:28:33] (PS4) Elukey: modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - https://gerrit.wikimedia.org/r/961379 [09:28:35] (PS2) Majavah: Cleanup remains of haproxy-on-cloudcontrol [puppet] - https://gerrit.wikimedia.org/r/961443 [09:28:37] (PS1) Majavah: openstack: magnum: Remove redundant transport_url setting [puppet] - https://gerrit.wikimedia.org/r/961726 [09:28:39] (PS1) Majavah: openstack: rename openstack_controllers to memcached_nodes [puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) [09:29:18] (CR) Muehlenhoff: [C: +2] ssh: Disable ChallengeResponseAuthentication [puppet] - https://gerrit.wikimedia.org/r/959895 (owner: Muehlenhoff) [09:29:28] (CR) CI reject: [V: -1] openstack: rename openstack_controllers to memcached_nodes [puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) (owner: Majavah) [09:30:17] (CR) Btullis: "Should we remove the statistics::wmde class as well, rather than leave the files lying around unused? It's not applied to any other host." 
[puppet] - https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene) [09:32:57] (PS2) Majavah: openstack: rename openstack_controllers to memcached_nodes [puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) [09:34:14] (CR) JMeybohm: [C: +1] Update push-notifications to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/961078 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli) [09:34:20] (PS1) Elukey: Upgrade mesh and ingress modules for python-webapp [deployment-charts] - https://gerrit.wikimedia.org/r/961730 (https://phabricator.wikimedia.org/T347344) [09:34:22] (PS1) Elukey: ml-services: enable base CORS headers policy for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/961731 (https://phabricator.wikimedia.org/T347344) [09:35:16] (CR) CI reject: [V: -1] openstack: rename openstack_controllers to memcached_nodes [puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) (owner: Majavah) [09:35:30] (CR) JMeybohm: [C: +1] Update tegola-vector-tiles to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/961077 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli) [09:36:17] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 12 CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43694/console" [puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) (owner: Majavah) [09:36:50] (PS3) Majavah: openstack: rename openstack_controllers to memcached_nodes [puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) [09:37:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:37:33] PROBLEM - mailman 
list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:39:52] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 12 CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43695/console" [puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) (owner: Majavah) [09:39:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50568 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:40:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:40:34] (PS2) Giuseppe Lavagetto: trafficserver: move 10% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/957859 (https://phabricator.wikimedia.org/T346422) [09:41:37] (CR) Clément Goubert: [C: +1] trafficserver: move 10% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/957859 (https://phabricator.wikimedia.org/T346422) (owner: Giuseppe Lavagetto) [09:43:47] (CR) Stevemunene: [V: +1] Disable WMDE misc jobs on stat1007 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene) [09:44:54] (CR) Giuseppe Lavagetto: [C: +2] trafficserver: move 10% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/957859 (https://phabricator.wikimedia.org/T346422) (owner: Giuseppe Lavagetto) [09:45:35] !log depool cp4037 to restart varnish and apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/960112 (T347192) [09:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:41] T347192: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 [09:46:15] (PS2) Stevemunene: Disable 
WMDE misc jobs on stat1007 [puppet] - https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) [09:48:20] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet [09:50:12] (CR) Stevemunene: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43696/console" [puppet] - https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: Stevemunene) [09:50:25] (CR) Ilias Sarantopoulos: [C: +1] ml-services: enable base CORS headers policy for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/961731 (https://phabricator.wikimedia.org/T347344) (owner: Elukey) [09:50:46] (CR) JMeybohm: [C: +2] Update machinetranslation to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/960625 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [09:51:07] <_joe_> !log running puppet on cp-text to move mw on k8s to 10% [09:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:28] (Abandoned) Arturo Borrero Gonzalez: cloudgw: load nf_conntrack sysctl settings later [puppet] - https://gerrit.wikimedia.org/r/961376 (https://phabricator.wikimedia.org/T347469) (owner: Arturo Borrero Gonzalez) [09:51:33] (Merged) jenkins-bot: Update machinetranslation to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/960625 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm) [09:51:55] SRE, ops-codfw, User-aborrero, cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (cmooney) >>! In T346724#9204080, @cmooney wrote: > So we should first asses what racks are being made available, what hardware is already in them,... 
[09:52:16] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [09:52:18] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [09:54:47] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [09:54:58] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [09:58:26] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [09:59:48] (PS1) Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - https://gerrit.wikimedia.org/r/961735 [10:00:05] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T1000). [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T1000) [10:00:32] SRE, ops-eqiad, Data-Platform-SRE: Degraded RAID on dbstore1005 - https://phabricator.wikimedia.org/T347449 (BTullis) Open→Resolved Thanks @Jclark-ctr - but everything is OK now with this server so no further action is required at the moment. It looks like the RAID controller must have had a... 
[10:01:37] (CR) CI reject: [V: -1] profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - https://gerrit.wikimedia.org/r/961735 (owner: Jbond) [10:04:06] (PS2) Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - https://gerrit.wikimedia.org/r/961735 [10:05:23] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/961735 (owner: Jbond) [10:07:04] (PS2) Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - https://gerrit.wikimedia.org/r/961703 [10:07:17] (PS3) Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - https://gerrit.wikimedia.org/r/961735 [10:07:37] (CR) Arturo Borrero Gonzalez: firewall: move conntrack logic to firewall module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/953276 (https://phabricator.wikimedia.org/T336497) (owner: Jbond) [10:08:12] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [10:10:48] (PS1) Majavah: P:mariadb: drop special firewall rules from m5 servers [puppet] - https://gerrit.wikimedia.org/r/961736 [10:10:50] (PS1) Majavah: P:mariadb::ferm_wmcs: cleanup unused rules [puppet] - https://gerrit.wikimedia.org/r/961737 [10:12:31] (PS4) Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - https://gerrit.wikimedia.org/r/961735 [10:12:35] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43697/console" [puppet] - https://gerrit.wikimedia.org/r/961737 (owner: Majavah) [10:13:10] (CR) Joal: [C: +1] "Thanks Ben :)" [puppet] - https://gerrit.wikimedia.org/r/961698 (https://phabricator.wikimedia.org/T342587) (owner: Btullis) [10:13:38] (CR) Btullis: [V: +1 C: +2] Bump the namenode heap by 4GB on the Hadoop masters [puppet] - https://gerrit.wikimedia.org/r/961698 
(https://phabricator.wikimedia.org/T342587) (owner: Btullis) [10:14:15] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [10:14:53] (PS1) Slyngshede: Add URI validator [software/bitu] - https://gerrit.wikimedia.org/r/961738 [10:15:52] SRE, MW-on-K8s, Traffic, Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (Lucas_Werkmeister_WMDE) Seems to be working now, thanks a lot for fixing it! `lang=shell-session $ for i in {1..100}; do curl -sIH 'User-Agent: test-Iebdc15b19b (lu... [10:17:15] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert) [10:17:35] SRE, MW-on-K8s, Traffic, Wikidata, and 2 others: Serve Wikidata traffic via Kubernetes - https://phabricator.wikimedia.org/T347493 (Clement_Goubert) Open→Resolved a:Clement_Goubert Yes, confirmed now working. Resolving. 
[10:19:17] (CR) Klausman: [C: +1] modules: duplicate ingress:istio_1.0.2 to 1.0.3 [deployment-charts] - https://gerrit.wikimedia.org/r/961378 (owner: Elukey) [10:20:05] (CR) Klausman: [C: +1] modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - https://gerrit.wikimedia.org/r/961379 (owner: Elukey) [10:20:53] (CR) Klausman: [C: +1] Upgrade mesh and ingress modules for python-webapp [deployment-charts] - https://gerrit.wikimedia.org/r/961730 (https://phabricator.wikimedia.org/T347344) (owner: Elukey) [10:21:00] (CR) Klausman: [C: +1] ml-services: enable base CORS headers policy for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/961731 (https://phabricator.wikimedia.org/T347344) (owner: Elukey) [10:21:58] (CR) EoghanGaffney: [C: +1] gitlab failover: use puppet-managed backup script [cookbooks] - https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: Jelto) [10:22:09] (CR) EoghanGaffney: [C: +1] gitlab: use static name in failover backup, increase concurrency [puppet] - https://gerrit.wikimedia.org/r/961694 (https://phabricator.wikimedia.org/T345590) (owner: Jelto) [10:22:19] SRE, Infrastructure-Foundations, serviceops-radar, Patch-For-Review, Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (jbond) [10:27:50] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[10:30:30] SRE, Infrastructure-Foundations, Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastrcutre - https://phabricator.wikimedia.org/T347565 (jbond) [10:30:55] SRE, Infrastructure-Foundations, Observability-Logging, Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastrcutre - https://phabricator.wikimedia.org/T347565 (jbond) p:Triage→Medium [10:32:21] (PS3) Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [10:32:23] (PS5) Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [10:32:25] (PS1) Jbond: profile::syslog::remote: create variables for cert and key [puppet] - https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [10:32:27] (PS1) Jbond: profile::syslog::remote: Add support for pki [puppet] - https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [10:32:53] (PS2) Jbond: rsyslog: switch the endpoints to use the PKI system [puppet] - https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) [10:33:31] (PS1) Ammarpad: wikifunctionswiki: Disable NearbyPages [mediawiki-config] - https://gerrit.wikimedia.org/r/961742 (https://phabricator.wikimedia.org/T345459) [10:33:36] (Nonwrite HTTP requests with primary DB connections alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [10:34:51] (PS4) Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [10:35:02] (PS6) Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - https://gerrit.wikimedia.org/r/961735 
(https://phabricator.wikimedia.org/T347565) [10:35:10] (PS2) Jbond: profile::syslog::remote: create variables for cert and key [puppet] - https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [10:35:22] (PS2) Jbond: profile::syslog::remote: Add support for pki [puppet] - https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [10:36:15] (CR) CI reject: [V: -1] profile::syslog::remote: Add support for pki [puppet] - https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: Jbond) [10:36:20] (CR) Arturo Borrero Gonzalez: [C: -1] openstack: rename openstack_controllers to memcached_nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) (owner: Majavah) [10:36:36] (PS1) Effie Mouzeli: (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [10:36:42] (CR) Arturo Borrero Gonzalez: [C: -1] "I really like this change BTW :-) thanks! We definitely needs more of these." 
[puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) (owner: Majavah) [10:37:19] (CR) CI reject: [V: -1] (WIP) mcrouter: add chart [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli) [10:38:26] (PS1) Jbond: pki::root_ca: add new intermediate for syslog [puppet] - https://gerrit.wikimedia.org/r/961745 (https://phabricator.wikimedia.org/T347565) [10:38:29] (CR) Majavah: [V: +1] openstack: rename openstack_controllers to memcached_nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) (owner: Majavah) [10:38:34] (CR) Effie Mouzeli: [C: +2] Update push-notifications to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/961078 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli) [10:39:05] (CR) Jbond: [C: +2] pki::root_ca: add new intermediate for syslog [puppet] - https://gerrit.wikimedia.org/r/961745 (https://phabricator.wikimedia.org/T347565) (owner: Jbond) [10:39:32] (CR) Btullis: [C: +1] Bump MW Page content change app version (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/960610 (https://phabricator.wikimedia.org/T344688) (owner: Aqu) [10:39:55] (Merged) jenkins-bot: Update push-notifications to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/961078 (https://phabricator.wikimedia.org/T300033) (owner: Effie Mouzeli) [10:40:25] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [10:40:34] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [10:40:38] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [10:40:44] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [10:40:52] !log 
jiji@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [10:40:58] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [10:42:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:43:08] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:43:36] (Nonwrite HTTP requests with primary DB connections alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [10:44:08] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:44:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:48:39] (Nonwrite HTTP requests with primary DB connections alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [10:48:40] (PS1) Jbond: pki: add syslog intermediate [puppet] - https://gerrit.wikimedia.org/r/961749 (https://phabricator.wikimedia.org/T347565) [10:49:00] (PS1) Hnowlan: rest-gateway: strictly order route definitions [deployment-charts] - https://gerrit.wikimedia.org/r/961751 [10:49:58] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43698/console" [puppet] - https://gerrit.wikimedia.org/r/961749 (https://phabricator.wikimedia.org/T347565) (owner: Jbond) [10:50:51] (CR) Giuseppe Lavagetto: [C: +1] modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - https://gerrit.wikimedia.org/r/961379 (owner: Elukey) 
[10:51:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] pki: add syslog intermediate [puppet] - 10https://gerrit.wikimedia.org/r/961749 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [10:51:29] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [10:53:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Cleanup remains of haproxy-on-cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/961443 (owner: 10Majavah) [10:53:54] (Nonwrite HTTP requests with primary DB connections alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DNonwrite+HTTP+requests+with+primary+DB+connections+alert [10:54:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack: magnum: Remove redundant transport_url setting [puppet] - 10https://gerrit.wikimedia.org/r/961726 (owner: 10Majavah) [10:54:24] (03PS2) 10Majavah: Set WRITE_BOTH for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961237 (https://phabricator.wikimedia.org/T242031) [10:54:46] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling reboot on A:maps-master-codfw [10:54:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.maps.roll-restart-reboot-master (exit_code=1) rolling reboot on A:maps-master-codfw [10:54:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Please merge!" 
[puppet] - 10https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [10:55:05] (03PS3) 10Jbond: rsyslog: switch the endpoints to use the PKI system [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) [10:55:07] (03PS5) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [10:55:09] (03PS7) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [10:55:11] (03PS3) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [10:55:13] (03PS3) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [10:55:46] (03CR) 10Majavah: [C: 03+2] Cleanup remains of haproxy-on-cloudcontrol [puppet] - 10https://gerrit.wikimedia.org/r/961443 (owner: 10Majavah) [10:55:55] (03CR) 10Majavah: [C: 03+2] openstack: magnum: Remove redundant transport_url setting [puppet] - 10https://gerrit.wikimedia.org/r/961726 (owner: 10Majavah) [10:56:05] (03CR) 10Majavah: [V: 03+1 C: 03+2] openstack: rename openstack_controllers to memcached_nodes [puppet] - 10https://gerrit.wikimedia.org/r/961727 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [10:56:58] (03PS1) 10Muehlenhoff: sre.maps.roll-restart-reboot-master: Fix base class [cookbooks] - 10https://gerrit.wikimedia.org/r/961753 [10:59:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961751 (owner: 10Hnowlan) [11:00:03] (03PS1) 10Kevin Bazira: ml-services: increase recommendation-api-ng uwsgi workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/961509 
(https://phabricator.wikimedia.org/T347475) [11:00:08] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: use static name in failover backup, increase concurrency [puppet] - 10https://gerrit.wikimedia.org/r/961694 (https://phabricator.wikimedia.org/T345590) (owner: 10Jelto) [11:00:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline, but no strong preference, feel free to merge." [software/bitu] - 10https://gerrit.wikimedia.org/r/961738 (owner: 10Slyngshede) [11:01:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:03:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/961702 (owner: 10Slyngshede) [11:04:55] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [11:06:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:06:35] (03CR) 10Jelto: [C: 03+2] gitlab failover: use puppet-managed backup script [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: 10Jelto) [11:07:28] (03CR) 10Muehlenhoff: [C: 03+2] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/961005 (owner: 10Muehlenhoff) [11:08:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:21] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [11:09:29] (03Merged) 10jenkins-bot: gitlab failover: use puppet-managed backup script [cookbooks] - 10https://gerrit.wikimedia.org/r/961418 (https://phabricator.wikimedia.org/T345590) (owner: 10Jelto) [11:09:37] !log cp4037 back in pool (T347192) [11:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:42] T347192: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 [11:13:26] (03CR) 10Btullis: [C: 03+1] druid: Bring druid1009.eqiad.wmnet into service [puppet] - 10https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [11:15:11] (03PS2) 10Muehlenhoff: Remove config option for challenge response auth [puppet] - 10https://gerrit.wikimedia.org/r/959896 [11:15:32] 10SRE, 10Bitu, 10Infrastructure-Foundations: SSH Key type expiry - https://phabricator.wikimedia.org/T347572 (10SLyngshede-WMF) [11:17:44] (03PS4) 10Jbond: rsyslog: update code to support cfssl and puppet [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) [11:17:46] (03PS8) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [11:17:48] (03PS4) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [11:17:50] (03PS6) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [11:17:52] (03PS4) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [11:17:54] 
(03PS1) 10Jbond: rsyslog::receiver: drop support for acme_name [puppet] - 10https://gerrit.wikimedia.org/r/961758 (https://phabricator.wikimedia.org/T347565) [11:17:56] (03PS1) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [11:19:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43699/console" [puppet] - 10https://gerrit.wikimedia.org/r/961758 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [11:19:40] (03CR) 10Jbond: "@kieth i have update this patch set with a bit of a refactor to:" [puppet] - 10https://gerrit.wikimedia.org/r/961758 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [11:21:33] (03PS1) 10Jelto: Revert "gitlab: swap replica records" [dns] - 10https://gerrit.wikimedia.org/r/961709 [11:21:54] (03CR) 10Jbond: [C: 04-2] "actually this is used in cloud will amend" [puppet] - 10https://gerrit.wikimedia.org/r/961758 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [11:22:02] (03PS1) 10Jelto: Revert "gitlab: change service_name on replica hosts" [puppet] - 10https://gerrit.wikimedia.org/r/961710 [11:22:36] (03PS2) 10Jelto: Revert "gitlab: swap replica records" [dns] - 10https://gerrit.wikimedia.org/r/961709 [11:23:40] (03CR) 10Marostegui: [C: 03+1] P:mariadb: drop special firewall rules from m5 servers [puppet] - 10https://gerrit.wikimedia.org/r/961736 (owner: 10Majavah) [11:24:05] (03CR) 10Marostegui: [C: 03+1] P:mariadb::ferm_wmcs: cleanup unused rules [puppet] - 10https://gerrit.wikimedia.org/r/961737 (owner: 10Majavah) [11:26:02] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [11:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:26:43] (03CR) 10EoghanGaffney: [C: 03+1] Revert "gitlab: change service_name on replica hosts" [puppet] - 10https://gerrit.wikimedia.org/r/961710 (owner: 10Jelto) [11:26:52] (03CR) 10EoghanGaffney: [C: 03+1] Revert "gitlab: swap replica records" [dns] - 10https://gerrit.wikimedia.org/r/961709 (owner: 10Jelto) [11:27:18] (03CR) 10Muehlenhoff: [C: 03+2] Remove config option for challenge response auth [puppet] - 10https://gerrit.wikimedia.org/r/959896 (owner: 10Muehlenhoff) [11:29:45] (03PS5) 10Jbond: rsyslog: update code to support cfssl and puppet [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) [11:29:47] (03PS2) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [11:29:49] (03PS9) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [11:29:51] (03PS5) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [11:29:53] (03PS7) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [11:29:55] (03PS5) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [11:30:35] 10SRE, 10Infrastructure-Foundations, 10Observability-Logging, 10Patch-For-Review, 10Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastructure - https://phabricator.wikimedia.org/T347565 (10Aklapper) [11:30:54] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [11:31:09] (03CR) 
10Jbond: [V: 03+1 C: 04-2] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43702/console" [puppet] - 10https://gerrit.wikimedia.org/r/961758 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [11:32:20] (03PS6) 10Jbond: rsyslog: update code to support cfssl and puppet [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) [11:32:22] (03PS3) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [11:32:25] (03PS10) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [11:32:27] (03PS6) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [11:32:29] (03PS8) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [11:32:31] (03PS6) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [11:32:40] (03Abandoned) 10Jbond: rsyslog::receiver: drop support for acme_name [puppet] - 10https://gerrit.wikimedia.org/r/961758 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [11:34:17] (03PS7) 10Jbond: rsyslog: update code to support cfssl and puppet [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) [11:35:04] (03PS8) 10Jbond: rsyslog: update code to support cfssl and puppet [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) [11:35:06] (03PS4) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 
(https://phabricator.wikimedia.org/T347565) [11:35:08] (03PS11) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [11:35:10] (03PS7) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [11:35:12] (03PS9) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [11:35:15] (03PS7) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [11:37:19] (03PS9) 10Jbond: rsyslog: update code to support cfssl and puppet [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) [11:37:21] (03PS5) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [11:37:23] (03PS12) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [11:37:25] (03PS8) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [11:37:27] (03PS10) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [11:37:29] (03PS8) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [11:38:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43704/console" [puppet] - 
10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [11:41:12] (03PS1) 10Hnowlan: k8s, cassandra: add entries for {edit,editor,page}-analytics [puppet] - 10https://gerrit.wikimedia.org/r/961774 (https://phabricator.wikimedia.org/T336391) [11:41:26] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: strictly order route definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/961751 (owner: 10Hnowlan) [11:42:23] (03Merged) 10jenkins-bot: rest-gateway: strictly order route definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/961751 (owner: 10Hnowlan) [11:42:38] (03CR) 10Jbond: [V: 03+1] rsyslog: update code to support cfssl and puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [11:43:56] (03PS6) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [11:43:58] (03PS13) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [11:44:00] (03PS9) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [11:44:02] (03PS11) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [11:44:04] (03PS9) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [11:45:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43706/console" [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [11:45:11] (03PS3) 
10Jgiannelos: wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961696 [11:46:56] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:56] (03PS7) 10Jbond: syslog::centralserver: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) [11:46:58] (03PS14) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [11:47:00] (03PS10) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [11:47:02] (03PS12) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [11:47:04] (03PS10) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [11:47:23] (03PS2) 10Hnowlan: k8s, cassandra: add entries for {edit,editor,page}-analytics [puppet] - 10https://gerrit.wikimedia.org/r/961774 (https://phabricator.wikimedia.org/T336391) [11:47:48] (03PS1) 10Muehlenhoff: Set KbdInteractiveAuthentication/ChallengeResponseAuthentication per OS [puppet] - 10https://gerrit.wikimedia.org/r/961775 [11:48:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43707/console" [puppet] - 10https://gerrit.wikimedia.org/r/961759 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [11:50:50] (03PS3) 10Hnowlan: k8s, cassandra: add entries for {edit,editor,page}-analytics [puppet] - 10https://gerrit.wikimedia.org/r/961774 
(https://phabricator.wikimedia.org/T336391) [11:56:00] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [11:56:13] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [11:56:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T343198)', diff saved to https://phabricator.wikimedia.org/P52717 and previous config saved to /var/cache/conftool/dbconfig/20230928-115619-arnaudb.json [11:56:25] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [11:57:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/961775 (owner: 10Muehlenhoff) [11:57:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961775 (owner: 10Muehlenhoff) [11:57:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43709/console" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [11:58:53] (03CR) 10JMeybohm: [C: 03+1] Upgrade mesh and ingress modules for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/961730 (https://phabricator.wikimedia.org/T347344) (owner: 10Elukey) [11:59:26] (03CR) 10Brouberol: [C: 03+2] Disable WMDE misc jobs on stat1007 [puppet] - 10https://gerrit.wikimedia.org/r/961699 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [11:59:33] (03PS15) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [11:59:35] (03PS11) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [11:59:37] (03PS13) 
10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [11:59:39] (03PS11) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T1200) [12:00:14] (03CR) 10Ilias Sarantopoulos: [C: 03+1] Upgrade mesh and ingress modules for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/961730 (https://phabricator.wikimedia.org/T347344) (owner: 10Elukey) [12:00:55] (03CR) 10Muehlenhoff: [C: 03+2] sre.maps.roll-restart-reboot-master: Fix base class [cookbooks] - 10https://gerrit.wikimedia.org/r/961753 (owner: 10Muehlenhoff) [12:01:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43710/console" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:02:14] (03PS3) 10JMeybohm: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [12:03:47] (03PS16) 10Jbond: profile::rsyslog::syslog: refactor base::remote_syslog to a profile [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) [12:03:49] (03PS12) 10Jbond: profile::syslog::remote: create variables for cert and key [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) [12:03:51] (03PS14) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [12:03:53] (03PS12) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 
(https://phabricator.wikimedia.org/T347565) [12:04:11] (03PS1) 10Arturo Borrero Gonzalez: cloudservices[2004,2005]-dev: refresh their counterpart FQDN [puppet] - 10https://gerrit.wikimedia.org/r/961780 (https://phabricator.wikimedia.org/T347555) [12:05:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43711/console" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:06:24] (03CR) 10AOkoth: clamav: disable ConcurrentDatabaseReload (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) (owner: 10AOkoth) [12:06:34] (03CR) 10Jbond: [V: 03+1] "Please review, This refactor gives us more flexibility to reuse some variables" [puppet] - 10https://gerrit.wikimedia.org/r/961735 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:06:43] (03PS1) 10Hnowlan: admin: add namespaces for remaining aqs2 services, add config for page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/961782 (https://phabricator.wikimedia.org/T336391) [12:07:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43713/console" [puppet] - 10https://gerrit.wikimedia.org/r/961740 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:08:28] (03PS15) 10Jbond: syslog::centralserver: use mTLS for blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/961703 (https://phabricator.wikimedia.org/T347565) [12:08:30] (03PS13) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:10:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43714/console" [puppet] - 10https://gerrit.wikimedia.org/r/961703 
(https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:10:06] (03PS14) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:12:23] (03CR) 10FNegri: [C: 03+1] "LGTM, let's try!" [puppet] - 10https://gerrit.wikimedia.org/r/961780 (https://phabricator.wikimedia.org/T347555) (owner: 10Arturo Borrero Gonzalez) [12:12:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43715/console" [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:12:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudservices[2004,2005]-dev: refresh their counterpart FQDN [puppet] - 10https://gerrit.wikimedia.org/r/961780 (https://phabricator.wikimedia.org/T347555) (owner: 10Arturo Borrero Gonzalez) [12:12:44] (03PS1) 10Majavah: P:cloudceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/961783 [12:13:14] (03PS4) 10JMeybohm: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [12:13:56] (03CR) 10Muehlenhoff: [C: 04-1] cloudelastic: new partman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [12:13:58] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43716/console" [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah) [12:14:15] (03PS15) 10Jbond: profile::syslog::remote: Add support for pki [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) [12:15:06] (03CR) 10CI reject: [V: 04-1] P:cloudceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah) [12:15:12] 
10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10cmooney) >>! In T342455#9120329, @Jclark-ctr wrote: > cloudnet1007 E 4. U 39 Port 10 Cableid 2303... [12:15:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43717/console" [puppet] - 10https://gerrit.wikimedia.org/r/961741 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:16:05] (03PS2) 10Majavah: P:cloudceph: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/961783 [12:17:54] (03PS5) 10JMeybohm: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [12:18:51] (03PS1) 10Jbond: sretest: switch sretest to cfssl for rsyslog mTLS [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) [12:19:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43718/console" [puppet] - 10https://gerrit.wikimedia.org/r/961783 (owner: 10Majavah) [12:20:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43719/console" [puppet] - 10https://gerrit.wikimedia.org/r/961785 (https://phabricator.wikimedia.org/T347565) (owner: 10Jbond) [12:24:39] (03PS3) 10JMeybohm: eventgate-logging-external: Remove CPU limit for tls-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [12:25:51] (03PS1) 10Hnowlan: Add druid-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961786 (https://phabricator.wikimedia.org/T336385) [12:25:53] (03PS1) 10Majavah: toolsdb_replica_cnf: Remove 
firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/961787 [12:27:05] (03CR) 10JMeybohm: [C: 03+1] dragonfly::supernode: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960603 (owner: 10Muehlenhoff) [12:29:59] (03CR) 10JMeybohm: [C: 04-1] modules: add CORS policy to Istio Ingress' virtual services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379 (owner: 10Elukey) [12:31:12] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling reboot on A:maps-master-codfw [12:32:36] (03CR) 10JMeybohm: [C: 03+1] eventgate-logging-external: Remove CPU limit for tls-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/959184 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [12:32:46] (03CR) 10JMeybohm: [C: 03+1] eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [12:33:50] (03PS4) 10Jgiannelos: wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961696 [12:34:25] (03CR) 10JMeybohm: [C: 03+1] "I've updated this patch to the latest module versions and to include the additional SANs the eventgate deployments are using currently (fr" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [12:34:49] (03CR) 10JMeybohm: [C: 03+1] modules: duplicate ingress:istio_1.0.2 to 1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961378 (owner: 10Elukey) [12:36:32] (03PS5) 10Jgiannelos: wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961696 [12:38:05] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:04] (03PS5) 10Elukey: modules: add 
CORS policy to Istio Ingress' virtual services [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379 [12:39:06] (03PS2) 10Elukey: Upgrade mesh and ingress modules for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/961730 (https://phabricator.wikimedia.org/T347344) [12:39:08] (03PS2) 10Elukey: ml-services: enable base CORS headers policy for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/961731 (https://phabricator.wikimedia.org/T347344) [12:39:21] (03CR) 10Elukey: modules: add CORS policy to Istio Ingress' virtual services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379 (owner: 10Elukey) [12:40:07] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:34] acked [12:40:35] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:40:37] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:40:48] (03CR) 10JMeybohm: [C: 03+1] modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379 (owner: 10Elukey) [12:40:55] is it thanos on eqiad? 
[12:41:12] graph doesn't load [12:41:25] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:41:26] yeah, thanos-query.svc.eqiad.wmnet [12:41:43] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:41:47] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:41:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot-master (exit_code=0) rolling reboot on A:maps-master-codfw [12:42:43] thanos-query on titan1001 seems to fail to contact prometheus2xx [12:42:44] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:43:47] jynus, topranks - maybe a quick service restart on one node could be an easy one to see if it is temporary? 
[12:44:03] (03CR) 10JMeybohm: [C: 04-1] P:mw::deployment::server: Don't alert for train-presync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/953200 (owner: 10Clément Goubert) [12:44:03] we can try [12:44:11] elukey: thanks for the suggestion - makes sense [12:44:20] but I wonder if it is traffic-caused [12:44:24] !log restart thanos-query on titan1001 [12:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:07] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:46:17] on titan1002 I see similar errors, but I can telnet the endpoints [12:46:41] I see it resolved, but not sure if I see the errors and latency down [12:46:54] didn't restart on 1002 yet [12:47:06] so it is still in half broken state [12:47:12] shall I proceed? [12:47:17] go on [12:47:29] !log restart thanos-query on titan1002 [12:47:30] but shouldn't that at least cause less errors? [12:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:08] latency looks better on last stats on the graph, 132ms [12:48:15] I see the dashboard now [12:48:30] and loss gone [12:48:34] latency seems down [12:49:32] https://usercontent.irccloud-cdn.com/file/XpiZotDc/image.png [12:49:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10aborrero) Could we please make sure we have the `-dev` sufix in them? Otherwise we will need to rename them... [12:49:45] yes, I see it now [12:50:20] also on: https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend [12:50:39] should we search for a root cause on logs? 
[12:51:37] of course the graph wouldn't work! I believe some use thanos itself, I was silly [12:51:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10cmooney) >>! In T342455#9206640, @aborrero wrote: > Could we please make sure we have the `-dev` sufix in th... [12:52:00] https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=titan&var-origin_instance=All&var-destination=All [12:52:08] this is interesting (local envoy on titan) [12:52:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:53:10] afaics the thanos-query processes started to slow down for $reasons, until we restarted them [12:53:16] maybe we can loop in observability [12:53:29] (03CR) 10Elukey: [C: 03+2] modules: duplicate ingress:istio_1.0.2 to 1.0.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/961378 (owner: 10Elukey) [12:53:38] (03PS6) 10Elukey: modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379 [12:53:42] thanks for the help, elukey [12:53:45] np! [12:53:51] I wasn't familiar with the titan cluster [12:54:21] (03CR) 10Klausman: "Will not submit until I have figured out the right silence for ProbeDown." 
[puppet] - 10https://gerrit.wikimedia.org/r/961791 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [12:54:39] (03CR) 10Elukey: [C: 03+2] modules: add CORS policy to Istio Ingress' virtual services [deployment-charts] - 10https://gerrit.wikimedia.org/r/961379 (owner: 10Elukey) [12:54:47] (03PS3) 10Elukey: Upgrade mesh and ingress modules for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/961730 (https://phabricator.wikimedia.org/T347344) [12:55:05] I am trying to understand what high level services would have been affected: grafana? [12:55:52] I think so yes [12:56:10] (03CR) 10Elukey: [C: 03+2] Upgrade mesh and ingress modules for python-webapp [deployment-charts] - 10https://gerrit.wikimedia.org/r/961730 (https://phabricator.wikimedia.org/T347344) (owner: 10Elukey) [12:56:18] (03CR) 10Elukey: [C: 03+2] ml-services: enable base CORS headers policy for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/961731 (https://phabricator.wikimedia.org/T347344) (owner: 10Elukey) [12:56:23] (03PS3) 10Elukey: ml-services: enable base CORS headers policy for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/961731 (https://phabricator.wikimedia.org/T347344) [12:57:39] I will also file a patch to update the runbook, there is actually WIP useful documentation, but it is not linked correctly [12:58:35] g*dog is off today, so will ask someone else [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T1300). [13:00:06] houseofm, hubaishan, and Ammar: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:52] I can deploy today [13:01:33] But it seems None of the requesting developers are around? 
[13:01:55] looks like it [13:02:02] ah, there’s one :) [13:02:05] o/ [13:03:02] Hi HouseOfM and Ammar [13:03:07] !log elukey@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [13:03:11] topranks: I added https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 [13:03:46] !log elukey@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [13:04:11] urbanecm Hi, I'm here [13:04:12] will ping obs to see if they can reference titan specifically [13:04:13] !log elukey@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [13:04:16] hello [13:04:25] urbanecm: ping me when done please? [13:04:26] jynus: good stuff [13:04:29] (03CR) 10Urbanecm: [C: 03+2] Enable Campaigns email on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) (owner: 10Mhorsey) [13:04:31] taavi: sure [13:04:54] yeah my instinct was to look on thanos-fe* nodes so good to get titan included [13:05:15] (03Merged) 10jenkins-bot: Enable Campaigns email on test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960559 (https://phabricator.wikimedia.org/T347065) (owner: 10Mhorsey) [13:06:08] topranks my guess it is a very recent change, as I see few references to it, but will just mention it on their channel [13:06:57] (03PS1) 10Muehlenhoff: Only install ppolicy.schema with OpenLDAP < 2.5 [puppet] - 10https://gerrit.wikimedia.org/r/961796 (https://phabricator.wikimedia.org/T331699) [13:06:59] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:960559|Enable Campaigns email on test wiki (T347065)]] [13:07:07] T347065: Release the email participants feature - https://phabricator.wikimedia.org/T347065 [13:07:47] (03CR) 10Klausman: Services: Remove pybal/LVS entry for ORES (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961791 (https://phabricator.wikimedia.org/T347278) (owner: 
10Klausman) [13:07:56] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-text_ulsfo [13:07:59] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 7 hosts matching query A:cp-upload_ulsfo and not P{cp4052*} [13:08:27] !log urbanecm@deploy2002 urbanecm and mhorsey: Backport for [[gerrit:960559|Enable Campaigns email on test wiki (T347065)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:47] HouseOfM: can you test your patch at mwdebug1001, please? [13:09:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961796 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [13:09:18] (03CR) 10Muehlenhoff: On Bookworm ship ppolicy.schema via Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961066 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [13:09:54] (03CR) 10Klausman: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/961797 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:11:31] HouseOfM: hi, how's the testing going please? [13:11:52] (03CR) 10JMeybohm: [C: 04-1] modules: add base.statsd (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [13:11:59] (03PS2) 10Klausman: wmnet: Remove ORES discovery entry [dns] - 10https://gerrit.wikimedia.org/r/961797 (https://phabricator.wikimedia.org/T347278) [13:13:10] Looks good, sorry [13:13:31] no worries, i just wasn't sure if you saw my message. thanks for confirming, proceeding. 
[13:13:32] !log urbanecm@deploy2002 urbanecm and mhorsey: Continuing with sync [13:13:47] (03PS2) 10Urbanecm: wikifunctionswiki: Disable NearbyPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961742 (https://phabricator.wikimedia.org/T345459) (owner: 10Ammarpad) [13:14:19] hubaishan: hi, are you around for the deployment of your patch? [13:14:31] yes [13:15:07] okay, i'll ping you once i get to your patch :) [13:15:54] (03PS1) 10CDanis: hiera: Test HAProxy bw limits per URL on cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/961800 (https://phabricator.wikimedia.org/T317799) [13:16:17] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961800 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [13:16:41] (03CR) 10Urbanecm: [C: 03+2] wikifunctionswiki: Disable NearbyPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961742 (https://phabricator.wikimedia.org/T345459) (owner: 10Ammarpad) [13:17:23] (03Merged) 10jenkins-bot: wikifunctionswiki: Disable NearbyPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961742 (https://phabricator.wikimedia.org/T345459) (owner: 10Ammarpad) [13:18:33] (03Abandoned) 10Klausman: wmnet: Remove ORES discovery entry [dns] - 10https://gerrit.wikimedia.org/r/961797 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:19:31] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:960559|Enable Campaigns email on test wiki (T347065)]] (duration: 12m 31s) [13:19:36] T347065: Release the email participants feature - https://phabricator.wikimedia.org/T347065 [13:19:39] HouseOfM: your patch is live now :) [13:20:00] (03CR) 10Klausman: "This change is ready for review." 
[dns] - 10https://gerrit.wikimedia.org/r/961802 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:20:01] Ammar: continuing with yours now [13:20:16] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:961742|wikifunctionswiki: Disable NearbyPages (T345459)]] [13:20:20] (03PS2) 10Klausman: wmnet: Remove ORES discovery entry [dns] - 10https://gerrit.wikimedia.org/r/961802 (https://phabricator.wikimedia.org/T347278) [13:20:23] T345459: Remove Nearby from Wikifunctions - https://phabricator.wikimedia.org/T345459 [13:21:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:21:47] !log urbanecm@deploy2002 ammarpad and urbanecm: Backport for [[gerrit:961742|wikifunctionswiki: Disable NearbyPages (T345459)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:21:49] (03PS1) 10Anzx: update sawikiquote logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961803 [13:21:55] Ammar: can you test your patch at mwdebug1001, please? 
[13:22:21] yes, doing [13:22:26] ty [13:24:17] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot-master rolling reboot on A:maps-master-eqiad [13:25:15] urbanecm looks ok, the special page disappeared and delisted on special:version [13:25:21] great, proceeding [13:25:22] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:25:22] !log urbanecm@deploy2002 ammarpad and urbanecm: Continuing with sync [13:25:25] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:25:28] (03PS4) 10Urbanecm: Enable WikiLove on arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957842 (https://phabricator.wikimedia.org/T346391) (owner: 10Zoranzoki21) [13:25:32] (03CR) 10Urbanecm: [C: 03+2] Enable WikiLove on arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957842 (https://phabricator.wikimedia.org/T346391) (owner: 10Zoranzoki21) [13:26:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T343198)', diff saved to https://phabricator.wikimedia.org/P52718 and previous config saved to /var/cache/conftool/dbconfig/20230928-132559-arnaudb.json [13:26:05] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [13:26:13] (03Merged) 10jenkins-bot: Enable WikiLove on arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957842 (https://phabricator.wikimedia.org/T346391) (owner: 10Zoranzoki21) [13:26:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:28:42] !log mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=arwikisource wikilove # T346391 [13:28:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:47] T346391: Install wikilove on arwikisource - https://phabricator.wikimedia.org/T346391 [13:31:24] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:961742|wikifunctionswiki: Disable NearbyPages (T345459)]] (duration: 11m 07s) [13:31:29] Ammar: and live :) [13:31:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot-master (exit_code=0) rolling reboot on A:maps-master-eqiad [13:31:35] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:31:37] T345459: Remove Nearby from Wikifunctions - https://phabricator.wikimedia.org/T345459 [13:31:55] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:957842|Enable WikiLove on arwikisource (T346391)]] [13:32:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] P:mediawiki::maintenance: CampaignEvents periodic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [13:32:41] (03CR) 10Muehlenhoff: [V: 03+1 C: 03+2] dragonfly::supernode: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960603 (owner: 10Muehlenhoff) [13:33:09] (03CR) 10Ssingh: [C: 03+1] wmnet: Remove ORES discovery entry [dns] - 10https://gerrit.wikimedia.org/r/961802 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:33:20] !log urbanecm@deploy2002 zoranzoki21 and urbanecm: Backport for [[gerrit:957842|Enable WikiLove on arwikisource (T346391)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:33:41] (03PS2) 10Anzx: update sawikiquote logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961803 [13:33:49] hubaishan: your patch is at mwdebug1001 now, can you test it there please? 
[13:34:56] (03CR) 10Klausman: [C: 03+2] wmnet: Remove ORES discovery entry [dns] - 10https://gerrit.wikimedia.org/r/961802 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:36:05] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:28] 10SRE, 10ops-codfw, 10decommission-hardware: decommission frauth2001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340153 (10Jhancock.wm) [13:36:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:37:06] urbanecm it is OK. [13:37:06] (03PS3) 10Anzx: update sawikiquote logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961803 [13:37:09] (03CR) 10Klausman: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/961799 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:37:11] great, syncing [13:37:13] !log urbanecm@deploy2002 zoranzoki21 and urbanecm: Continuing with sync [13:37:18] (03PS2) 10Klausman: services: Move ORES to state lvs_setup for turndown [puppet] - 10https://gerrit.wikimedia.org/r/961799 (https://phabricator.wikimedia.org/T347278) [13:38:41] (03CR) 10Ssingh: [C: 03+1] services: Move ORES to state lvs_setup for turndown [puppet] - 10https://gerrit.wikimedia.org/r/961799 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:39:01] (03CR) 10Klausman: [C: 03+2] services: Move ORES to state lvs_setup for turndown [puppet] - 10https://gerrit.wikimedia.org/r/961799 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:40:29] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:29] (03PS6) 
10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) [13:41:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P52719 and previous config saved to /var/cache/conftool/dbconfig/20230928-134105-arnaudb.json [13:41:17] (03PS3) 10Muehlenhoff: scap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945749 [13:41:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:42:07] (03CR) 10JMeybohm: [C: 03+1] P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [13:42:43] (03CR) 10CI reject: [V: 04-1] P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [13:43:06] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:957842|Enable WikiLove on arwikisource (T346391)]] (duration: 11m 10s) [13:43:11] T346391: Install wikilove on arwikisource - https://phabricator.wikimedia.org/T346391 [13:43:14] hubaishan: and should be live [13:43:19] taavi: all patches deployed, over to you :) [13:43:25] (03PS7) 10Clément Goubert: P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) [13:43:27] thx [13:44:03] Thank you. 
[13:44:15] (03PS3) 10Majavah: Set WRITE_BOTH for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961237 (https://phabricator.wikimedia.org/T242031) [13:44:28] np [13:44:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961237 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [13:45:19] 10SRE, 10ops-codfw, 10decommission-hardware: decommission frauth2001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340153 (10Jhancock.wm) 05Open→03Resolved [13:45:31] (03Merged) 10jenkins-bot: Set WRITE_BOTH for CA wikis on OATHAuth multiple devices [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961237 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [13:45:58] !log taavi@deploy2002 Started scap: Backport for [[gerrit:961237|Set WRITE_BOTH for CA wikis on OATHAuth multiple devices (T242031)]] [13:46:10] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [13:46:17] (03CR) 10Clément Goubert: [C: 03+2] P:mediawiki::periodic_job: Add splay parameter [puppet] - 10https://gerrit.wikimedia.org/r/954249 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [13:47:03] (03PS4) 10Anzx: update sawikiquote logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961803 [13:47:08] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:47:11] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:47:19] !log taavi@deploy2002 taavi: Backport for [[gerrit:961237|Set WRITE_BOTH for CA wikis on OATHAuth multiple devices (T242031)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:47:26] (03PS2) 10CDanis: hiera: Test HAProxy bw limits per URL on cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/961800 
(https://phabricator.wikimedia.org/T317799) [13:47:44] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43720/console" [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [13:49:04] (03CR) 10Klausman: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/961805 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:49:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [13:49:06] (03CR) 10CI reject: [V: 04-1] services/lvs: Turn down ORES LVS setup [puppet] - 10https://gerrit.wikimedia.org/r/961805 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:49:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye [13:49:42] (03PS2) 10Klausman: services/lvs: Turn down ORES LVS setup [puppet] - 10https://gerrit.wikimedia.org/r/961805 (https://phabricator.wikimedia.org/T347278) [13:49:54] urbanecm yes, thank you [13:50:55] !log taavi@deploy2002 taavi: Continuing with sync [13:51:00] (03PS1) 10Bking: flink-app: increment chart version number [deployment-charts] - 10https://gerrit.wikimedia.org/r/961806 (https://phabricator.wikimedia.org/T347521) [13:52:06] !log installing flac security updates [13:52:07] !log bking@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:52:09] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:34] (03CR) 10Ssingh: [C: 03+1] services/lvs: Turn down ORES LVS setup [puppet] - 
10https://gerrit.wikimedia.org/r/961805 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:52:44] !log bking@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:52:53] (03CR) 10Klausman: [C: 03+2] services/lvs: Turn down ORES LVS setup [puppet] - 10https://gerrit.wikimedia.org/r/961805 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [13:54:13] You're welcome [13:54:31] (03CR) 10Vgutierrez: [C: 03+1] hiera: Test HAProxy bw limits per URL on cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/961800 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [13:55:05] (03CR) 10Herron: rsyslog: update code to support cfssl and puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:56:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P52720 and previous config saved to /var/cache/conftool/dbconfig/20230928-135612-arnaudb.json [13:56:31] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:00] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:961237|Set WRITE_BOTH for CA wikis on OATHAuth multiple devices (T242031)]] (duration: 11m 02s) [13:57:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:57:36] (03PS1) 10Slyngshede: Clearily key management GUI [software/bitu] - 10https://gerrit.wikimedia.org/r/961807 [13:58:37] (03Abandoned) 10Bking: flink-app: increment chart version number [deployment-charts] - 
10https://gerrit.wikimedia.org/r/961806 (https://phabricator.wikimedia.org/T347521) (owner: 10Bking) [14:00:17] urbanecm: Thank you, sorry I didn't respond earlier, multitasking is hard! [14:00:51] !log restarted pybal on lvs1020 and lvs2014 (LVS low-traffic backups) for T347278 (ORES turndown) [14:00:53] No problem [14:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:10] T347278: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 [14:01:22] !log installing gsl security updates [14:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:35] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:01:37] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:01:45] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:02:15] !log depooling cp5030 for haproxy upgrade & testing T317799 [14:02:17] * taavi done [14:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:25] (03CR) 10CDanis: [C: 03+2] hiera: Test HAProxy bw limits per URL on cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/961800 (https://phabricator.wikimedia.org/T317799) (owner: 10CDanis) [14:02:29] T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 [14:02:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1228.eqiad.wmnet with OS bullseye [14:02:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[14:02:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1228.eqiad.wmnet with OS bullseye [14:02:43] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1228.eqiad.wmnet with OS bullseye [14:02:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1228.eqiad.wmnet with OS bullseye executed with errors: - db1228 (**FAIL**)... [14:03:21] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:05:22] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:07:12] (03PS1) 10Jbond: docker:registry::web: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/961808 (https://phabricator.wikimedia.org/T341373) [14:08:10] (03CR) 10JMeybohm: [V: 03+1 C: 04-1] "Please run PCC on puppet changes. At least the zookeeper_hosts structure looks broken." 
[puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [14:08:48] !log repooling cp5030 after haproxy upgrade & config deploy T317799 [14:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:54] T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 [14:11:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T343198)', diff saved to https://phabricator.wikimedia.org/P52722 and previous config saved to /var/cache/conftool/dbconfig/20230928-141118-arnaudb.json [14:11:21] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:11:27] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [14:11:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:11:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T343198)', diff saved to https://phabricator.wikimedia.org/P52723 and previous config saved to /var/cache/conftool/dbconfig/20230928-141140-arnaudb.json [14:11:56] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/961811 [14:12:14] (03PS2) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/961811 [14:13:21] !log restarting pybal on lvs1019 and lvs2013 (LVS low-traffic actives) for T347278 (ORES turndown) [14:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:29] T347278: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 [14:13:47] (03CR) 10Majavah: [C: 03+2] P:mariadb: drop special firewall rules from m5 servers [puppet] - 10https://gerrit.wikimedia.org/r/961736 (owner: 10Majavah) [14:13:54] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:mariadb::ferm_wmcs: cleanup unused rules [puppet] - 
10https://gerrit.wikimedia.org/r/961737 (owner: 10Majavah) [14:14:03] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:41] (03CR) 10Hnowlan: [C: 03+1] wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961696 (owner: 10Jgiannelos) [14:14:52] (03PS1) 10Ssingh: install_server: replace ntp.$site with anycasted ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/961812 (https://phabricator.wikimedia.org/T347054) [14:15:24] (03CR) 10CI reject: [V: 04-1] install_server: replace ntp.$site with anycasted ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/961812 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:16:15] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43721/console" [puppet] - 10https://gerrit.wikimedia.org/r/961812 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:16:58] (03PS2) 10Ssingh: install_server: replace ntp.$site with anycasted ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/961812 (https://phabricator.wikimedia.org/T347054) [14:17:07] (03PS2) 10Jbond: docker:registry::web: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/961808 (https://phabricator.wikimedia.org/T341373) [14:17:09] (03PS1) 10Jbond: docker::registry::web: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/961814 (https://phabricator.wikimedia.org/T340741) [14:17:31] (03PS1) 10Majavah: hieradata: fix reference to certificate name on cloudlb codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/961815 [14:17:41] (03CR) 10CI reject: [V: 04-1] docker::registry::web: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/961814 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:18:23] PROBLEM - Check 
systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] hieradata: fix reference to certificate name on cloudlb codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/961815 (owner: 10Majavah) [14:18:46] (03CR) 10Majavah: [C: 03+2] hieradata: fix reference to certificate name on cloudlb codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/961815 (owner: 10Majavah) [14:19:43] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:21:35] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:24:13] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:46] (03PS1) 10Ssingh: dnsbox: add ntp.anycast.wmnet as the anycasted NTP address [puppet] - 10https://gerrit.wikimedia.org/r/961818 (https://phabricator.wikimedia.org/T347054) [14:26:11] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/961819 [14:26:27] (03CR) 10AOkoth: [C: 03+1] Revert "gitlab: swap replica records" [dns] - 10https://gerrit.wikimedia.org/r/961709 (owner: 10Jelto) [14:26:56] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43722/console" [puppet] - 10https://gerrit.wikimedia.org/r/961818 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:29:23] (03PS2) 10Klausman: Services: Remove pybal/LVS entry for ORES [puppet] - 10https://gerrit.wikimedia.org/r/961791 (https://phabricator.wikimedia.org/T347278) [14:29:51] (03CR) 10Clément Goubert: [C: 03+2] P:mediawiki::maintenance: CampaignEvents periodic [puppet] - 
10https://gerrit.wikimedia.org/r/954225 (https://phabricator.wikimedia.org/T339984) (owner: 10Clément Goubert) [14:31:28] (03CR) 10SBassett: Allow FundraiseUp scripts in Donatewiki CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg) [14:31:33] (03CR) 10Ssingh: "See also references to ores in conftool-data/node/{eqiad,codfw}.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/961791 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [14:31:50] (03PS2) 10Jbond: docker::registry::web: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/961814 (https://phabricator.wikimedia.org/T340741) [14:32:25] (03PS3) 10Klausman: Services: Remove pybal/LVS entry for ORES [puppet] - 10https://gerrit.wikimedia.org/r/961791 (https://phabricator.wikimedia.org/T347278) [14:32:46] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1016.eqiad.wmnet with reason: jnl compression [14:32:59] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1016.eqiad.wmnet with reason: jnl compression [14:33:29] (03PS1) 10Jbond: docker_registry_ha::web: add docker_registry_ha [puppet] - 10https://gerrit.wikimedia.org/r/961822 (https://phabricator.wikimedia.org/T340741) [14:33:31] (03PS1) 10Jbond: docker_registry_ha::web: drop support for using puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/961823 (https://phabricator.wikimedia.org/T340741) [14:33:33] (03PS4) 10Klausman: Services: Remove pybal/LVS entry for ORES [puppet] - 10https://gerrit.wikimedia.org/r/961791 (https://phabricator.wikimedia.org/T347278) [14:33:44] (03CR) 10Klausman: Services: Remove pybal/LVS entry for ORES (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961791 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [14:34:27] (03CR) 10Ssingh: [C: 03+1] Services: Remove pybal/LVS entry for ORES [puppet] - 
10https://gerrit.wikimedia.org/r/961791 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [14:34:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43724/console" [puppet] - 10https://gerrit.wikimedia.org/r/961823 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:34:43] (03CR) 10Klausman: [C: 03+2] Services: Remove pybal/LVS entry for ORES [puppet] - 10https://gerrit.wikimedia.org/r/961791 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [14:35:27] !log installing ghostscript security updates [14:35:28] (03PS3) 10Jbond: docker::registry::web: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/961814 (https://phabricator.wikimedia.org/T340741) [14:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:19] (03CR) 10Ssingh: "Depends on I0cadd24bf8ec759ba31e482de79a69b94b860af9." [puppet] - 10https://gerrit.wikimedia.org/r/961812 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:36:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43725/console" [puppet] - 10https://gerrit.wikimedia.org/r/961814 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:36:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff) [14:36:54] (03CR) 10Jbond: [C: 03+2] docker_registry_ha::web: add docker_registry_ha [puppet] - 10https://gerrit.wikimedia.org/r/961822 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:37:03] (03CR) 10Jbond: [C: 03+2] docker:registry::web: Add documentation [puppet] - 10https://gerrit.wikimedia.org/r/961808 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [14:37:26] (03PS4) 10Jbond: docker::registry::web: remove unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/961814 
(https://phabricator.wikimedia.org/T340741) [14:37:35] (03PS2) 10Jbond: docker_registry_ha::web: drop support for using puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/961823 (https://phabricator.wikimedia.org/T340741) [14:38:07] klausman: I think your last ORES removal patch broke something [14:38:14] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Could not find service ores in service::catalog (file: /etc/puppet/modules/profile/manifests/services_proxy/envoy.pp, line: 41, column: 13) on node mwmaint2002.codfw.wmnet [14:38:47] Yes, missed a role to remove from ORES machines, has since been fixed and seen a successful run of r-p-a [14:38:48] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:52] (03PS4) 10Muehlenhoff: scap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945749 [14:39:55] klausman: No it's still broken on mwmaint [14:40:07] dangit, looking [14:40:12] profile::services_proxy::envoy::enabled_listeners [14:40:21] if I had to guess [14:41:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff) [14:41:46] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_ores.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:43:02] (03CR) 10Klausman: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/961825 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [14:43:15] ^^^ addresses it [14:43:22] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [14:43:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119', diff saved to https://phabricator.wikimedia.org/P52725 and previous config saved to /var/cache/conftool/dbconfig/20230928-144338-root.json [14:45:24] klausman: the error that c*laime pointed out is on mwmaint1002 [14:45:30] outside the ores nodes [14:45:40] ah, I missed that [14:46:34] there is a list of default listeners for mw installations in envoy.yaml [14:46:36] klausman: yeah, hieradata/common/profile/services_proxy/envoy.yaml [14:46:46] (ConfdResourceFailed) firing: (6) confd resource _srv_config-master_pybal_codfw_ores.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:46:49] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:46:51] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:47:43] In the profile::services_proxy::envoy::enabled_listeners section at least [14:48:20] Don't know about the discovery service on port 6010 [14:48:46] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:19] !log bking@wdqs1016 shutting down services to compress a 1.2 TB jnl file [14:49:22] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:38] claime: I think we might remove that as well, since the whole service is going away [14:49:49] ack [14:50:04] updating patch [14:50:06] (03PS2) 10Klausman: ORES: remove profile::services_proxy::envoy::enabled_listeners role [puppet] - 10https://gerrit.wikimedia.org/r/961825 (https://phabricator.wikimedia.org/T347278) [14:50:45] (03PS3) 10Klausman: ORES: remove ORES from Envoy listeners list [puppet] - 10https://gerrit.wikimedia.org/r/961825 (https://phabricator.wikimedia.org/T347278) [14:55:51] claime: elukey ^^^ if either of you can +1 that, I'll merge it. [14:56:16] (03CR) 10Clément Goubert: [C: 03+1] ORES: remove ORES from Envoy listeners list [puppet] - 10https://gerrit.wikimedia.org/r/961825 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [14:56:22] thank you very much [14:56:27] you're welcome :) [14:56:32] (03CR) 10Klausman: [C: 03+2] ORES: remove ORES from Envoy listeners list [puppet] - 10https://gerrit.wikimedia.org/r/961825 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [14:57:25] klausman: there is also https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DConfdResourceFailed [14:57:25] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [14:57:29] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:46] (03PS1) 10Jbond: mariadb: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/961829 (https://phabricator.wikimedia.org/T340741) [14:58:34] checking [14:59:31] (03CR) 10Jbond: "happy to split this if needs be" [puppet] - 10https://gerrit.wikimedia.org/r/961829 
(https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [15:00:08] it is going away, maybe puppet needs to run [15:00:34] running puppet on config-master2001 [15:00:35] (03CR) 10Muehlenhoff: [C: 03+2] webperf: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/960036 (owner: 10Muehlenhoff) [15:01:03] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [15:01:22] yep should recover now [15:01:34] Confirm it fixes the issue on mwmaint [15:01:47] (ConfdResourceFailed) resolved: (6) confd resource _srv_config-master_pybal_codfw_ores.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:01:49] ok, goody [15:02:09] very nice, we should be good now [15:02:58] Phew. Marking that part of T347278 done [15:02:59] T347278: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 [15:03:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [15:03:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)... 
[15:03:26] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [15:03:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [15:04:23] (03PS1) 10Btullis: Update the nginx regex for archiva to use allowlist of URLs [puppet] - 10https://gerrit.wikimedia.org/r/961833 (https://phabricator.wikimedia.org/T318962) [15:05:35] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [15:05:53] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [15:06:11] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:09:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST clusterissuers) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:10:15] (03CR) 10Btullis: [C: 03+2] Update the nginx regex for archiva to use allowlist of URLs [puppet] - 10https://gerrit.wikimedia.org/r/961833 (https://phabricator.wikimedia.org/T318962) (owner: 10Btullis) [15:10:29] (03PS2) 10Muehlenhoff: gerrit : Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952457 [15:13:25] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) @BTullis I'm not sure if you are the right person for hadoop related things or if you can d... 
[15:13:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952457 (owner: 10Muehlenhoff) [15:14:18] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST clusterissuers) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:15:34] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [15:15:51] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [15:16:59] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond) [15:17:08] (03PS3) 10Muehlenhoff: gerrit : Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952457 [15:17:37] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [15:21:08] (03PS1) 10Krinkle: NostalgiaTemplate.php: Fix array illegal offset error [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961717 [15:22:18] (03PS1) 10Jbond: postgress: update to use /etc/ssl/certs/wmf-ca-certificates.crt CA [puppet] - 10https://gerrit.wikimedia.org/r/961839 (https://phabricator.wikimedia.org/T340741) [15:23:33] (03CR) 10FNegri: [C: 03+2] Remove old cloud restricted bastion [puppet] - 10https://gerrit.wikimedia.org/r/961402 (https://phabricator.wikimedia.org/T340241) (owner: 10FNegri) [15:23:45] (03PS2) 10FNegri: Remove old cloud restricted bastion [puppet] - 
10https://gerrit.wikimedia.org/r/961402 (https://phabricator.wikimedia.org/T340241) [15:24:41] (03PS2) 10BBlack: traffic hosts: use broader regexes everywhere [puppet] - 10https://gerrit.wikimedia.org/r/961460 [15:26:16] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:27:01] (03PS1) 10Cwhite: profile: enable wal on grafana sqlite db [puppet] - 10https://gerrit.wikimedia.org/r/961510 (https://phabricator.wikimedia.org/T345362) [15:27:12] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 7 hosts matching query A:cp-upload_ulsfo and not P{cp4052*} [15:29:03] (03PS1) 10Jbond: druid: update to use puppetdb_query instead of query_classes [puppet] - 10https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) [15:30:22] jouncebot nowandnext [15:30:22] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [15:30:22] In 0 hour(s) and 29 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T1600) [15:30:47] i'm going to backport https://gerrit.wikimedia.org/r/c/mediawiki/skins/Nostalgia/+/961717 to clean up log monitoring. 
[15:31:59] (03CR) 10Ssingh: traffic hosts: use broader regexes everywhere (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/961460 (owner: 10BBlack) [15:32:06] (03PS2) 10Jbond: druid: update to use puppetdb_query instead of query_classes [puppet] - 10https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) [15:32:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952457 (owner: 10Muehlenhoff) [15:32:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961717 (owner: 10Krinkle) [15:32:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10BTullis) @RobH I'm sorry to have to be a pain, but is there any chance that we can increase the RAM in these two servers, before they come into ser... [15:33:36] (03CR) 10Herron: [C: 03+1] SLOs: Add SLO for Liftwing Readability isvc [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/961701 (https://phabricator.wikimedia.org/T334182) (owner: 10Klausman) [15:33:52] (03PS3) 10Jbond: druid: update to use puppetdb_query instead of query_classes [puppet] - 10https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) [15:35:56] (03CR) 10Muehlenhoff: "The patch was changed quite a bit over the initial PS1 which was reviewed, please revisit." 
[puppet] - 10https://gerrit.wikimedia.org/r/952457 (owner: 10Muehlenhoff) [15:36:23] (03Merged) 10jenkins-bot: NostalgiaTemplate.php: Fix array illegal offset error [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961717 (owner: 10Krinkle) [15:36:47] !log brennen@deploy2002 Started scap: Backport for [[gerrit:961717|NostalgiaTemplate.php: Fix array illegal offset error]] [15:38:22] !log brennen@deploy2002 krinkle and brennen: Backport for [[gerrit:961717|NostalgiaTemplate.php: Fix array illegal offset error]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:38:37] (03PS4) 10Jbond: druid: update to use puppetdb_query instead of query_classes [puppet] - 10https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) [15:39:08] (03CR) 10Ssingh: dns::dotls: expose and gather haproxy internal metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/948087 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [15:39:11] !log brennen@deploy2002 Sync cancelled. 
[15:39:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43729/console" [puppet] - 10https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:40:25] (03PS1) 10Hnowlan: media-analytics: bump container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/961842 (https://phabricator.wikimedia.org/T346202) [15:40:51] (03PS1) 10Arturo Borrero Gonzalez: codfw1dev: disable cloudservices2004-dev LDAP server [puppet] - 10https://gerrit.wikimedia.org/r/961843 (https://phabricator.wikimedia.org/T347555) [15:40:58] (03PS1) 10Brennen Bearnes: Revert "NostalgiaTemplate.php: Fix array illegal offset error" [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961719 [15:41:11] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "NostalgiaTemplate.php: Fix array illegal offset error" [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961719 (owner: 10Brennen Bearnes) [15:41:15] (03CR) 10Fabfur: [C: 03+1] "From a simple check in both eqiad and codfw, it seems that the headers for the old and new endpoints look good." [puppet] - 10https://gerrit.wikimedia.org/r/956909 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [15:41:20] (03CR) 10Jbond: [V: 03+1] "hi otto, luca, sorry if you are not the right people for review going from blame." 
[puppet] - 10https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [15:41:33] (03PS1) 10Hashar: ci: manage cinder volume on Castor instance [puppet] - 10https://gerrit.wikimedia.org/r/961844 (https://phabricator.wikimedia.org/T304080) [15:44:05] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/961844 (https://phabricator.wikimedia.org/T304080) (owner: 10Hashar) [15:44:09] (03CR) 10Hnowlan: [C: 03+2] media-analytics: bump container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/961842 (https://phabricator.wikimedia.org/T346202) (owner: 10Hnowlan) [15:44:56] (03Merged) 10jenkins-bot: media-analytics: bump container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/961842 (https://phabricator.wikimedia.org/T346202) (owner: 10Hnowlan) [15:45:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961719 (owner: 10Brennen Bearnes) [15:45:13] (03CR) 10Hashar: "[ 2023-09-28T15:44:32 ] CRITICAL: Unexpected error running run_host: Unable to find fact file for: integration-castor05.integration.eqiad1" [puppet] - 10https://gerrit.wikimedia.org/r/961844 (https://phabricator.wikimedia.org/T304080) (owner: 10Hashar) [15:45:55] (03Merged) 10jenkins-bot: Revert "NostalgiaTemplate.php: Fix array illegal offset error" [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961719 (owner: 10Brennen Bearnes) [15:45:57] (03PS2) 10Hashar: ci: manage cinder volume on Castor instance [puppet] - 10https://gerrit.wikimedia.org/r/961844 (https://phabricator.wikimedia.org/T304080) [15:46:24] !log brennen@deploy2002 Started scap: Backport for [[gerrit:961719|Revert "NostalgiaTemplate.php: Fix array illegal offset error"]] [15:47:29] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [15:47:47] !log hnowlan@deploy1002 helmfile 
[staging] DONE helmfile.d/services/rest-gateway: apply [15:47:47] !log brennen@deploy2002 brennen: Backport for [[gerrit:961719|Revert "NostalgiaTemplate.php: Fix array illegal offset error"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:48:01] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 8 hosts matching query A:cp-text_ulsfo [15:48:31] !log brennen@deploy2002 Sync cancelled. [15:48:48] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [15:49:11] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [15:49:19] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [15:49:55] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [15:51:29] (03CR) 10Hashar: "I have cherry picked it on the integration Puppet master. I ran Puppet on the sole affected instance integration-castor05 and it was a noo" [puppet] - 10https://gerrit.wikimedia.org/r/961844 (https://phabricator.wikimedia.org/T304080) (owner: 10Hashar) [15:51:45] (03PS3) 10BBlack: traffic hosts: use broader regexes everywhere [puppet] - 10https://gerrit.wikimedia.org/r/961460 [15:53:31] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [15:53:35] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Not intended to be merged, only for demo purposes." 
[puppet] - 10https://gerrit.wikimedia.org/r/961843 (https://phabricator.wikimedia.org/T347555) (owner: 10Arturo Borrero Gonzalez) [15:54:15] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [15:55:57] (03CR) 10BBlack: traffic hosts: use broader regexes everywhere (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/961460 (owner: 10BBlack) [15:58:26] (03PS6) 10Jgiannelos: wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961696 [16:00:04] jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:14] (03PS4) 10BBlack: traffic hosts: use broader regexes everywhere [puppet] - 10https://gerrit.wikimedia.org/r/961460 [16:01:03] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST clusterservingruntimes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:02:24] (03CR) 10Hnowlan: [C: 03+2] "Output from the service is now minified!" 
[puppet] - 10https://gerrit.wikimedia.org/r/956909 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [16:03:19] (03PS2) 10Cwhite: profile: enable wal on grafana sqlite db [puppet] - 10https://gerrit.wikimedia.org/r/961510 (https://phabricator.wikimedia.org/T345362) [16:03:24] !log disabled puppet on A:cp [16:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:44] (03PS7) 10Jgiannelos: wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961696 [16:06:03] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (LIST clusterservingruntimes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:08:04] (03CR) 10Ssingh: [C: 03+1] traffic hosts: use broader regexes everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961460 (owner: 10BBlack) [16:09:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH) [16:14:31] !log enabling puppet on A:cp, routing mediarequests API via rest-gateway [16:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:29] (03CR) 10BBlack: [C: 03+2] traffic hosts: use broader regexes everywhere [puppet] - 10https://gerrit.wikimedia.org/r/961460 (owner: 10BBlack) [16:16:00] 10SRE, 10Traffic: Varnish should allow PURGE requests only from socket (purged) - https://phabricator.wikimedia.org/T347192 (10Fabfur) All cp hosts restarted in ulsfo, change is actually applied, no issues so far. Proceeding with other DCs [16:19:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH) >>! 
In T342291#9207501, @BTullis wrote: > @RobH I'm sorry to have to be a pain, but is there any chance that we can increase the RAM in these... [16:23:37] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-text_codfw [16:23:41] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 8 hosts matching query A:cp-upload_codfw [16:26:54] !log joal@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [16:26:58] !log joal@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [16:28:54] (03PS3) 10Jdlrobson: Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) [16:29:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:30:39] (03PS5) 10Jdlrobson: update sawikiquote logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961803 (https://phabricator.wikimedia.org/T341260) (owner: 10Anzx) [16:32:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:34:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:34:20] 10SRE, 10Abstract 
Wikipedia team, 10Traffic, 10Wikifunctions, 10serviceops: Separate deployment for wikifunctions.org - https://phabricator.wikimedia.org/T347544 (10Jdforrester-WMF) 05Open→03In progress [16:35:04] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.cdn.roll-restart-varnish (exit_code=97) rolling restart of Varnish on 8 hosts matching query A:cp-text_codfw [16:35:07] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.cdn.roll-restart-varnish (exit_code=97) rolling restart of Varnish on 8 hosts matching query A:cp-upload_codfw [16:35:39] (03CR) 10Hnowlan: [C: 03+1] wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961696 (owner: 10Jgiannelos) [16:36:04] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961696 (owner: 10Jgiannelos) [16:36:54] (03Merged) 10jenkins-bot: wikifeeds: Use core page html endpoint for outgoing parsoid reqs [deployment-charts] - 10https://gerrit.wikimedia.org/r/961696 (owner: 10Jgiannelos) [16:37:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:37:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host eventlog1003.eqiad.wmnet [16:39:51] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [16:39:54] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [16:41:47] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [16:41:51] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [16:41:54] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling 
restart of Varnish on 7 hosts matching query A:cp-upload_codfw and not P{cp2028*} [16:41:55] brennen: i just saw https://gerrit.wikimedia.org/r/c/mediawiki/skins/Nostalgia/+/961477 and its revert, do you know what's the current status of that problem? i don't see a task anywhere [16:41:57] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.cdn.roll-restart-varnish (exit_code=97) rolling restart of Varnish on 7 hosts matching query A:cp-upload_codfw and not P{cp2028*} [16:41:59] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 7 hosts matching query A:cp-upload_codfw and not P{cp2028*} [16:42:02] !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-restart-varnish rolling restart of Varnish on 7 hosts matching query A:cp-text_codfw and not P{cp2027*} [16:42:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host eventlog1003.eqiad.wmnet [16:42:11] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [16:42:14] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [16:42:53] (03PS1) 10Jbond: scap::dsh::group: switch from query_nodes to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961850 (https://phabricator.wikimedia.org/T341373) [16:42:59] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:43:14] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [16:44:15] brennen: i guess nostalgiawiki is on 1.41.0-wmf.28 now, 
so it can't be too bad. okay [16:46:30] (03PS2) 10Jbond: scap::dsh::group: switch from query_nodes to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961850 (https://phabricator.wikimedia.org/T341373) [16:50:23] (03PS3) 10Jbond: scap::dsh::group: switch from query_nodes to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961850 (https://phabricator.wikimedia.org/T341373) [16:57:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T343198)', diff saved to https://phabricator.wikimedia.org/P52728 and previous config saved to /var/cache/conftool/dbconfig/20230928-165706-arnaudb.json [16:57:12] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T1700) [17:01:18] (03PS4) 10Jbond: scap::dsh::group: switch from query_nodes to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961850 (https://phabricator.wikimedia.org/T341373) [17:06:24] (03CR) 10Bking: [C: 03+2] cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:07:35] (03PS2) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) [17:07:39] (03PS5) 10Jbond: scap::dsh::group: switch from query_nodes to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961850 (https://phabricator.wikimedia.org/T341373) [17:07:42] (03CR) 10Bking: cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:08:01] (03CR) 10Bking: cloudelastic: new partman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:09:26] (03CR) 10Bking: [C: 03+2] 
cloudelastic: new partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/961478 (https://phabricator.wikimedia.org/T342463) (owner: 10Bking) [17:10:02] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) [17:12:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P52729 and previous config saved to /var/cache/conftool/dbconfig/20230928-171212-arnaudb.json [17:14:02] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) [17:14:51] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [17:14:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... 
[17:21:51] (03PS8) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [17:22:19] (03CR) 10CI reject: [V: 04-1] k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [17:23:44] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review, 10Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (10jbond) [17:24:02] (03PS1) 10Jbond: redis::slave: switch to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961857 (https://phabricator.wikimedia.org/T341373) [17:27:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P52730 and previous config saved to /var/cache/conftool/dbconfig/20230928-172719-arnaudb.json [17:30:07] (03PS4) 10Herron: pyrra: add serveraliases and redirect to apache config [puppet] - 10https://gerrit.wikimedia.org/r/961131 (https://phabricator.wikimedia.org/T302995) [17:31:07] (03PS1) 10TChin: Enable unaligned checkpointing for codfw mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/961859 (https://phabricator.wikimedia.org/T347615) [17:33:07] (03PS9) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [17:33:21] (03CR) 10Herron: [C: 03+2] pyrra: add serveraliases and redirect to apache config [puppet] - 10https://gerrit.wikimedia.org/r/961131 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:33:29] (03CR) 10Joal: [C: 03+1] "LGTM - let's try!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/961859 (https://phabricator.wikimedia.org/T347615) (owner: 10TChin) [17:33:42] (03CR) 10TChin: [C: 03+2] Enable unaligned checkpointing for codfw mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/961859 (https://phabricator.wikimedia.org/T347615) (owner: 10TChin) [17:34:32] (03Merged) 10jenkins-bot: Enable unaligned checkpointing for codfw mw-page-content-change-enrich [deployment-charts] - 10https://gerrit.wikimedia.org/r/961859 (https://phabricator.wikimedia.org/T347615) (owner: 10TChin) [17:35:39] (03PS10) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [17:35:48] (03PS1) 10Jbond: prometheus::class_config: convert to wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961861 (https://phabricator.wikimedia.org/T341373) [17:37:44] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43737/console" [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [17:39:16] !log joal@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [17:39:19] !log joal@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:39:57] (03PS1) 10Herron: thanos::httpd: include rewrite module [puppet] - 10https://gerrit.wikimedia.org/r/961862 (https://phabricator.wikimedia.org/T302995) [17:42:21] (03CR) 10CI reject: [V: 04-1] thanos::httpd: include rewrite module [puppet] - 10https://gerrit.wikimedia.org/r/961862 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:42:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T343198)', diff saved to https://phabricator.wikimedia.org/P52731 and previous config saved to /var/cache/conftool/dbconfig/20230928-174230-arnaudb.json [17:42:32] !log 
arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [17:42:36] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [17:42:46] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [17:42:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T343198)', diff saved to https://phabricator.wikimedia.org/P52732 and previous config saved to /var/cache/conftool/dbconfig/20230928-174251-arnaudb.json [17:43:03] (03PS2) 10Herron: thanos::httpd: include rewrite module [puppet] - 10https://gerrit.wikimedia.org/r/961862 (https://phabricator.wikimedia.org/T302995) [17:43:06] (03PS1) 10Jbond: prometheus::jmx_exporter_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961863 (https://phabricator.wikimedia.org/T341373) [17:43:38] (03PS11) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [17:43:51] (03PS1) 10Brion VIBBER: Video transcode update for experimental HLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) [17:45:37] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43742/console" [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [17:46:53] (03PS1) 10Jbond: prometheus::pdu_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961865 (https://phabricator.wikimedia.org/T341373) [17:47:11] (03PS12) 10Ebernhardson: k8s config: Provide kafka and zookeeper hostnames [puppet] - 10https://gerrit.wikimedia.org/r/960662 [17:48:33] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43744/console" [puppet] - 10https://gerrit.wikimedia.org/r/961862 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:48:52] (03CR) 10Herron: [V: 03+1 C: 03+2] thanos::httpd: include rewrite module [puppet] - 10https://gerrit.wikimedia.org/r/961862 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [17:49:03] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43745/console" [puppet] - 10https://gerrit.wikimedia.org/r/960662 (owner: 10Ebernhardson) [17:50:11] (03CR) 10Ahmon Dancy: [C: 03+1] "OK w/ me." [puppet] - 10https://gerrit.wikimedia.org/r/961850 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:51:37] !log Imported acme-chief from Gerrit into Gitlab [17:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:30] (03PS1) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [17:53:48] (03CR) 10Brion VIBBER: [C: 04-1] "keeping a -1 in here at the moment since it can't go out early or it breaks stuff. i'll remove once i figure out how to make config & vers" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) (owner: 10Brion VIBBER) [17:53:51] MatmaRex: it's reverted on .28, i suspect it probably should be reverted on master but have not confirmed. [17:56:11] (03CR) 10Andrew Bogott: "you prefer this to https://gerrit.wikimedia.org/r/c/operations/puppet/+/961188 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/961796 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [17:56:29] (03PS1) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [17:57:05] (03CR) 10CI reject: [V: 04-1] prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [17:57:21] (03PS2) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:00:06] (03PS3) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:00:06] dduvall and brennen: #bothumor I ♥ Unicode. All rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T1800). [18:01:40] uhhh okay, i'll file a task and figure out what is going on here [18:01:50] but it looks like not a release blocker [18:02:25] apologies for not filing a task; had to go afk for a dr's appointment for a bit. [18:04:54] (03PS4) 10Krinkle: noc: Fix various PHP errors that prevent db.php from working locally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944355 (https://phabricator.wikimedia.org/T341859) [18:05:38] (03PS1) 10TChin: Revert "Enable unaligned checkpointing for codfw mw-page-content-change-enrich" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961721 [18:07:51] brennen: no problem. 
filed https://phabricator.wikimedia.org/T347620 [18:09:25] (03PS2) 10Brion VIBBER: Video transcode update for experimental HLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) [18:09:53] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye [18:10:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [18:10:36] (03CR) 10Brion VIBBER: "ok it should be safe to deploy at any time now :D Will take effect disabling some old VP8 and low-res VP9 WebMs immediately, and the new H" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) (owner: 10Brion VIBBER) [18:11:55] (03PS1) 10Krinkle: noc: Add new format=json to wiki.php, and JSON button to db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961869 [18:14:23] (03CR) 10Andrew Bogott: [C: 03+2] Only install ppolicy.schema with OpenLDAP < 2.5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961796 (https://phabricator.wikimedia.org/T331699) (owner: 10Muehlenhoff) [18:16:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T343198)', diff saved to https://phabricator.wikimedia.org/P52733 and previous config saved to /var/cache/conftool/dbconfig/20230928-181635-arnaudb.json [18:16:48] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [18:19:43] (03CR) 10Joal: [C: 03+1] "Error seems unrelated to reverted change" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961721 (owner: 10TChin) [18:20:00] (03PS4) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query 
[puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:21:18] !log tchin@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [18:21:20] !log tchin@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [18:22:36] (03CR) 10TChin: [C: 03+2] Revert "Enable unaligned checkpointing for codfw mw-page-content-change-enrich" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961721 (owner: 10TChin) [18:23:26] (03Merged) 10jenkins-bot: Revert "Enable unaligned checkpointing for codfw mw-page-content-change-enrich" [deployment-charts] - 10https://gerrit.wikimedia.org/r/961721 (owner: 10TChin) [18:23:43] (03PS5) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:24:04] !log renaming cloud-hosts1-codfw vlan to cloud-hosts1-b1-codfw on cloudsw1-b1-codfw [18:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:20] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961872 (https://phabricator.wikimedia.org/T345889) [18:25:22] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961872 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot) [18:26:11] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961872 (https://phabricator.wikimedia.org/T345889) (owner: 10TrainBranchBot) [18:31:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P52734 and previous config saved to /var/cache/conftool/dbconfig/20230928-183141-arnaudb.json [18:33:26] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 
1.41.0-wmf.28 refs T345889 [18:33:32] T345889: 1.41.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T345889 [18:33:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PUT endpoints) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:34:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 13): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43749/console" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:36:22] (03PS6) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:37:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43750/console" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:38:35] (KubernetesAPILatency) resolved: (22) High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:40:38] (03PS7) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:41:01] (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2004-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:41:03] (03CR) 10CI reject: [V: 04-1] prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) 
(owner: 10Jbond) [18:41:40] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel puppetmaster1004 to puppetserver1003 - https://phabricator.wikimedia.org/T347395 (10VRiley-WMF) a:03VRiley-WMF [18:41:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43751/console" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:45:19] (03PS8) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:45:45] (03CR) 10CI reject: [V: 04-1] prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:46:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P52735 and previous config saved to /var/cache/conftool/dbconfig/20230928-184648-arnaudb.json [18:48:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:49:13] (03PS9) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:49:42] (03CR) 10CI reject: [V: 04-1] prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:50:58] (03PS1) 10DCausse: flink: upgrade to flink 1.17.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/961877 
(https://phabricator.wikimedia.org/T346719) [18:53:25] (03PS10) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:53:50] (03CR) 10CI reject: [V: 04-1] prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:55:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43754/console" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:56:27] (03PS11) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [18:56:53] (03CR) 10CI reject: [V: 04-1] prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:57:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43755/console" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [18:58:49] (03PS1) 10Bartosz Dziewoński: Handle SpecialPage::getDescription() returning a Message [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961725 (https://phabricator.wikimedia.org/T347620) [18:59:42] (03PS12) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [19:00:12] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 7 hosts matching query A:cp-text_codfw and not P{cp2027*} 
[19:00:40] !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-restart-varnish (exit_code=0) rolling restart of Varnish on 7 hosts matching query A:cp-upload_codfw and not P{cp2028*} [19:00:54] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel puppetmaster1004 to puppetserver1003 - https://phabricator.wikimedia.org/T347395 (10VRiley-WMF) Updated label on puppetserver1003 [19:01:08] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel puppetmaster1004 to puppetserver1003 - https://phabricator.wikimedia.org/T347395 (10VRiley-WMF) 05Open→03Resolved [19:01:10] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10VRiley-WMF) [19:01:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T343198)', diff saved to https://phabricator.wikimedia.org/P52736 and previous config saved to /var/cache/conftool/dbconfig/20230928-190154-arnaudb.json [19:01:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:02:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [19:02:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T343198)', diff saved to https://phabricator.wikimedia.org/P52737 and previous config saved to /var/cache/conftool/dbconfig/20230928-190216-arnaudb.json [19:03:01] (03PS13) 10Jbond: prometheus::resource_config: update to use wmflib::puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) [19:04:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43757/console" [puppet] - 
10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:05:23] (03PS1) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [19:07:57] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:08:37] (03PS2) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [19:10:57] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:12:42] !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1007'] [19:13:04] !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1007'] [19:13:19] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1007.wikimedia.org'] [19:13:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:14:12] !log bking@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cloudelastic1007.wikimedia.org'] [19:14:17] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1007.wikimedia.org'] [19:14:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 14): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43758/console" [puppet] - 10https://gerrit.wikimedia.org/r/961867 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:18:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:20:10] (03PS2) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [19:22:08] (03PS3) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [19:23:03] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1007.wikimedia.org'] [19:24:20] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [19:24:29] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.eqiad.wmnet with OS bullseye [19:24:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... 
[19:24:35] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:24:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [19:27:57] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [19:28:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... [19:28:25] (03PS3) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [19:30:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:30:17] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:30:55] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:32:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43761/console" [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:32:44] (03PS4) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack 
class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [19:35:23] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:35:26] (03PS4) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [19:35:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.477 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:36:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50569 bytes in 7.796 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:37:00] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:39:36] dduvall: brennen: how do you feel about backporting https://gerrit.wikimedia.org/r/c/mediawiki/skins/Nostalgia/+/961725 now? 
i was going to schedule it for the window, but it's already full [19:39:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43762/console" [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:39:49] (it fixes the Nostalgia log spam) [19:41:03] jouncebot nowandnext [19:41:03] For the next 0 hour(s) and 18 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T1800) [19:41:04] In 0 hour(s) and 18 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T2000) [19:41:26] (03PS5) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [19:41:41] MatmaRex: yeah, i'm down. [19:41:47] (03PS5) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [19:41:53] (03CR) 10CI reject: [V: 04-1] prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:42:09] brennen: i'm not a deployer, so you'd have to click the buttons [19:42:14] yep, clicking [19:42:17] thanks [19:42:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961725 (https://phabricator.wikimedia.org/T347620) (owner: 10Bartosz Dziewoński) [19:44:27] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:46:33] (03Merged) 10jenkins-bot: Handle 
SpecialPage::getDescription() returning a Message [skins/Nostalgia] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/961725 (https://phabricator.wikimedia.org/T347620) (owner: 10Bartosz Dziewoński) [19:46:50] !log brennen@deploy2002 Started scap: Backport for [[gerrit:961725|Handle SpecialPage::getDescription() returning a Message (T347620)]] [19:46:56] T347620: "PHP Warning: Illegal offset type" in Nostalgia skin causing the special page list dropdown to be almost empty - https://phabricator.wikimedia.org/T347620 [19:47:17] (03PS6) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [19:48:10] !log brennen@deploy2002 matmarex and brennen: Backport for [[gerrit:961725|Handle SpecialPage::getDescription() returning a Message (T347620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:48:26] MatmaRex: https://nostalgia.wikipedia.org/wiki/HomePage loads - anything else to check? [19:49:11] brennen: and shows the list of special pages too. looks good [19:49:44] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:49:48] (03CR) 10C. 
Scott Ananian: [BETA HACK] Allow external access from anywhere to parsoid port 80 for CI purposes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/941477 (owner: 10Krinkle) [19:50:19] cool, thx [19:50:21] !log brennen@deploy2002 matmarex and brennen: Continuing with sync [19:53:49] (03PS7) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [19:55:38] (03PS8) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [19:55:43] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer (T347624, testing new cookbook changes) xfer categories => wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet, repooling both afterwards [19:55:48] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [19:55:52] (03PS6) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [19:56:17] (03CR) 10CI reject: [V: 04-1] prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [19:56:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:56:43] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:961725|Handle SpecialPage::getDescription() returning a Message (T347620)]] (duration: 09m 53s) [19:56:48] T347620: "PHP Warning: Illegal offset type" in Nostalgia skin causing the special page list dropdown to be almost empty - 
https://phabricator.wikimedia.org/T347620 [19:58:12] (03CR) 10Ryan Kemper: "Added Ben and Balthazar for visibility/context sharing." [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:58:19] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:58:30] thanks [19:59:17] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:59:45] thanks for the fix [20:00:06] brennen and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230928T2000). [20:00:06] James_F and Jdlrobson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:23] * James_F waves. [20:00:39] o/ [20:01:00] o/ [20:01:11] Who's actually deploying? :-) [20:01:20] o/ [20:01:28] o/ [20:01:32] :D [20:01:34] (KubernetesAPILatency) resolved: (20) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:01:36] :-P [20:01:40] is this a "not it" sort of thing? [20:01:41] (03PS7) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [20:01:53] Bring in the wheel! [20:02:16] It's just that I have a meting in 27 minutes' time. 
[20:02:20] (03CR) 10CI reject: [V: 04-1] prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [20:02:25] I can deploy [20:02:28] <3 [20:02:45] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:02:46] Also "my" three config patches are all for the same config file and for the same wiki. [20:02:47] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:02:54] (03PS9) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [20:02:56] James_F: re: https://phabricator.wikimedia.org/T347627 my sense is that it does not warrant a rollback, being it's labtestwiki [20:02:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:03:05] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:03:12] but please, anyone else let me know otherwise [20:03:13] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:03:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... 
[20:03:18] James_F: so the question is: can I deploy them all at once or were you hoping for a slow rollout? [20:03:21] dduvall: let me fix that, one moment [20:03:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad... [20:03:28] thcipriani: Go for it, all in one is fine. [20:03:35] taavi: <3 [20:03:37] James_F: will do [20:03:39] dduvall: Agreed, but I wanted to wait in case there's something I'm missing before just creating the table. [20:03:45] taavi: Thanks! [20:03:45] here [20:04:04] (03PS3) 10Jforrester: Add 'confirmed' to Wikifunctions sysop add and remove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954363 (https://phabricator.wikimedia.org/T344261) (owner: 10Terasail) [20:04:07] (03PS3) 10Jforrester: add 'autopatrol' to Wikifunctions' functioneer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [20:04:11] (03PS4) 10Jforrester: add autopatrolled group with autopatrol right for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [20:04:29] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T347624, testing new cookbook changes) xfer categories => wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet, repooling both afterwards [20:04:40] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [20:05:15] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [20:06:50] (03CR) 10Jbond: [V: 03+1] 
"PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43766/console" [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [20:07:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43767/console" [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [20:07:51] (03PS10) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [20:07:56] (03PS8) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [20:08:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954363 (https://phabricator.wikimedia.org/T344261) (owner: 10Terasail) [20:08:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [20:08:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [20:08:30] we'll see if this gets blocked by the gerrit merge policy for that repo :) [20:08:34] (03CR) 10CI reject: [V: 04-1] prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [20:08:36] 10SRE, 10ops-codfw, 10User-aborrero, 10User-dcaro, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, 
switches, ceph) - https://phabricator.wikimedia.org/T346661 (10nskaggs) @cmooney I raised a similar question about expanding WMCS racks in eqiad and as I understood the answer w... [20:08:57] PROBLEM - Check systemd state on logstash1030 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:09:17] PROBLEM - OpenSearch health check for shards on 9200 on logstash1030 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f92724ab280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [20:09:17] org/wiki/Search%23Administration [20:09:36] (03Merged) 10jenkins-bot: Add 'confirmed' to Wikifunctions sysop add and remove [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954363 (https://phabricator.wikimedia.org/T344261) (owner: 10Terasail) [20:09:42] (03CR) 10CI reject: [V: 04-1] add autopatrolled group with autopatrol right for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [20:09:46] Boo CI. [20:09:51] ^ [20:10:00] I'll rebase. [20:10:27] I was thinking rebasing one on top of the other would probably magic this into working [20:11:10] Sadly not. 
[20:11:35] !log create new oathauth tables on labtestwikitech and run `taavi@cloudweb2002-dev ~ $ mwscript extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php labtestwiki`, fixes T347627 [20:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:40] T347627: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'labtestwiki.oathauth_devices' doesn't exist - https://phabricator.wikimedia.org/T347627 [20:11:55] (03PS11) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [20:11:58] (03CR) 10Bking: [C: 03+1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [20:12:18] (03CR) 10Gehel: "minor comment inline (I haven't spend much time reviewing yet)" [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [20:12:24] (03PS40) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [20:12:26] (03PS3) 10AOkoth: clamav: disable ConcurrentDatabaseReload [puppet] - 10https://gerrit.wikimedia.org/r/961689 (https://phabricator.wikimedia.org/T347450) [20:12:28] (03PS1) 10AOkoth: aptrepo: update gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/961889 [20:12:45] (03PS9) 10Jbond: prometheus::redis_exporter_config: update to use puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) [20:13:06] James_F, dduvall, the labtestwiki issue was just fixed. 
and yes, it's generally fine to ignore things that only pop up there [20:13:23] (03PS5) 10Jforrester: add autopatrolled group with autopatrol right for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [20:13:25] RECOVERY - Check systemd state on logstash1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:31] thcipriani: Good to retry. [20:13:42] awesome, thanks taavi [20:13:45] RECOVERY - OpenSearch health check for shards on 9200 on logstash1030 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 654, active_shards: 1499, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [20:13:45] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:13:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [20:13:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [20:14:07] (03PS2) 10AOkoth: aptrepo: update gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/961889 [20:14:11] (03CR) 10Jbond: "sorry for the noise, ready to review now" [puppet] - 10https://gerrit.wikimedia.org/r/961866 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [20:14:21] (03PS4) 10Jforrester: add 
'autopatrol' to Wikifunctions' functioneer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [20:14:32] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [20:14:38] (03CR) 10TrainBranchBot: "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [20:14:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [20:14:46] this time for sure [20:15:12] Uh-huh. [20:15:32] (03CR) 10EoghanGaffney: [C: 03+1] aptrepo: update gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/961889 (owner: 10AOkoth) [20:15:39] (03Merged) 10jenkins-bot: add autopatrolled group with autopatrol right for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [20:15:47] I kind of wonder if trainbranchbot should be trying to stack things all trying to merge together or if that would just be more confusing :) [20:16:05] In a GitLab world, it's a bit different anyway. [20:16:10] Is it worth it? 
[20:16:17] (03PS5) 10Thcipriani: add 'autopatrol' to Wikifunctions' functioneer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [20:16:23] (03CR) 10AOkoth: [C: 03+2] aptrepo: update gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/961889 (owner: 10AOkoth) [20:16:26] (03CR) 10Thcipriani: [C: 03+2] add 'autopatrol' to Wikifunctions' functioneer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [20:16:29] (03PS3) 10AOkoth: aptrepo: update gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/961889 [20:17:25] (03Merged) 10jenkins-bot: add 'autopatrol' to Wikifunctions' functioneer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [20:17:35] (03CR) 10Jforrester: "Nice milestone!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:17:49] OK, finally they're all landed. [20:18:18] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:954363|Add 'confirmed' to Wikifunctions sysop add and remove (T344261)]], [[gerrit:948196|add 'autopatrol' to Wikifunctions' functioneer group (T344085)]], [[gerrit:947495|add autopatrolled group with autopatrol right for Wikifunctions (T343946)]] [20:18:31] T344261: Add the ability for administrators to add and remove the confirmed user group - https://phabricator.wikimedia.org/T344261 [20:18:32] T344085: add 'autopatrol' to Wikifunctions' functioneer group - https://phabricator.wikimedia.org/T344085 [20:18:32] T343946: 'autopatrolled' group for Wikifunctions - https://phabricator.wikimedia.org/T343946 [20:18:39] James_F: in this moment it feels like worth it, but maybe that feeling will subside after I'm done deploying :) [20:18:56] Yeah. 
[20:19:18] And of course in the magical future all these kinds of things will be configured on-wiki and no deployments will be needed at all, right? :-) [20:19:37] !log thcipriani@deploy2002 mdaniels5757 and thcipriani and terasail: Backport for [[gerrit:954363|Add 'confirmed' to Wikifunctions sysop add and remove (T344261)]], [[gerrit:948196|add 'autopatrol' to Wikifunctions' functioneer group (T344085)]], [[gerrit:947495|add autopatrolled group with autopatrol right for Wikifunctions (T343946)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:19:46] ^ James_F all on mwdebug, check please [20:20:43] thcipriani: LGTM. [20:20:50] cool, going live in all the places [20:21:45] !log thcipriani@deploy2002 mdaniels5757 and thcipriani and terasail: Continuing with sync [20:24:50] thcipriani: ready when you are. [20:25:04] Jdlrobson: cool, you're up next :) [20:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:26:06] 10SRE, 10ops-codfw, 10User-aborrero, 10User-dcaro, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10nskaggs) @cmooney You predicted this question and answered it already here :-) https://phabricator.wikimedia.org/T... 
[20:28:24] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:954363|Add 'confirmed' to Wikifunctions sysop add and remove (T344261)]], [[gerrit:948196|add 'autopatrol' to Wikifunctions' functioneer group (T344085)]], [[gerrit:947495|add autopatrolled group with autopatrol right for Wikifunctions (T343946)]] (duration: 10m 06s) [20:28:32] T344261: Add the ability for administrators to add and remove the confirmed user group - https://phabricator.wikimedia.org/T344261 [20:28:32] T344085: add 'autopatrol' to Wikifunctions' functioneer group - https://phabricator.wikimedia.org/T344085 [20:28:33] T343946: 'autopatrolled' group for Wikifunctions - https://phabricator.wikimedia.org/T343946 [20:28:34] ^ James_F your patches are live everywhere [20:29:57] (03CR) 10Thcipriani: [C: 03+2] update sawikiquote logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961803 (https://phabricator.wikimedia.org/T341260) (owner: 10Anzx) [20:30:27] thcipriani: Thanks! [20:30:34] (KubernetesAPILatency) resolved: (22) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:31:12] (03Merged) 10jenkins-bot: update sawikiquote logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961803 (https://phabricator.wikimedia.org/T341260) (owner: 10Anzx) [20:32:16] (03PS4) 10Thcipriani: Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:33:06] (03PS1) 10Ahmon Dancy: logspam-watch: And refreshing indicator [puppet] - 10https://gerrit.wikimedia.org/r/961893 [20:33:10] Jdlrobson: can we do the logos one together and then remove the desktop-improvements on its own? 
[20:33:47] yep [20:34:02] (03PS2) 10Thcipriani: Wikimedia special project logo updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961484 (owner: 10Jdlrobson) [20:34:05] (03CR) 10Thcipriani: [C: 03+2] Wikimedia special project logo updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961484 (owner: 10Jdlrobson) [20:35:03] (03Merged) 10jenkins-bot: Wikimedia special project logo updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961484 (owner: 10Jdlrobson) [20:35:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:35:37] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:961803|update sawikiquote logos (T341260)]], [[gerrit:961484|Wikimedia special project logo updates]] [20:35:38] (03PS1) 10Majavah: mailmap: add extra entry for me [puppet] - 10https://gerrit.wikimedia.org/r/961895 [20:35:42] T341260: Design: Provide wordmarks for Wikiquote projects - https://phabricator.wikimedia.org/T341260 [20:36:05] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:36:57] !log thcipriani@deploy2002 anzx and jdlrobson and thcipriani: Backport for [[gerrit:961803|update sawikiquote logos (T341260)]], [[gerrit:961484|Wikimedia special project logo updates]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:37:15] ^ Jdlrobson first and last patch: live on mwdebug, check please [20:38:08] (03CR) 10Majavah: [C: 03+2] mailmap: add extra entry for me [puppet] - 10https://gerrit.wikimedia.org/r/961895 (owner: 10Majavah) [20:38:17] (03CR) 10Brennen Bearnes: [C: 03+1] logspam-watch: And refreshing indicator [puppet] - 10https://gerrit.wikimedia.org/r/961893 (owner: 10Ahmon Dancy) 
[20:40:38] (looking) [20:40:45] ack, thanks [20:41:48] thcipriani: LGTM [20:42:24] (03PS5) 10Thcipriani: Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:45:09] Jdlrobson: thanks for checking, going live [20:45:34] !log thcipriani@deploy2002 anzx and jdlrobson and thcipriani: Continuing with sync [20:48:28] (03PS1) 10Majavah: dumps: replace WMCS paging Icinga check with Blackbox probe [puppet] - 10https://gerrit.wikimedia.org/r/961897 [20:49:43] (03CR) 10Krinkle: Drop the desktop improvements dblist group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:50:51] (03PS6) 10Jdlrobson: Drop the desktop improvements dblist group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) [20:51:00] (03CR) 10Jdlrobson: Drop the desktop improvements dblist group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:51:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:52:10] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:961803|update sawikiquote logos (T341260)]], [[gerrit:961484|Wikimedia special project logo updates]] (duration: 16m 32s) [20:52:16] T341260: Design: Provide wordmarks for Wikiquote projects - https://phabricator.wikimedia.org/T341260 [20:52:31] (03PS2) 10Majavah: dumps: replace WMCS paging Icinga check with Blackbox probe [puppet] - 10https://gerrit.wikimedia.org/r/961897 [20:53:35] Jdlrobson: first and last are live. 
Looks like you made a change to the middle one: good to go? [20:53:46] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43770/console" [puppet] - 10https://gerrit.wikimedia.org/r/961897 (owner: 10Majavah) [20:54:10] (03PS3) 10Majavah: dumps: replace WMCS paging Icinga check with Blackbox probe [puppet] - 10https://gerrit.wikimedia.org/r/961897 [20:54:24] thcipriani: yep good to go [20:54:32] (03CR) 10Krinkle: "Is it intentional at this changes $wgDefaultSkin and $wgVectorDefaultSkinVersionForExistingAccounts for strategywiki, nawiki, and akwiki?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:54:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: 10Jdlrobson) [20:55:21] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:55:23] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43771/console" [puppet] - 10https://gerrit.wikimedia.org/r/961897 (owner: 10Majavah) [20:55:29] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:55:30] oh good, comment at the last moment :D [20:55:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e... 
[20:55:38] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad...
[20:55:42] (Merged) jenkins-bot: Drop the desktop improvements dblist group [mediawiki-config] - https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: Jdlrobson)
[20:55:57] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:961471|Drop the desktop improvements dblist group (T347444)]]
[20:56:02] T347444: New projects should get Vector 2022 skin - https://phabricator.wikimedia.org/T347444
[20:56:05] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:56:34] (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:57:12] !log thcipriani@deploy2002 jdlrobson and thcipriani: Backport for [[gerrit:961471|Drop the desktop improvements dblist group (T347444)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:58:25] Jdlrobson: per Krinkle 's comment are those closed wikis intended to change? Do I need to pause here and go back?
[20:58:38] yes they are intended
[20:58:55] ok, in that case, live on mwdebug machines, check please
[20:58:56] i mentioned it in the commit message "Restore Vector 2022 skin to closed wikis that were explicitly using the new skin"
[20:59:00] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[20:59:10] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e...
[20:59:41] thcipriani: LGTM
[20:59:45] German still has legacy Vector
[20:59:53] and they are the ones I'm scared of most haha
[21:00:08] (PS2) Brennen Bearnes: logspam-watch: Add refreshing indicator [puppet] - https://gerrit.wikimedia.org/r/961893 (owner: Ahmon Dancy)
[21:00:45] !log thcipriani@deploy2002 jdlrobson and thcipriani: Continuing with sync
[21:00:52] (CR) Brennen Bearnes: [C: +1] "Tweaked a tiny bit after testing in production for a while. Should be good to go."
[puppet] - https://gerrit.wikimedia.org/r/961893 (owner: Ahmon Dancy)
[21:00:55] Jdlrobson: going live
[21:01:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:01:26] (PS1) Ahmon Dancy: Revert "logspam.pl: Ignore messages from mwmaint* hosts" [puppet] - https://gerrit.wikimedia.org/r/961906
[21:01:42] (CR) CI reject: [V: -1] Revert "logspam.pl: Ignore messages from mwmaint* hosts" [puppet] - https://gerrit.wikimedia.org/r/961906 (owner: Ahmon Dancy)
[21:02:10] (CR) Jdlrobson: Drop the desktop improvements dblist group (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/961471 (https://phabricator.wikimedia.org/T347444) (owner: Jdlrobson)
[21:02:14] thcipriani: sweet
[21:03:44] (CR) Ahmon Dancy: [C: +1] logspam-watch: Add refreshing indicator [puppet] - https://gerrit.wikimedia.org/r/961893 (owner: Ahmon Dancy)
[21:05:23] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:05:43] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:06:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:07:04] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:07:19] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:961471|Drop the desktop improvements dblist group (T347444)]] (duration: 11m 22s)
[21:07:21] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:07:27] ^ Jdlrobson all done
[21:07:27] T347444: New projects should get Vector 2022 skin - https://phabricator.wikimedia.org/T347444
[21:07:43] (PS2) Ahmon Dancy: Partially revert "logspam.pl: Ignore messages from mwmaint* hosts" [puppet] - https://gerrit.wikimedia.org/r/961906
[21:10:25] thcipriani: yay thank you
[21:10:35] (CR) Brennen Bearnes: [C: +1] "As discussed, this seems like it reduces confusion between systems, so I'm in favor. We can add a toggle if it becomes a problem."
[puppet] - https://gerrit.wikimedia.org/r/961906 (owner: Ahmon Dancy)
[21:11:03] Jdlrobson: thanks for the fixups and kudos
[21:12:04] (KubernetesAPILatency) resolved: (14) High Kubernetes API latency (LIST blockaffinities) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:13:58] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[21:14:07] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad...
[21:14:26] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1007.wikimedia.org']
[21:21:57] PROBLEM - Check systemd state on install1004 is CRITICAL: CRITICAL - degraded: The following units failed: isc-dhcp-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:25:31] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1007.wikimedia.org']
[21:25:46] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1007.wikimedia.org']
[21:28:27] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[21:28:51] !log bking@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudelastic1007.wikimedia.org']
[21:30:39] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1028.eqiad.wmnet
[21:30:55] !log eevans@cumin1001 END (FAIL) - Cookbook
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1028.eqiad.wmnet
[21:31:31] SRE, ops-codfw, Infrastructure-Foundations, netops: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (cmooney)
[21:31:37] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[21:32:36] SRE, ops-codfw, Infrastructure-Foundations, netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (cmooney)
[21:32:42] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[21:36:30] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/2 Draft: Release 0.36-2 for Bookworm
[21:37:37] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/3 Draft: Update dependencies to match Bookworm versions
[21:37:50] SRE, Acme-chief, Traffic, Patch-For-Review: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/3 Draft: Update dependencies to match Bookworm versions
[21:38:05] (Abandoned) BCornwall: Update dependencies to match Bookworm versions [software/acme-chief] - https://gerrit.wikimedia.org/r/949544 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[21:39:05] (Abandoned) BCornwall: Allow configuration of AddressFamily used for DNS validation
[software/acme-chief] - https://gerrit.wikimedia.org/r/574221 (https://phabricator.wikimedia.org/T245937) (owner: Alex Monk)
[21:40:56] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices2005-dev.codfw.wmnet with OS bookworm
[21:41:40] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[21:42:26] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1031.eqiad.wmnet
[21:42:42] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1031.eqiad.wmnet
[21:45:01] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:45:13] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:48:54] (CR) Bking: [C: +1] flink: upgrade to flink 1.17.1 [docker-images/production-images] - https://gerrit.wikimedia.org/r/961877 (https://phabricator.wikimedia.org/T346719) (owner: DCausse)
[21:49:43] RECOVERY - Check systemd state on install1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:52] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[21:53:02] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host cloudelastic1007.e...
[21:53:42] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[21:54:10] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1022.eqiad.wmnet
[21:54:20] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1022.eqiad.wmnet
[21:54:31] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1023.eqiad.wmnet
[21:54:34] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1023.eqiad.wmnet
[21:54:42] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1024.eqiad.wmnet
[21:54:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1024.eqiad.wmnet
[21:55:28] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1029.eqiad.wmnet
[21:55:32] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1029.eqiad.wmnet
[21:55:40] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1029.eqiad.wmnet
[21:55:44] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1029.eqiad.wmnet
[21:56:01] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1032.eqiad.wmnet
[21:56:09] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1032.eqiad.wmnet
[21:56:40] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1025.eqiad.wmnet
[21:56:45] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1025.eqiad.wmnet
[21:57:02] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1026.eqiad.wmnet
[21:57:06] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1026.eqiad.wmnet
[21:57:17] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1027.eqiad.wmnet
[21:57:28] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1027.eqiad.wmnet
[21:57:54] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1030.eqiad.wmnet
[21:58:24] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1030.eqiad.wmnet
[21:58:34] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1033.eqiad.wmnet
[21:58:46] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1033.eqiad.wmnet
[21:59:15] SRE, Cassandra, Data-Persistence, Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[22:00:08] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
[22:01:43] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (bking) Hello DC Ops, I'm still getting PXE boot failures on `cloudelastic1007` . I've upgraded/downgraded to the...
[22:01:53] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (bking) a:bking→None
[22:02:47] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[22:02:56] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE, cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host cloudelastic1007.eqiad...
[22:03:03] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:03:15] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
[22:09:45] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:25:21] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:25:57] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:30:37] (CR) Cwhite: [C: +1] "Looks good, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/960075 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[22:40:26] (PS1) Cathal Mooney: Add automation to define ESI-LAGs on EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/961927 (https://phabricator.wikimedia.org/T347191)
[22:41:00] (CR) CI reject: [V: -1] Add automation to define ESI-LAGs on EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/961927 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[22:42:54] (PS2) Cathal Mooney: Add automation to define ESI-LAGs on EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/961927 (https://phabricator.wikimedia.org/T347191)
[22:44:01] (PS1) Andrew Bogott: Update cloudservices2005-dev to use mdb backend for ldap [puppet] - https://gerrit.wikimedia.org/r/961929
[22:44:43] (CR) Andrew Bogott: [C: +2] Update cloudservices2005-dev to use mdb backend for ldap [puppet] - https://gerrit.wikimedia.org/r/961929 (owner: Andrew Bogott)
[22:44:52] (CR) Cathal Mooney: Add automation to define ESI-LAGs on EVPN switches (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/961927 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[22:46:08] (CR) Cwhite: [C: +1] "Looks good! Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/960076 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[22:48:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:57:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T343198)', diff saved to https://phabricator.wikimedia.org/P52742 and previous config saved to /var/cache/conftool/dbconfig/20230928-225705-arnaudb.json
[22:57:11] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:05:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T343198)', diff saved to https://phabricator.wikimedia.org/P52743 and previous config saved to /var/cache/conftool/dbconfig/20230928-230512-arnaudb.json
[23:05:19] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:12:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P52744 and previous config saved to /var/cache/conftool/dbconfig/20230928-231211-arnaudb.json
[23:16:57] (PS1) EoghanGaffney: [gitlab/upgrade] Add option to skip backups on replica upgrades [cookbooks] - https://gerrit.wikimedia.org/r/961932
[23:19:23] (CR) CI reject: [V: -1] [gitlab/upgrade] Add option to skip backups on replica upgrades [cookbooks] - https://gerrit.wikimedia.org/r/961932 (owner: EoghanGaffney)
[23:20:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P52745 and previous config saved to /var/cache/conftool/dbconfig/20230928-232019-arnaudb.json
[23:22:51] (CR) Cwhite: [C: +1] "PCC OK:
https://puppet-compiler.wmflabs.org/output/960125/43772/" [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[23:23:06] (PS2) EoghanGaffney: [gitlab/upgrade] Add option to skip backups on replica upgrades [cookbooks] - https://gerrit.wikimedia.org/r/961932
[23:23:31] (CR) Cwhite: [C: +1] prometheus: switch to wmflib::get_config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/960125 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[23:24:38] (CR) Cwhite: "Commit message could be more clear, but patch content LGTM" [puppet] - https://gerrit.wikimedia.org/r/960126 (https://phabricator.wikimedia.org/T341373) (owner: Jbond)
[23:25:30] (CR) CI reject: [V: -1] [gitlab/upgrade] Add option to skip backups on replica upgrades [cookbooks] - https://gerrit.wikimedia.org/r/961932 (owner: EoghanGaffney)
[23:27:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P52746 and previous config saved to /var/cache/conftool/dbconfig/20230928-232718-arnaudb.json
[23:28:17] (PS3) EoghanGaffney: [gitlab/upgrade] Add option to skip backups on replica upgrades [cookbooks] - https://gerrit.wikimedia.org/r/961932
[23:35:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P52747 and previous config saved to /var/cache/conftool/dbconfig/20230928-233525-arnaudb.json
[23:42:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T343198)', diff saved to https://phabricator.wikimedia.org/P52748 and previous config saved to /var/cache/conftool/dbconfig/20230928-234224-arnaudb.json
[23:42:27] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[23:42:30] T343198: Add pl_target_id column to pagelinks in
production - https://phabricator.wikimedia.org/T343198
[23:42:40] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance
[23:42:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T343198)', diff saved to https://phabricator.wikimedia.org/P52749 and previous config saved to /var/cache/conftool/dbconfig/20230928-234246-arnaudb.json
[23:50:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T343198)', diff saved to https://phabricator.wikimedia.org/P52750 and previous config saved to /var/cache/conftool/dbconfig/20230928-235032-arnaudb.json
[23:50:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[23:50:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:50:38] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[23:50:47] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[23:50:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T343198)', diff saved to https://phabricator.wikimedia.org/P52751 and previous config saved to /var/cache/conftool/dbconfig/20230928-235053-arnaudb.json
[23:55:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency