[07:44:04] serviceops, Push-Notification-Service: Allow `push-notifications` service to accept production environment flag for APNS requests - https://phabricator.wikimedia.org/T274456 (Jgiannelos) a:Jgiannelos→None
[08:25:30] serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (JMeybohm)
[08:35:37] serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (JMeybohm)
[08:37:35] serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (JMeybohm)
[08:39:23] serviceops: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=10429df6-7d65-43c3-8f79-49fdea88c7ca) set by jayme@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: host is down ` kubernetes2010.codfw.wmnet `
[08:44:03] serviceops, DC-Ops, ops-codfw: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (JMeybohm) Hey DC-Ops, could you please check on kubernetes2010
[09:19:41] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye
[09:20:03] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye
[09:20:11] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1032.eqiad.wmnet with OS bullseye
[09:45:33] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (Jelto)
[09:53:49] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1030.eqiad.wmnet with OS bullseye completed: - kubernetes1030 (**PASS**) - Downtimed on Icinga/Aler...
[09:55:35] serviceops, Observability-Metrics, Prod-Kubernetes, Kubernetes, Patch-For-Review: Refactor discovery of calico-felix targets in prometheus - https://phabricator.wikimedia.org/T346915 (JMeybohm) Open→Resolved a:JMeybohm I've updated the relevant dashboards (as the instance label of...
[09:59:48] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1031.eqiad.wmnet with OS bullseye completed: - kubernetes1031 (**PASS**) - Downtimed on Icinga/Aler...
[10:03:50] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1033.eqiad.wmnet with OS bullseye
[10:04:01] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1034.eqiad.wmnet with OS bullseye
[10:04:08] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1035.eqiad.wmnet with OS bullseye
[10:04:17] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye
[10:04:23] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1037.eqiad.wmnet with OS bullseye
[10:04:31] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye
[10:04:38] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1036 (**FAIL**) - Downtimed on...
[10:04:43] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1039.eqiad.wmnet with OS bullseye
[10:04:55] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1038 (**FAIL**) - Downtimed on...
[10:04:57] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1040.eqiad.wmnet with OS bullseye
[10:05:17] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1041.eqiad.wmnet with OS bullseye
[10:05:33] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1042.eqiad.wmnet with OS bullseye
[10:08:18] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye
[10:08:26] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1036 (**FAIL**) - Removed from...
[10:09:37] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye
[10:09:43] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1038 (**FAIL**) - Removed from...
[10:23:48] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1043.eqiad.wmnet with OS bullseye
[10:24:07] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye
[10:24:30] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1044.eqiad.wmnet with OS bullseye
[10:25:09] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1045.eqiad.wmnet with OS bullseye
[10:25:23] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1046.eqiad.wmnet with OS bullseye
[10:25:41] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1047.eqiad.wmnet with OS bullseye
[10:25:56] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1048.eqiad.wmnet with OS bullseye
[10:26:03] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1047.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1047 (**FAIL**) - Downtimed on...
[10:26:08] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1049.eqiad.wmnet with OS bullseye
[10:26:22] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1050.eqiad.wmnet with OS bullseye
[10:26:34] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye
[10:27:08] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye
[10:27:17] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1054.eqiad.wmnet with OS bullseye
[10:27:28] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye
[10:27:42] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1055.eqiad.wmnet with OS bullseye
[10:31:52] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1047.eqiad.wmnet with OS bullseye
[10:33:59] hello folks
[10:34:16] traffic to ores (bare metal) is now zero, all handled by ores-legacy on k8s
[10:34:24] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye
[10:34:41] elukey: Awesome!
[10:34:45] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye
[10:34:47] we don't use the redis cache so if you need to reboot those nodes etc.. we can safely do it now
[10:34:59] at some point we'll also start decomming everything
[10:35:36] Yeah we do have all the rdb hosts to reboot
[10:36:06] I'll check with the team if there's a specific procedure for those
[10:37:01] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1042.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1042 (**FAIL**) - Downtimed on...
[10:37:17] serviceops, RESTBase Sunsetting, Code-Health-Objective, Data Products (Sprint 01), Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (hnowlan) >>! In T342213#9187721, @Milimetric wrote: > AQS 1.0 is sending the required headers now, etag is...
[10:37:38] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1040.eqiad.wmnet with OS bullseye completed: - kubernetes1040 (**WARN**) - Downtimed on Icinga/Aler...
[10:39:51] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1041.eqiad.wmnet with OS bullseye completed: - kubernetes1041 (**PASS**) - Downtimed on Icinga/Aler...
[10:40:54] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1035.eqiad.wmnet with OS bullseye completed: - kubernetes1035 (**WARN**) - Downtimed on Icinga/Aler...
[10:41:13] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1033.eqiad.wmnet with OS bullseye completed: - kubernetes1033 (**PASS**) - Downtimed on Icinga/Aler...
[10:43:26] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1039.eqiad.wmnet with OS bullseye completed: - kubernetes1039 (**PASS**) - Downtimed on Icinga/Aler...
[10:46:43] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1037.eqiad.wmnet with OS bullseye completed: - kubernetes1037 (**WARN**) - Downtimed on Icinga/Aler...
[10:48:44] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1034.eqiad.wmnet with OS bullseye completed: - kubernetes1034 (**WARN**) - Downtimed on Icinga/Aler...
[10:54:31] serviceops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2023/2024-Q1): Identify path forward for k8s deployment of prometheus-statsd-exporter - https://phabricator.wikimedia.org/T343025 (Joe) When you have your code in core, and we have merged the above patches, we can start tes...
[10:57:24] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1051.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1051 (**FAIL**) - Downtimed on...
[10:58:04] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1055.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1055 (**FAIL**) - Downtimed on...
[10:58:07] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1045.eqiad.wmnet with OS bullseye completed: - kubernetes1045 (**PASS**) - Downtimed on Icinga/Aler...
[10:58:56] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1043.eqiad.wmnet with OS bullseye completed: - kubernetes1043 (**PASS**) - Downtimed on Icinga/Aler...
[11:03:24] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1056.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1056 (**FAIL**) - Downtimed on...
[11:05:24] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (LSobanski)
[11:06:09] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1048.eqiad.wmnet with OS bullseye completed: - kubernetes1048 (**WARN**) - Downtimed on Icinga/Aler...
[11:07:41] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1044.eqiad.wmnet with OS bullseye completed: - kubernetes1044 (**WARN**) - Downtimed on Icinga/Aler...
[11:08:33] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1049.eqiad.wmnet with OS bullseye completed: - kubernetes1049 (**WARN**) - Downtimed on Icinga/Aler...
[11:08:56] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1052.eqiad.wmnet with OS bullseye completed: - kubernetes1052 (**WARN**) - Downtimed on Icinga/Aler...
[11:09:11] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1046.eqiad.wmnet with OS bullseye completed: - kubernetes1046 (**WARN**) - Downtimed on Icinga/Aler...
[11:12:40] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1050.eqiad.wmnet with OS bullseye completed: - kubernetes1050 (**WARN**) - Downtimed on Icinga/Aler...
[11:13:14] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye completed: - kubernetes1038 (**WARN**) - Removed from Puppet and...
[11:17:05] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1047.eqiad.wmnet with OS bullseye completed: - kubernetes1047 (**WARN**) - Removed from Puppet and...
[11:17:11] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1053.eqiad.wmnet with OS bullseye completed: - kubernetes1053 (**WARN**) - Downtimed on Icinga/Aler...
[11:17:17] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye completed: - kubernetes1036 (**PASS**) - Removed from Puppet and...
[11:20:31] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubernetes1054.eqiad.wmnet with OS bullseye completed: - kubernetes1054 (**WARN**) - Downtimed on Icinga/Aler...
[11:36:33] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF) Thanks for the advice @joe! > What I fail to understand is how, if this was an open file l...
[11:37:24] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (JMeybohm) re-ran puppet on kubernetes[1027-1056].eqiad.wmnet and called the remove-downtime cookbook for them
[11:58:29] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (JMeybohm)
[11:59:03] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-53] - https://phabricator.wikimedia.org/T342534 (JMeybohm)
[11:59:11] serviceops, Patch-For-Review: Set up kubernetes10[27-56] - https://phabricator.wikimedia.org/T346714 (JMeybohm) Open→Resolved a:JMeybohm
[11:59:13] I think scaling down mobileapps may have caused some wobbliness in restbase: https://grafana.wikimedia.org/d/000000068/restbase?orgId=1&viewPanel=18&from=now-7d&to=now
[11:59:24] This aligns well enough with the change to replicas
[11:59:47] We'll be able to scale it back up now
[11:59:49] hnowlan: we can ramp that up again if you think it helps
[12:02:34] sounds good
[12:03:07] it's a bit weird, there is *some* throttling but it's nothing big enough to figure it'd have an impact that checks would be failing
[12:03:12] but I guess restbase is a bit sensitive
[12:03:20] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/960596
[12:23:35] +1ed
[13:19:33] serviceops, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432 (ArielGlenn)
[13:28:13] serviceops, MW-on-K8s, MediaWiki-Platform-Team (Radar): mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 (Bmueller)
[13:52:32] serviceops, SRE, Datacenter-Switchover: Sept 2023 Switchover: list new primary DC servers first in debug.json - https://phabricator.wikimedia.org/T346472 (kamila) Open→Resolved
[13:52:37] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (kamila)
[13:53:07] serviceops, SRE, Datacenter-Switchover: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (kamila)
[13:53:58] serviceops, SRE, Datacenter-Switchover: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (kamila) Open→Resolved
[13:54:05] serviceops, Data-Persistence, Performance-Team, SRE, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (kamila)
[14:43:58] serviceops, Abstract Wikipedia team, Wikifunctions, Wikimedia-production-error: Wikifunctions functions that require a lookup on wikifunctions.org timing out in the orchestrator, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (Jdforrester-WMF) I re-deployed after we moved t...
[14:48:03] serviceops, DC-Ops, SRE, ops-codfw: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Jhancock.wm) server is not getting to POST. starting troubleshooting.
[15:01:23] serviceops, Content-Transform-Team-WIP, Parsoid, RESTBase, and 3 others: Requests originating from zhwiki wikifeeds caused parsoid outage - https://phabricator.wikimedia.org/T346657 (MSantos)
[15:13:42] serviceops, DC-Ops, SRE, ops-codfw: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Jhancock.wm) @JMeybohm looks like the system board has died. Server powers on, but even with minimum hardware configuration the server will not actually boot up. Idrac is also inaccessible. This...
[15:14:09] serviceops, DC-Ops, SRE, ops-codfw: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Jhancock.wm) a:Jhancock.wm
[15:31:30] serviceops, DC-Ops, SRE, ops-codfw: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (JMeybohm) Thanks! We did not plan to decom immediately, so it would really help us if you could replace the board and we could run the server for a bit longer.
[16:42:45] serviceops, DC-Ops, SRE, ops-codfw: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Jhancock.wm) got it replaced. updated the asset tag, idrac IP, bios/idrac firmware, and adjusted some bios settings. the idrac and network addresses are pinging, and there are no alerts that I can...
[16:53:18] serviceops, DC-Ops, SRE, ops-codfw: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (JMeybohm) Nice, thanks for handling this so quickly! Nothing more to do from your end
[16:54:06] serviceops, DC-Ops, SRE, ops-codfw: kubernetes2010 down - https://phabricator.wikimedia.org/T347267 (Jhancock.wm) Open→Resolved
[16:58:52] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[17:12:14] the occasional blips in restbase/mobileapps are continuing despite increasing the replicas, must be something going wrong within the service itself
[17:12:22] but it coincides with the scaledown too tightly
[19:36:22] serviceops, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432 (matmarex) > Configure MW with ICU emulation for PHP 8.1 that matches PHP 7.4 (see also: {T263437}, {T292552}). I'm not sure where to raise this, so I'll...
[21:43:33] serviceops, Dumps-Generation, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432 (Reedy) >>! In T319432#9177477, @Tacsipacsi wrote: >>>! In T319432#8338918, @Krinkle wrote: >> We choose PHP 8.1, based on PHP 8.2 not being ready yet > >...
[23:34:59] serviceops, Observability-Alerting, observability: Port openapi/swagger checks/alerts to Prometheus - https://phabricator.wikimedia.org/T320620 (colewhite)
[23:35:07] serviceops, Observability-Alerting: Investigate swagger-exporter failures - https://phabricator.wikimedia.org/T346893 (colewhite) Open→Resolved Error logs have gone away and the prometheus view looks good.