[07:39:21] serviceops, iPoid-Service (iPoid 1.0): Empty traffic and error panels for ipoid grafana - https://phabricator.wikimedia.org/T356861 (kostajh)
[10:04:49] serviceops, DC-Ops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (jijiki) I set the host as inactive since I noticed a bit of log spam on lvs2013 `Feb 8 08:40:59 lvs2013 pybal[2489063]: [eventgate-analytics_4592 IdleConnection] WARN: mw2282.codfw.wmnet (enabled/...
[10:07:09] serviceops, DC-Ops, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (jijiki)
[11:08:47] serviceops, DC-Ops, SRE, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (cmooney) I think there may be an issue here with the cable (usually the NIC firmware issue hits us when the debian-installer does it's DHCP request, rather than at the initia...
[11:09:40] 👋 RESTBase scap is failing to deploy https://phabricator.wikimedia.org/T356898#9522001
[11:09:54] I think its related to https://phabricator.wikimedia.org/T352469
[11:10:46] I have a patch to cleanup old targets from scap config: https://gerrit.wikimedia.org/r/998488
[11:10:56] <_joe_> nemo-yiannis: yeah I was about to say
[11:10:59] <_joe_> that's what's needed
[11:11:33] thanks for the +1 _joe_
[14:23:42] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) >>! In T349619#9521720, @Volans wrote: > We could either catch the exception and retry or acquire a lock for all puppetserver ca operati...
[14:30:01] serviceops: Cross fleet runc upgrades - https://phabricator.wikimedia.org/T356661 (akosiaris) Open→Resolved >>! In T356661#9524903, @klausman wrote: > ml-serve in codfw also done, so all done for ML team Cool, thanks! I 'll resolve this one then.
[14:30:09] serviceops: Cross fleet runc upgrades - https://phabricator.wikimedia.org/T356661 (akosiaris) Marking this as public now that we are done patching.
[14:30:23] serviceops: Cross fleet runc upgrades - https://phabricator.wikimedia.org/T356661 (akosiaris)
[14:37:30] serviceops, DC-Ops, SRE, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Jhancock.wm) @cmooney the SFP failed. I've replaced it and it looks to be up now.
[14:44:53] serviceops, Prod-Kubernetes, Kubernetes: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (Clement_Goubert)
[14:46:52] serviceops, EventStreams, Prod-Kubernetes, Kubernetes: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (Clement_Goubert)
[15:17:38] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[15:47:17] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup...
[15:47:35] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:03:32] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup...
[16:03:44] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:03:50] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - **The reimage fa...
[16:04:27] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:10:15] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup...
[16:18:20] serviceops, EventStreams, Prod-Kubernetes, Kubernetes, Patch-For-Review: eventstreams regularly uses more than 95% of its memory limit - https://phabricator.wikimedia.org/T357005 (akosiaris) I am not sure this would actually solve the problem tbh. It don't hurt to try ofc, which is why I +1e,...
[16:26:11] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye
[16:31:14] serviceops, WMF-JobQueue, Patch-For-Review, Unstewarded-production-error, and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (Krinkle)
[16:31:39] serviceops, DC-Ops, SRE, ops-codfw: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (hnowlan) Good catch! Unfortunately I'm still seeing the same PXE behaviour failing on boot
[16:31:58] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors: - mw2282 (**FAIL**) - Removed from Pup...
[16:36:17] claime: should have said all good to repool those hosts
[16:36:20] thanks for the help!
[16:36:38] thanks, will do right after my meeting
[17:06:46] serviceops, SRE Observability: Introduce a way to retry checks for SystemdUnitFailed before alerting - https://phabricator.wikimedia.org/T357028 (Clement_Goubert)
[17:27:00] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (CodeReviewBot) jforrester closed https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/merg...
[23:58:58] serviceops, SRE Observability, Release-Engineering-Team (Radar): Introduce a way to retry checks for SystemdUnitFailed before alerting - https://phabricator.wikimedia.org/T357028 (brennen)