[08:58:16] I am doing a kubelet partition resize for ml2003 and ml2002 today. It's not an urgent piece of maintenance, but I want to make sure I understand the process. Should not be too disruptive.
[09:17:39] And done.
[09:18:00] Machine-Learning-Team, Patch-For-Review, sre-alert-triage: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (klausman) I have done ml2002 and ml2003 today (two machines to force some pods back...
[10:12:36] Machine-Learning-Team, Patch-For-Review, sre-alert-triage: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (JMeybohm) I might be missing something here, but what issues did you have with the...
[10:37:38] * klausman lunch
[10:43:46] Machine-Learning-Team, Patch-For-Review, sre-alert-triage: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (klausman) The problem is only really relevant for LLMs (Large Language Models), sin...
[10:50:13] Machine-Learning-Team: Expand the Lift Wing workers' kubelet partition - https://phabricator.wikimedia.org/T339231 (klausman) (copied from T343900, this ticket is more appropriate for this info) I have done ml2002 and ml2003 today (two machines to force some pods back onto 2002, to see it works properly). S...
[13:24:11] Machine-Learning-Team, Infrastructure-Foundations, Prod-Kubernetes, Kubernetes: Issues deploying calico to ml-staging-codfw and aux-k8s-eqiad - https://phabricator.wikimedia.org/T333302 (JMeybohm) In aux the calico deployment failed because the cluster is not row redundant and typha has a pod anti...
[13:24:31] Machine-Learning-Team, Infrastructure-Foundations, Prod-Kubernetes, Kubernetes: Issues deploying calico to ml-staging-codfw and aux-k8s-eqiad - https://phabricator.wikimedia.org/T333302 (JMeybohm) Open→Resolved a:JMeybohm
[13:59:55] Machine-Learning-Team, sre-alert-triage: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (JMeybohm) >>! In T343900#9092723, @klausman wrote: > The problem is only really relevant for LLMs (Large...
[14:01:23] Machine-Learning-Team, Item Quality Evaluator, Wikidata, Wikidata Dev Team, Wikidata.org: [SW] Update API calls from ORES to Lift Wing - https://phabricator.wikimedia.org/T343731 (Arian_Bozorg)