[07:25:12] morning!
[07:25:53] so in eqiad I am running an experiment with the istio sidecar settings, and what I did was to cordon (avoid scheduling of new pods) ml-serve100[2-4] and disable puppet on 1001, to play with cni settings for the kubelet
[07:26:08] today I had to run puppet on the node to clear some alerts, and then I restored the previous config
[07:26:29] the kube-api server, for some reason, started to log over and over stuff like
[07:26:32] Failed to list *unstructured.Unstructured: conversion webhook for serving.kubeflow.org/v1beta1, Kind=InferenceService failed: Post https://kfserving-webhook-server-service.kfserving-system.svc:443/convert?timeout=30s: service "kfserving-webhook-server-service" not found
[07:26:52] now this is interesting since "kfserving" is the old version of kserve (when it was still under kubeflow)
[07:27:25] so I cleared some remaining webhook configs via kubectl (probably the kserve 0.7 upgrade procedure is still not 100% correct)
[07:27:32] but I still see high latency issues
[07:27:45] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s-mlserve&from=now-1h&to=now
[07:36:52] mmm now only a big chunk of 500s are reported, the rest looks reasonably good
[09:16:44] Machine-Learning-Team, artificial-intelligence, Edit-Review-Improvements-RC-Page, Growth community maintenance, and 3 others: Enable ORES in RecentChanges for Hindi Wikipedia - https://phabricator.wikimedia.org/T303293 (kostajh) @DMburugu @MShilova_WMF it sounds like it's on Growth to implement t...
[09:22:43] there is an old CRD to delete, inferenceservices.serving.kubeflow.org, but via kubectl everything hangs
[09:22:46] sigh
[11:19:23] Does increasing verbosity help?
[11:26:07] didn't try, IIUC from Janis there may be some resources that still reference the CRD to drop
[11:26:20] once we find them we should be able to drop the other CRD and clear the issue
[11:26:27] didn't have time to check yet
[11:40:28] going to lunch! Will check later :)
[14:00:22] I have workers at home in a bit, I may be on and off for the next hour :)
[14:29:59] very nice, the twisted pair that I have in my last mile connection was oxidized, only one cable was used instead of two
[14:30:03] /o\
[14:39:21] How did that even work?
[14:39:29] Or was it just "unpaired"?
[14:45:13] I have no idea :D
[14:56:50] all kubernetes workers on bullseye! \o/
[14:57:08] now we can finally think about upgrading k8s
[14:57:35] * elukey plays ACDC - It’s a Long Way to the Top
[15:27:34] nice!!! Luca \o/
[15:32:39] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson)
[15:34:31] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson)
[15:37:43] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (elukey) Next steps: * create the `istio-cni` infrastructure user in puppet, and add the config to deploy its credentials in profile::calico::kubernetes *...
[15:39:55] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet wit...
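The apiserver error quoted at 07:26:32 suggests that a CRD still has a conversion webhook pointing at a Service that no longer exists. A minimal sketch of how such dangling references could be spotted, assuming the CRD objects have already been dumped as dictionaries following the `apiextensions.k8s.io/v1` layout (`spec.conversion.webhook.clientConfig.service`). The function name, the sample data, and the `kserve` entries are made up for illustration; this is not the procedure that was used on the cluster.

```python
def dangling_conversion_webhooks(crds, existing_services):
    """Return (crd_name, namespace, service_name) for every CRD whose
    conversion webhook targets a Service missing from existing_services."""
    dangling = []
    for crd in crds:
        conversion = crd.get("spec", {}).get("conversion") or {}
        if conversion.get("strategy") != "Webhook":
            continue  # strategy "None": no webhook involved
        service = (
            conversion.get("webhook", {})
            .get("clientConfig", {})
            .get("service")
        )
        if not service:
            continue  # URL-based webhook, no Service reference to check
        key = (service.get("namespace"), service.get("name"))
        if key not in existing_services:
            dangling.append((crd["metadata"]["name"], key[0], key[1]))
    return dangling


# Hypothetical sample data: the first CRD mirrors the names in the error
# message, the second one and the service set are invented for the example.
sample_crds = [
    {
        "metadata": {"name": "inferenceservices.serving.kubeflow.org"},
        "spec": {
            "conversion": {
                "strategy": "Webhook",
                "webhook": {
                    "clientConfig": {
                        "service": {
                            "namespace": "kfserving-system",
                            "name": "kfserving-webhook-server-service",
                        }
                    }
                },
            }
        },
    },
    {
        "metadata": {"name": "inferenceservices.serving.kserve.io"},
        "spec": {
            "conversion": {
                "strategy": "Webhook",
                "webhook": {
                    "clientConfig": {
                        "service": {
                            "namespace": "kserve",
                            "name": "kserve-webhook-server-service",
                        }
                    }
                },
            }
        },
    },
]
sample_services = {("kserve", "kserve-webhook-server-service")}

result = dangling_conversion_webhooks(sample_crds, sample_services)
for crd_name, namespace, service_name in result:
    print(f"{crd_name}: webhook service {namespace}/{service_name} not found")
```

Working on exported objects rather than live API calls keeps the check usable even when kubectl hangs on the affected resource type.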
[16:01:49] klausman: I am reading https://phabricator.wikimedia.org/T277191 for the cluster reset
[16:01:57] (Alex suggested it)
[16:02:10] so far the idea, IIUC, could be to
[16:02:33] 1) shut down all nodes (master ctrl + workers)
[16:02:59] 2) clean up ml-etcd via the command stated in the task
[16:03:25] 3) reimage the control plane and allow puppet to deploy the configs etc.
[16:03:32] 4) reimage all worker nodes as well
[16:03:41] in theory after this the cluster should come up empty
[16:03:50] (and we avoid any weird old state)
[16:03:56] what do you think?
[16:04:09] Re: 1) shutdown as in stop k8s services, I presume?
[16:04:21] nono I mean shut down the node
[16:04:40] power-off?
[16:04:43] yep
[16:05:20] Ah, so when re-doing etcd, keep them off to be super safe, then set them for reimage-on-next-boot?
[16:05:36] Then boot the ctrl plane first, then the workers?
[16:05:50] exactly yes, just to prevent weird things like the new control plane booted and a worker with stale configs joining
[16:05:59] Ack. SGTM
[16:06:38] I have a question mark about why the TLS cert for the LVS endpoint was changed in the task, but I'll follow up with service ops
[16:07:05] Yeah, that seems unnecessary
[16:07:28] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet with OS...
[16:07:43] ack so I am going to create a task with the plan, and ask for a comment from serviceops
[16:08:02] when we have the green light for the new subnets we can proceed in my opinion
[16:08:21] :+1:
[16:08:35] I have basically completed the tests for the sidecars, and we can deploy them later on
[16:08:38] super
[16:09:11] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet wit...
[16:09:15] If you want extra hands+eyes during the thing, I'm game, of course
[16:09:50] oh yes we can split the work for sure, or do it together on meet
[16:10:11] it will be tedious but hopefully something done in a day
[16:13:52] Ack
[16:18:20] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye
[16:35:58] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye completed: -...
[16:36:58] Morning all. A little late because I have a cold
[16:39:32] o/
[16:44:27] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye completed: -...
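The reset order discussed between 16:02 and 16:05 (power off everything, clean up ml-etcd, reimage the control plane, then the workers) could be written down as a small checklist generator. This is only a sketch: the control-plane host names and the action labels are hypothetical, and nothing here runs the reimage cookbook or the etcd cleanup command from T277191.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    order: int
    action: str
    targets: tuple


def reset_plan(control_plane, workers, etcd_cluster):
    """Build the ordered reset steps: all nodes stay powered off while etcd
    is cleaned, then the control plane comes back before any worker."""
    control_plane = tuple(control_plane)
    workers = tuple(workers)
    return [
        Step(1, "power-off", control_plane + workers),
        Step(2, "clean-etcd", (etcd_cluster,)),
        Step(3, "reimage", control_plane),
        Step(4, "reimage", workers),
    ]


# Worker names follow the ml-serve100[1-4] range mentioned in the log;
# the control-plane names are invented for the example.
plan = reset_plan(
    control_plane=["ctrl-a.example", "ctrl-b.example"],
    workers=[f"ml-serve100{i}" for i in range(1, 5)],
    etcd_cluster="ml-etcd",
)

for step in plan:
    print(f"{step.order}) {step.action}: {', '.join(step.targets)}")
```

Keeping the worker reimage as the last step encodes the concern raised at 16:05:50: a worker with stale configs should never be up while the new control plane is booting.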
[16:46:34] Machine-Learning-Team, Observability-Logging, SRE: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (colewhite) @elukey This issue hasn't reappeared since we began dropping the field. If you're ok with keeping this mitigation in place, please feel free to c...
[17:00:39] Machine-Learning-Team, Observability-Logging, SRE: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (elukey) Open→Resolved I am yes! Thanks a lot for the support!
[17:55:15] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson)
[17:55:57] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson) Open→Resolved Completed
[18:42:50] * elukey afk!