[07:25:12] <elukey>	 morning!
[07:25:53] <elukey>	 so in eqiad I am running an experiment with the istio sidecar settings, and what I did was to cordon (avoid scheduling of new pods) ml-serve100[2-4] and disable puppet on 1001, to play with cni settings for the kubelet
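A minimal sketch of the cordon step mentioned above, assuming the bracket notation `ml-serve100[2-4]` denotes the hosts 1002 to 1004 and that the Kubernetes node names match those short hostnames (they may well be FQDNs in the real cluster). The helper names `expand_hosts` and `cordon_commands` are hypothetical; only `kubectl cordon` is a real command. The sketch only builds the command lines and does not run them.

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand one 'prefix[a-b]suffix' range into individual host names."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracket range present: treat the pattern as a single host.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding if the range uses it
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


def cordon_commands(pattern: str) -> List[List[str]]:
    """Build (but do not execute) one 'kubectl cordon' command per host."""
    return [["kubectl", "cordon", host] for host in expand_hosts(pattern)]


if __name__ == "__main__":
    for command in cordon_commands("ml-serve100[2-4]"):
        print(" ".join(command))
```

Returning argument lists rather than running them keeps the sketch safe to try; they could be passed to `subprocess.run` once the node names are confirmed.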
[07:26:08] <elukey>	 today I had to run puppet on the node to clear some alerts, and then I restored the previous config
[07:26:29] <elukey>	 the kube-api server, for some reason, started to log over and over stuff like
[07:26:32] <elukey>	 Failed to list *unstructured.Unstructured: conversion webhook for serving.kubeflow.org/v1beta1, Kind=InferenceService failed: Post https://kfserving-webhook-server-service.kfserving-system.svc:443/convert?timeout=30s: service "kfserving-webhook-server-service" not found
[07:26:52] <elukey>	 now this is interesting since "kfserving" is the old version of kserve (when it was still under kubeflow)
[07:27:25] <elukey>	 so I cleared some remaining webhook configs via kubectl (probably the kserve 0.7 upgrade procedure is still not 100% correct)
[07:27:32] <elukey>	 but I still see high latency issues
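One way to spot the leftover webhook configs described above, sketched under assumptions: the input is the parsed JSON from something like `kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json`, and each webhook carries a `clientConfig.service.name` field as in the `admissionregistration.k8s.io` API. The function name `stale_webhook_configs` is hypothetical, and this is not necessarily how the cleanup was actually done.

```python
from typing import List


def stale_webhook_configs(listing: dict, service_name: str) -> List[str]:
    """Return 'Kind/name' for webhook configurations calling service_name."""
    stale = []
    for item in listing.get("items", []):
        for hook in item.get("webhooks", []):
            # URL-based webhooks have no 'service' entry, so default to {}.
            service = hook.get("clientConfig", {}).get("service") or {}
            if service.get("name") == service_name:
                stale.append(f'{item.get("kind", "?")}/{item["metadata"]["name"]}')
                break  # one match is enough to flag this configuration
    return stale
```

With the service name from the log line, the call would be `stale_webhook_configs(listing, "kfserving-webhook-server-service")`.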
[07:27:45] <elukey>	 https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&var-datasource=thanos&var-site=eqiad&var-cluster=k8s-mlserve&from=now-1h&to=now
[07:36:52] <elukey>	 mmm now only a big chunk of 500s are reported, the rest looks reasonably good
[09:16:44] <wikibugs>	 Machine-Learning-Team, artificial-intelligence, Edit-Review-Improvements-RC-Page, Growth community maintenance, and 3 others: Enable ORES in RecentChanges for Hindi Wikipedia - https://phabricator.wikimedia.org/T303293 (kostajh) @DMburugu @MShilova_WMF it sounds like it's on Growth to implement t...
[09:22:43] <elukey>	 there is an old CRD to delete, inferenceservices.serving.kubeflow.org, but via kubectl everything hangs
[09:22:46] <elukey>	 sigh
[11:19:23] <klausman>	 Does increasing verbosity help?
[11:26:07] <elukey>	 didn't try, IIUC from Janis there may be some resources that still reference the CRD to drop
[11:26:20] <elukey>	 once we find them we should be able to drop the other CRD and clear the issue
[11:26:27] <elukey>	 didn't have time to check yet
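A hedged sketch for the check that is still pending here. The conversion webhook named in the earlier error is configured on the CRD itself, not in the admission webhook configs, so it may help to list which CRDs still point at the old service and whether they are stuck in deletion. This assumes the input is the parsed output of `kubectl get crd -o json`, and it reads both the `apiextensions.k8s.io/v1` layout (`spec.conversion.webhook.clientConfig`) and the older `v1beta1` one (`spec.conversion.webhookClientConfig`). The function names are hypothetical.

```python
from typing import List, Optional


def crd_conversion_service(crd: dict) -> Optional[str]:
    """Return the service name of the CRD's conversion webhook, if any."""
    conversion = crd.get("spec", {}).get("conversion") or {}
    client_config = (
        (conversion.get("webhook") or {}).get("clientConfig")  # v1 layout
        or conversion.get("webhookClientConfig")  # v1beta1 layout
        or {}
    )
    service = client_config.get("service") or {}
    return service.get("name")


def crds_using_service(listing: dict, service_name: str) -> List[dict]:
    """Summarize CRDs whose conversion webhook targets service_name."""
    report = []
    for crd in listing.get("items", []):
        if crd_conversion_service(crd) == service_name:
            meta = crd["metadata"]
            report.append(
                {
                    "name": meta["name"],
                    "finalizers": meta.get("finalizers", []),
                    # A deletionTimestamp means a delete was requested
                    # but has not completed yet.
                    "deleting": "deletionTimestamp" in meta,
                }
            )
    return report
```

A CRD that shows `deleting: True` with a non-empty `finalizers` list would be consistent with the hanging `kubectl` delete, although the sketch cannot confirm the cause on its own.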
[11:40:28] <elukey>	 going to lunch! Will check later :)
[14:00:22] <elukey>	 I have workers at home in a bit, I may be on and off for the next hour :)
[14:29:59] <elukey>	 very nice, the twisted pair that I have in my last mile connection was oxidized, only one cable was used instead of two
[14:30:03] <elukey>	  /o\
[14:39:21] <klausman>	 How did that even work?
[14:39:29] <klausman>	 Or was it just "unpaired"?
[14:45:13] <elukey>	 I have no idea :D
[14:56:50] <elukey>	 all kubernetes workers on bullseye! \o/
[14:57:08] <elukey>	 now we can finally think about upgrading k8s
[14:57:35] * elukey plays ACDC - It’s a Long Way to the Top
[15:27:34] <aiko>	 nice!!! Luca \o/
[15:32:39] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson)
[15:34:31] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson)
[15:37:43] <wikibugs>	 Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Experiment with the Istio TLS mesh - https://phabricator.wikimedia.org/T297612 (elukey) Next steps:  * create the `istio-cni` infrastructure user in puppet, and add the config to deploy its credentials in profile::calico::kubernetes *...
[15:39:55] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet wit...
[16:01:49] <elukey>	 klausman: I am reading https://phabricator.wikimedia.org/T277191 for the cluster reset
[16:01:57] <elukey>	 (Alex suggested it)
[16:02:10] <elukey>	 so far the idea, IIUC, could be to
[16:02:33] <elukey>	 1) shutdown all nodes (master ctrl + workers)
[16:02:59] <elukey>	 2) cleanup ml-etcd via the command stated in the task
[16:03:25] <elukey>	 3) reimage the control plane and allow puppet to deploy the configs etc..
[16:03:32] <elukey>	 4) reimage all worker nodes as well
[16:03:41] <elukey>	 in theory after this the cluster should come up empty
[16:03:50] <elukey>	 (and we avoid any weird old state)
[16:03:56] <elukey>	 what do you think?
[16:04:09] <klausman>	 Re: 1) shutdown as in stop k8s services, I presume?
[16:04:21] <elukey>	 nono I mean shutdown the node
[16:04:40] <klausman>	 power-off?
[16:04:43] <elukey>	 yep
[16:05:20] <klausman>	 Ah, so when redoing etcd, keep them off to be super safe, then set them for reimage-on-next-boot?
[16:05:36] <klausman>	 Then boot the ctrl plane first, then the workers?
[16:05:50] <elukey>	 exactly yes, just to prevent weird things like the new control plane booted and a worker with  stale configs joining
[16:05:59] <klausman>	 Ack. SGTM
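The reset plan agreed above can be written down as an ordered list plus the ordering constraints discussed (everything powered off before etcd is cleaned, control plane reimaged before the workers). This is only a sketch: the step names and the `violations` helper are hypothetical labels for the four steps in the chat, not an existing cookbook or tool.

```python
from typing import List, Sequence, Tuple

# The four steps as listed in the discussion, in the proposed order.
RESET_PLAN = [
    "power_off_all_nodes",  # control plane and workers
    "cleanup_etcd",  # wipe ml-etcd as described in the referenced task
    "reimage_control_plane",  # puppet then deploys the configs
    "reimage_workers",
]

# Each pair (first, second) means 'first' has to happen before 'second'.
MUST_PRECEDE = [
    ("power_off_all_nodes", "cleanup_etcd"),
    ("cleanup_etcd", "reimage_control_plane"),
    ("reimage_control_plane", "reimage_workers"),
]


def violations(order: Sequence[str]) -> List[Tuple[str, str]]:
    """Return the constraints that 'order' breaks (missing steps count)."""
    position = {step: index for index, step in enumerate(order)}
    return [
        (first, second)
        for first, second in MUST_PRECEDE
        if first not in position
        or second not in position
        or position[first] >= position[second]
    ]


if __name__ == "__main__":
    print(violations(RESET_PLAN))
```

Keeping the constraints separate from the list makes it easy to check a modified plan, for example one that splits the workers into batches, against the same rules.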
[16:06:38] <elukey>	 I have a question mark about why the TLS cert for the LVS endpoint was changed in the task, but I'll follow up with service ops
[16:07:05] <klausman>	 Yeah, that seems unnecessary
[16:07:28] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1001.eqiad.wmnet with OS...
[16:07:43] <elukey>	 ack so I am going to create a task with the plan, and ask for a comment from serviceops
[16:08:02] <elukey>	 when we have the green light for the new subnets we can proceed in my opinion
[16:08:21] <klausman>	 :+1:
[16:08:35] <elukey>	 I have basically completed the tests for the sidecars, and we can deploy them later on
[16:08:38] <elukey>	 super
[16:09:11] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet wit...
[16:09:15] <klausman>	 If you want extra hands+eyes during the thing, I'm game, of course
[16:09:50] <elukey>	 oh yes we can split the work for sure, or do it together on meet
[16:10:11] <elukey>	 it will be tedious but hopefully something done in a day
[16:13:52] <klausman>	 Ack
[16:18:20] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye
[16:35:58] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1002.eqiad.wmnet with OS bullseye completed: -...
[16:36:58] <chrisalbon>	 Morning all. A little late because I have a cold
[16:39:32] <elukey>	 o/
[16:44:27] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ml-cache1003.eqiad.wmnet with OS bullseye completed: -...
[16:46:34] <wikibugs>	 Machine-Learning-Team, Observability-Logging, SRE: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (colewhite) @elukey This issue hasn't reappeared since we began dropping the field. If you're ok with keeping this mitigation in place, please feel free to c...
[17:00:39] <wikibugs>	 Machine-Learning-Team, Observability-Logging, SRE: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (elukey) Open→Resolved I am yes! Thanks a lot for the support!
[17:55:15] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson)
[17:55:57] <wikibugs>	 Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Cmjohnson) Open→Resolved Completed
[18:42:50] * elukey afk!