[08:47:22] hello folks
[08:47:39] I am experimenting with network policies, so things in ml-serve-eqiad may not work correctly today
[10:46:12] still working on the network policies, but a lot of progress :)
[10:46:18] going to file two code reviews soon-ish
[11:43:44] * elukey lunch!
[16:20:57] Hey ML folks! We're seeing a massive increase in logstash indexing failures since ~9:30 from knative-serving: activator. There is a type conflict on the "error" field which is an object, but is being rendered as a string. I think elukey has worked on these sorts of errors before?
[16:22:02] cwhite: o/ it is again https://phabricator.wikimedia.org/T288549, there is a pod down due to some work in progress
[16:22:20] is there a way to circumvent this indexing error on our side?
[16:24:29] Hm, based on that task, a path forward doesn't seem clear.
[16:25:35] cwhite: what I meant was if there is a way to explicitly instruct logstash to accept both formats for the error field (super ignorant about logstash)
[16:26:34] IIUC the indexing schema is inferred by logstash based on the traffic, right?
[16:28:10] It's not possible to instruct ES to store them both. It's like trying to insert varchar into an integer column in mysql. Seems we're left with adding mutations to work around the issue?
[16:30:01] no idea about mutations :(
[16:52:50] The field is hardcoded :(
[17:02:11] Machine-Learning-Team, Observability-Logging: Indexing errors from logs generated by Activator - https://phabricator.wikimedia.org/T288549 (colewhite) This issue appeared again today. Having a look around, it appears the `error` key is hardcoded. We'll have to mutate it in the pipeline for now. * http...
[17:38:44] added the egress policies for kserve and knative-serving
[17:38:49] all working
[17:39:11] cwhite: now the pod is up, so the errors should be gone.. but of course only temporarily :(
[17:40:09] it has not been fixed in any of the 0.2x versions (nor in 1.0)
[17:40:34] we could try to send a pull request to fix this
[17:40:45] (so that newer versions will be fixed)
[17:41:22] It strikes me as odd that we would be the only ones affected
[17:43:46] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Add network policies to the ML k8s clusters - https://phabricator.wikimedia.org/T289834 (elukey) Some updates: * fixed label targeting for the `kserve-inference` chart, and added a specific rule for the `queue-proxy` container. * added basic Global...
[17:46:28] klausman: it seems strange, but maybe there are not a lot of logstash users; no idea how the cool new projects store logs from pods
[17:46:37] maybe there is a different way that knative supports
[17:47:24] cwhite: nice! https://gerrit.wikimedia.org/r/c/operations/puppet/+/737440
[17:48:03] Nice work
[18:50:12] * elukey afk!
[18:50:13] o/
[21:17:39] Lift-Wing, Machine-Learning-Team: Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (ACraze)
[21:20:30] Lift-Wing, Machine-Learning-Team: Sunset MiniKF sandboxes - https://phabricator.wikimedia.org/T293677 (ACraze) As discussed in the ML Team Meeting today, I have terminated the MiniKF 1.3 sandbox as it has diverged too much from our production stack to be useful anymore (also I broke a bunch of things whi...
[21:22:24] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Factor out feature retrieve functionality to a transformer - https://phabricator.wikimedia.org/T294419 (ACraze)
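A note on the workaround discussed for T288549: since Elasticsearch cannot index the `error` field as an object in some documents and as a string in others, the suggestion in the log is to mutate the field in the pipeline before it is indexed. The minimal Python sketch below only illustrates that kind of normalization, assuming log events arrive as dicts. It is not the Logstash/puppet change in Gerrit 737440, whose contents are not shown here, and the `message` sub-key used to wrap string values is a hypothetical choice.

```python
from typing import Any, Dict

# Hypothetical sub-key used to hold a string-valued "error" inside an object.
# The real pipeline change may use a different field name or approach.
WRAPPED_KEY = "message"


def normalize_error_field(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a log event whose "error" field is never a plain string.

    Elasticsearch maps a field to a single type, so if "error" is already
    mapped as an object, a document with a string "error" fails to index.
    Wrapping the string in a dict keeps the field an object.
    """
    normalized = dict(event)  # shallow copy; the caller's event is left untouched
    error = normalized.get("error")
    if isinstance(error, str):
        normalized["error"] = {WRAPPED_KEY: error}
    return normalized


if __name__ == "__main__":
    events = [
        {"msg": "request failed", "error": {"type": "timeout"}},  # already an object
        {"msg": "request failed", "error": "dial tcp: connection refused"},  # string
        {"msg": "ok"},  # no error field at all
    ]
    for e in events:
        print(normalize_error_field(e))
```

Wrapping rather than dropping the string keeps the original error text searchable while matching the object type the index already expects; events without an `error` key pass through unchanged.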