[00:58:01] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes, 10Moderator-Tools-Team (Kanban): 'Highlight likely problem edits' preference doesn't select any filters in mobile web - https://phabricator.wikimedia.org/T318683 (10Etonkovidova) The issue is confirmed. Steps to reproduce: - on...
[08:09:39] hello folks
[08:13:23] the pod issue that I mentioned yesterday in ml-serve-codfw seems related to not enough quota to recycle pods
[08:13:27] I am going to raise it a little
[08:31:19] doesn't completely work, some pods are still duplicated
[08:31:32] so I am going through the knative revisions to delete them manually
[08:37:11] ok cleaned up
[08:37:18] this is another knative bug/weird-corner-case
[08:37:19] sigh
[08:47:30] Morning!
[08:47:59] It's a bit odd that it seems to happen infrequently. I'd expect quota for that to not be so laser-thin, especially since you raised it
[08:51:57] there were errors in the kube events related to quotas being reached, this is why I raised it a little
[08:52:22] but I think that the combination of our version of knative + limits may trigger some weird behavior
[08:55:36] The meta-annoyance is that since we are on ancient versions, getting upstream to help is less than likely
[08:57:59] understandably so
[08:59:11] this is why I was saying that we should think about dedicating half of our SRE resources to the k8s 1.23 working group from now on
[09:10:20] Agreed
[09:21:16] TIL https://istio.io/latest/docs/reference/commands/istioctl/#istioctl-proxy-config-bootstrap
[09:21:54] That will come in handy
[09:22:23] I am testing various settings to disable zipkin's dns queries
[09:22:28] and IIUC we have to
[09:22:35] 1) apply the new config with istioctl
[09:22:39] 2) kill a pod
[09:22:51] 3) run the pc bootstrap command above and check the config
[09:23:00] if it doesn't contain zipkin's cluster, then we are good
[09:23:05] (to roll restart the other pods)
[09:23:08] ack
[09:23:30] I expected something more
streamlined to be honest..
[09:23:33] Should make for faster turnaround when experimenting than going through the whole helm dance for everything
[09:24:10] I guess it's not something that has to be done that often, so it's one of those "Yeah it sucks doing it this way, but once I got my cluster fixed, I stopped caring." problems
[09:24:26] we can't even use helm for this use case, only istioctl
[09:26:40] Weird that this aspect isn't exposed further up, yeah
[09:35:42] I had to create https://gerrit.wikimedia.org/r/836734
[09:36:14] due to how the helm manifests (inside istioctl) are working, this seems to be the only way to disable any tracing
[09:36:22] I'll LGTM in a moment. Note the extra space at the end of line 58.
[09:36:22] including the default zipkin dns queries
[09:36:50] fixed
[09:37:15] I don't like the fact that we have to set something not-zipkin to be able to disable zipkin
[09:37:39] Yeah, it's a sign modules are not as modular as they should be. (Or maybe I am misunderstanding some interdependencies)
[09:38:34] I mean, I get that "Disable Zipkin" -> "No tracing". But there should be a "Zipkin: off"-like config option, instead of having to disable something else and automagically getting rid of Zipkin.
[09:39:05] yep yep
[09:50:00] mmm I killed some pods on ml-serve-codfw but I don't see a decrease in zipkin's queries
[09:50:00] better: I don't see a decrease in overall dns queries
[09:50:00] I've seen pod restarts bump up DNS queries, maybe it needs some time to settle? How long ago did you do the kills?
[09:50:00] some minutes ago
[09:50:49] yeah, it should be visible by now.
[09:51:18] What exact queries does Zipkin do? Are they distinguishable from the others? Also, what does the bootstrap config command say?
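The three-step check described above can be sketched as commands. This is a sketch, not the exact invocations used: the pod names and the config file path are hypothetical, and the last step is the `istioctl proxy-config bootstrap` command linked at 09:21:16.

```shell
# 1) apply the new mesh config with istioctl (file path hypothetical)
istioctl install -y -f mesh-config.yaml

# 2) kill one pod so it comes back with a freshly generated Envoy bootstrap
kubectl -n istio-system delete pod cluster-local-gateway-abc123

# 3) dump the new pod's bootstrap and check that no zipkin cluster remains
istioctl proxy-config bootstrap cluster-local-gateway-def456 -n istio-system \
  | grep -i zipkin || echo "no zipkin cluster - safe to roll-restart the rest"
```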
[09:54:32] it is not zipkin doing the queries, but envoy trying to see if the record changed
[09:54:45] the usual list of search domains for the zipkin svc
[09:54:50] ah, right
[09:54:55] and the bootstrap config for the new pods doesn't include zipkin
[09:55:52] does Envoy need to be informed that it shouldn't bother, maybe?
[09:56:47] in theory no, it gets configured by istio when the pod starts.. but I see from tcpdump that the new pods don't query for zipkin
[09:56:58] so maybe the traffic is not that much and not super visible
[09:57:10] most of the queries are for cluster-local-gateway.istio-system.svc.cluster.local.
[09:57:27] so if it was possible to raise its TTL to say 30s we should be good
[09:57:43] maybe via coredns, or some istio/envoy config
[09:58:04] Yeah, I think raising the TTL is the most promising next step
[10:09:19] https://coredns.io/plugins/rewrite/#ttl-field-rewrites looks nice
[10:14:59] ah! I was able to set the ttl to 30s
[10:18:02] klausman: https://grafana.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?orgId=1&var-dc=codfw%20prometheus%2Fk8s-mlserve&from=now-30m&to=now&viewPanel=6
[10:18:05] :D
[10:18:08] 5s -> 30s
[10:21:59] rewrite continue {
[10:21:59]     ttl exact cluster-local-gateway.istio-system.svc.cluster.local. 30
[10:21:59] }
[10:22:14] ooh, we can do it per-record, that is fantastic
[10:22:25] yes and also via regex etc..
[10:22:27] super handy
[10:22:48] And an almost 4x reduction. Very nice work!
[10:23:51] NXDOMAIN also went down a bit, as expected
[10:24:09] Do you think there are other A/AAAA records that might benefit us?
[10:25:34] yes definitely
[10:25:50] going afk for lunch!
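The TTL rewrite pasted above can be verified from inside the cluster. A sketch, assuming a pod whose image ships `dig` (the pod name here is hypothetical):

```shell
# verify the rewritten TTL as seen by a pod in the cluster
kubectl -n istio-system exec cluster-local-gateway-abc123 -- \
  dig +noall +answer cluster-local-gateway.istio-system.svc.cluster.local
# the TTL column of the answer should now read 30 rather than 5
```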
[10:28:32] \o Will do so as well in a bit
[13:33:14] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/836811, let's see what serviceops thinks about it
[13:33:22] it is not great that we apply these fixes
[13:33:29] but I don't see an alternative path
[13:48:21] 10Lift-Wing, 10Machine-Learning-Team: Align ORES prediction output with Lift Wing's one (for revscoring models) - https://phabricator.wikimedia.org/T318932 (10elukey)
[13:50:12] 10Lift-Wing, 10Machine-Learning-Team: Align ORES prediction output with Lift Wing's one (for revscoring models) - https://phabricator.wikimedia.org/T318932 (10elukey)
[13:50:16] 10Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (10elukey)
[13:50:20] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (10elukey)
[13:50:54] 10Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (10elukey)
[13:50:58] 10Machine-Learning-Team, 10Data Engineering Planning, 10Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (10elukey)
[14:05:23] (03PS1) 10Thiemo Kreuz (WMDE): [WIP] Various unfinished edits from my local dev environment [extensions/ORES] - 10https://gerrit.wikimedia.org/r/836827
[14:07:18] (03CR) 10CI reject: [V: 04-1] [WIP] Various unfinished edits from my local dev environment [extensions/ORES] - 10https://gerrit.wikimedia.org/r/836827 (owner: 10Thiemo Kreuz (WMDE))
[14:28:53] elukey: to quote a coworker of mine many years ago: "I approve of the intent of your patch and am saddened by its necessity."
[14:44:22] let's see what others think :)
[14:48:18] klausman: I am reviewing lift wing tasks, do you have time for https://phabricator.wikimedia.org/T300259 during the next days/weeks?
[14:48:40] I hope so :-/
[14:49:09] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (10elukey)
[14:50:17] klausman: otherwise I can pick it up, as you prefer
[14:50:46] I will try and make at least some initial headway.
[14:50:56] If I find I can't, I will let you know
[14:52:48] sure sure, no pressure, I was just reviewing the assigned tasks
[15:01:26] aiko: https://www.featurestoresummit.com/fss-2022/agenda-2022
[15:01:40] it is in pacific time afaics, but they'll publish recordings
[15:04:38] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (10elukey) 05Open→03Resolved
[15:04:40] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (10elukey)
[15:05:02] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Explore ingress filtering for Lift Wing - https://phabricator.wikimedia.org/T300259 (10elukey) a:05klausman→03None Removing assignee so we can prioritize/schedule the task during the next grooming.
[15:09:11] 10Lift-Wing, 10Machine-Learning-Team: Support (or not) the ORES batch scoring in Lift Wing - https://phabricator.wikimedia.org/T306986 (10elukey) 05Open→03Declined For the moment, let's decline it. We can restart the work if there will be the need in the future.
[15:09:13] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (10elukey)
[15:11:02] so in theory I see 4 "blocking" tasks before LiftWing as MVP
[15:11:30] lemme know if you see more, but it looks relatively close
[15:17:00] I have some jumbled notes here, I hope to integrate them into the doc tomorrow.
If I think something is egregiously missing, I will ping you
[15:26:34] ah lovely, I am tracking down the remaining traffic to coredns pods
[15:26:44] and most of it seems to be related to queries with search domains
[15:26:52] from a few pods, namely the knative ones
[15:27:09] the autoscaler constantly checks dns domains but it doesn't have any envoy
[15:27:21] now, how to change ndots in there?
[15:27:42] Hmmm. Does the container even have a custom resolv.conf?
[15:30:47] it is set by the Pod resource
[15:30:52] by default ndots: 5
[15:31:11] does the knative-serving chart have a Pod resource? Not sure, but so far it seems not
[15:32:14] * elukey bbiab
[15:33:10] I can only find pod affinity settings, so I suspect it doesn't have a pod resource
[16:14:57] klausman: ah found it! The deployment resource can do it as well
[16:17:29] dns queries decreasing a bit
[16:19:06] neat!
[16:20:49] and the zipkin settings will be picked up during the next deployment (so that we don't have to kill all pods again etc..)
[16:21:21] going afk for today! have a nice rest of the day folks
[16:21:29] \o
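For context on the ndots:5 discussion above, here is a toy illustration (no cluster needed) of why a high ndots multiplies DNS queries: a name with fewer dots than ndots is tried against every search domain before being queried as-is. The search list below mimics a typical pod resolv.conf; the namespace in it is hypothetical.

```shell
# Name the knative pods keep re-resolving: 4 dots, which is < ndots=5,
# so the resolver walks the whole search list before the absolute query.
name="cluster-local-gateway.istio-system.svc.cluster.local"
ndots=5
search="knative-serving.svc.cluster.local svc.cluster.local cluster.local"

dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
queries=0
if [ "$dots" -lt "$ndots" ]; then
  for domain in $search; do
    echo "try: $name.$domain"   # each of these typically returns NXDOMAIN
    queries=$((queries + 1))
  done
fi
echo "try: $name."              # the absolute name is what finally resolves
queries=$((queries + 1))
echo "total queries for one lookup: $queries"
```

Lowering ndots (e.g. via `dnsConfig.options` in the Deployment's pod template, the mechanism found at 16:14:57) makes the resolver skip the search-list expansion for such names.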