[00:58:01] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10MediaWiki-Recent-changes, 10Moderator-Tools-Team (Kanban): 'Highlight likely problem edits' preference doesn't select any filters in mobile web - https://phabricator.wikimedia.org/T318683 (10Etonkovidova) The issue is confirmed. Steps to reproduce: - on...
[08:09:39] hello folks
[08:13:23] the pod issue that I mentioned yesterday in ml-serve-codfw seems related to not enough quota to recycle pods
[08:13:27] I am going to raise it a little
[08:31:19] doesn't completely work, some pods are still duplicated
[08:31:32] so I am going through the knative revisions to delete them manually
[08:37:11] ok cleaned up
[08:37:18] this is another knative bug/weird-corner-case
[08:37:19] sigh
[08:47:30] Morning!
[08:47:59] It's a bit odd that it seems to happen infrequently. I'd expect quota for that to not be so laser-thin, especially since you raised it
[08:51:57] there were errors in the kube events related to quotas being reached, this is why I raised it a little
[08:52:22] but I think that the combination of our version of knative + limits may trigger some weird behavior
[08:55:36] The meta-annoyance is that since we are on ancient versions, getting upstream to help is less than likely
[08:57:59] understandably so
[08:59:11] this is why I was saying that we should think about dedicating half of our SRE resources to the k8s 1.23 working group from now on
[09:10:20] Agreed
[09:21:16] TIL https://istio.io/latest/docs/reference/commands/istioctl/#istioctl-proxy-config-bootstrap
[09:21:54] That will come in handy
[09:22:23] I am testing various settings to disable zipkin's dns queries
[09:22:28] and IIUC we have to
[09:22:35] 1) apply the new config with istioctl
[09:22:39] 2) kill a pod
[09:22:51] 3) run the pc bootstrap command above and check the config
[09:23:00] if it doesn't contain zipkin's cluster, then we are good
[09:23:05] (to roll restart the other pods)
[09:23:08] ack
[09:23:30] I expected something more
streamlined to be honest..
[09:23:33] Should make for faster turnaround when experimenting than going through the whole helm dance for everything
[09:24:10] I guess it's not something that has to be done that often, so it's one of those "Yeah it sucks doing it this way, but once I got my cluster fixed, I stopped caring." problems
[09:24:26] we can't even use helm for this use case, only istioctl
[09:26:40] Weird that this aspect isn't exposed further up, yeah
[09:35:42] I had to create https://gerrit.wikimedia.org/r/836734
[09:36:14] due to how the helm manifests (inside istioctl) are working, this seems to be the only way to disable any tracing
[09:36:22] I'll LGTM in a moment. Note the extra space at the end of line 58.
[09:36:22] including the default zipkin dns queries
[09:36:50] fixed
[09:37:15] I don't like the fact that we have to set something not-zipkin to be able to disable zipkin
[09:37:39] Yeah, it's a sign modules are not as modular as they should be. (Or maybe I am misunderstanding some interdependencies)
[09:38:34] I mean, I get that "Disable Zipkin" -> "No tracing". But there should be a "Zipkin: off"-like config option, instead of having to disable something else and automagically getting rid of Zipkin.
[09:39:05] yep yep
[09:50:00] mmm I killed some pods on ml-serve-codfw but I don't see a decrease in zipkin's queries
[09:50:00] better: I don't see a decrease in overall dns queries
[09:50:00] I've seen pod restarts bump up DNS queries, maybe it needs some time to settle? How long ago did you do the kills?
[09:50:00] some minutes ago
[09:50:49] yeah, it should be visible by now.
[09:51:18] What exact queries does Zipkin do? Are they distinguishable from the others? Also, what does the bootstrap config command say?
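The three-step check described above can be sketched as commands. This is a sketch, not the exact invocations used: the pod names and the config file path are hypothetical, and the last step is the `istioctl proxy-config bootstrap` command linked at 09:21:16.

```shell
# 1) apply the new mesh config with istioctl (file path hypothetical)
istioctl install -y -f mesh-config.yaml

# 2) kill one pod so it comes back with a freshly generated Envoy bootstrap
kubectl -n istio-system delete pod cluster-local-gateway-abc123

# 3) dump the new pod's bootstrap and check that no zipkin cluster remains
istioctl proxy-config bootstrap cluster-local-gateway-def456 -n istio-system \
  | grep -i zipkin || echo "no zipkin cluster - safe to roll-restart the rest"
```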
[09:54:32] it is not zipkin doing the queries, but envoy trying to see if the record changed
[09:54:45] the usual list of search domains for the zipkin svc
[09:54:50] ah, right
[09:54:55] and the bootstrap config for the new pods doesn't include zipkin
[09:55:52] does Envoy need to be informed that it shouldn't bother, maybe?
[09:56:47] in theory no, it gets configured by istio when the pod starts.. but I see from tcpdump that the new pods don't query for zipkin
[09:56:58] so maybe the traffic is not that much and not super visible
[09:57:10] most of the queries are for cluster-local-gateway.istio-system.svc.cluster.local.
[09:57:27] so if it was possible to raise its TTL to say 30s we should be good
[09:57:43] maybe via coredns, or some istio/envoy config
[09:58:04] Yeah, I think raising the TTL is the most promising next step
[10:09:19] https://coredns.io/plugins/rewrite/#ttl-field-rewrites looks nice
[10:14:59] ah! I was able to set the ttl to 30s
[10:18:02] klausman: https://grafana.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?orgId=1&var-dc=codfw%20prometheus%2Fk8s-mlserve&from=now-30m&to=now&viewPanel=6
[10:18:05] :D
[10:18:08] 5s -> 30s
[10:21:59] rewrite continue {
[10:21:59]     ttl exact cluster-local-gateway.istio-system.svc.cluster.local. 30
[10:21:59] }
[10:22:14] ooh, we can do it per-record, that is fantastic
[10:22:25] yes and also via regex etc..
[10:22:27] super handy
[10:22:48] And an almost 4x reduction. Very nice work!
[10:23:51] NXDOMAIN also went down a bit, as expected
[10:24:09] Do you think there are other A/AAAA records that might benefit us?
[10:25:34] yes definitely
[10:25:50] going afk for lunch!
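The TTL rewrite pasted above can be verified from inside the cluster. A sketch, assuming a pod whose image ships `dig` (the pod name here is hypothetical):

```shell
# verify the rewritten TTL as seen by a pod in the cluster
kubectl -n istio-system exec cluster-local-gateway-abc123 -- \
  dig +noall +answer cluster-local-gateway.istio-system.svc.cluster.local
# the TTL column of the answer should now read 30 rather than 5
```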
[10:28:32] \o Will do so as well in a bit
[13:33:14] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/836811, let's see what serviceops thinks about it
[13:33:22] it is not great that we apply these fixes
[13:33:29] but I don't see an alternative path
[13:48:21] 10Lift-Wing, 10Machine-Learning-Team: Align ORES prediction output with Lift Wing's one (for revscoring models) - https://phabricator.wikimedia.org/T318932 (10elukey)
[13:50:12] 10Lift-Wing, 10Machine-Learning-Team: Align ORES prediction output with Lift Wing's one (for revscoring models) - https://phabricator.wikimedia.org/T318932 (10elukey)
[13:50:16] 10Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (10elukey)
[13:50:20] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (10elukey)
[13:50:54] 10Machine-Learning-Team: Migrate ORES clients to LiftWing - https://phabricator.wikimedia.org/T312518 (10elukey)
[13:50:58] 10Machine-Learning-Team, 10Data Engineering Planning, 10Research: Proposal: deprecate the mediawiki.revision-score stream in favour of more streams like mediawiki-revision-score- - https://phabricator.wikimedia.org/T317768 (10elukey)
[14:05:23] (03PS1) 10Thiemo Kreuz (WMDE): [WIP] Various unfinished edits from my local dev environment [extensions/ORES] - 10https://gerrit.wikimedia.org/r/836827
[14:07:18] (03CR) 10CI reject: [V: 04-1] [WIP] Various unfinished edits from my local dev environment [extensions/ORES] - 10https://gerrit.wikimedia.org/r/836827 (owner: 10Thiemo Kreuz (WMDE))
[14:28:53] elukey: to quote a coworker of mine many years ago: "I approve of the intent of your patch and am saddened by its necessity."
[14:44:22] let's see what others think :)
[14:48:18] klausman: I am reviewing lift wing tasks, do you have time for https://phabricator.wikimedia.org/T300259 during the next days/weeks?
[14:48:40] I hope so :-/
[14:49:09] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (10elukey)
[14:50:17] klausman: otherwise I can pick it up, as you prefer
[14:50:46] I will try and make at least some initial headway.
[14:50:56] If I find I can't, I will let you know
[14:52:48] sure sure, no pressure, I was just reviewing the assigned tasks
[15:01:26] aiko: https://www.featurestoresummit.com/fss-2022/agenda-2022
[15:01:40] it is in pacific time afaics, but they'll publish recordings
[15:04:38] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (10elukey) 05Open→03Resolved
[15:04:40] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Load test the Lift Wing cluster - https://phabricator.wikimedia.org/T296173 (10elukey)
[15:05:02] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Explore ingress filtering for Lift Wing - https://phabricator.wikimedia.org/T300259 (10elukey) a:05klausman→03None Removing assignee so we can prioritize/schedule the task during the next grooming.
[15:09:11] 10Lift-Wing, 10Machine-Learning-Team: Support (or not) the ORES batch scoring in Lift Wing - https://phabricator.wikimedia.org/T306986 (10elukey) 05Open→03Declined For the moment, let's decline it. We can restart the work if there will be the need in the future.
[15:09:13] 10Lift-Wing, 10Epic, 10Machine-Learning-Team (Active Tasks): Lift Wing proof of concept - https://phabricator.wikimedia.org/T272917 (10elukey)
[15:11:02] so in theory I see 4 "blocking" tasks before LiftWing as MVP
[15:11:30] lemme know if you see more, but it looks relatively close
[15:17:00] I have some jumbled notes here, I hope to integrate them into the doc tomorrow.
If I think something is egregiously missing, I will ping you
[15:26:34] ah lovely, I am tracking down the remaining traffic to coredns pods
[15:26:44] and most of it seems to be related to queries with search domains
[15:26:52] from a few pods, namely the knative ones
[15:27:09] the autoscaler constantly checks dns domains but it doesn't have any envoy
[15:27:21] now, how to change ndots in there?
[15:27:42] Hmmm. Does the container even have a custom resolv.conf?
[15:30:47] it is set by the Pod resource
[15:30:52] by default ndots: 5
[15:31:11] does the knative-serving chart have a Pod resource? Not sure, but so far it seems not
[15:32:14] * elukey bbiab
[15:33:10] I can only find pod affinity settings, so I suspect it doesn't have a pod resource
[16:14:57] klausman: ah found it! The deployment resource can do it as well
[16:17:29] dns queries decreasing a bit
[16:19:06] neat!
[16:20:49] and the zipkin settings will be picked up during the next deployment (so that we don't have to kill all pods again etc..)
[16:21:21] going afk for today! have a nice rest of the day folks
[16:21:29] \o
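For context on the ndots:5 discussion above, here is a toy illustration (no cluster needed) of why a high ndots multiplies DNS queries: a name with fewer dots than ndots is tried against every search domain before being queried as-is. The search list below mimics a typical pod resolv.conf; the namespace in it is hypothetical.

```shell
# Name the knative pods keep re-resolving: 4 dots, which is < ndots=5,
# so the resolver walks the whole search list before the absolute query.
name="cluster-local-gateway.istio-system.svc.cluster.local"
ndots=5
search="knative-serving.svc.cluster.local svc.cluster.local cluster.local"

dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
queries=0
if [ "$dots" -lt "$ndots" ]; then
  for domain in $search; do
    echo "try: $name.$domain"   # each of these typically returns NXDOMAIN
    queries=$((queries + 1))
  done
fi
echo "try: $name."              # the absolute name is what finally resolves
queries=$((queries + 1))
echo "total queries for one lookup: $queries"
```

Lowering ndots (e.g. via `dnsConfig.options` in the Deployment's pod template, the mechanism found at 16:14:57) makes the resolver skip the search-list expansion for such names.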