[07:03:58] hello hello
[07:31:41] \o
[07:31:51] elukey: they found my luggage!
[07:32:00] elukey: for unclear reasons, it's in Berlin
[07:32:16] klausman: o/ whatttttt????
[07:33:11] when are they going to ship it back?
[07:33:34] It's flying back to ZRH today, so I expect to get it tomorrow or maybe Wednesday
[07:33:55] sigh, but at least it is coming back to you
[07:34:14] The whole process is stupid: only after a week are you allowed to file a report of the baggage contents. I did that yesterday. First thing today, they've found it.
[07:34:28] Surely that is a coincidence ;)
[07:34:41] really crazy
[07:34:54] But yeah, at least I get it back. Maybe my contents filing helped. Stuf'ss not cheap :-P
[07:35:05] Stuff's*
[07:35:19] fingers crossed then
[07:35:57] I am trying to rebuild knative with a patch on top that, in theory, should allow it to accept DNS configs and tune resolv.conf files in pods
[07:36:06] yepyep. Of course a friend has already mentioned to not celebrate yet, since it might be damaged and/or missing stuff.
[07:36:18] Oooh, can I see the patch?
[07:36:22] desperate attempt; if it doesn't work we have finally got to a stage in which we have a serious bug and we'll have to update knative
[07:36:35] yeah I added you to a code change
[07:36:51] Ah, I need to relogin to a million services :D
[07:36:53] I backported it to our version; in knative 1.5.0 they allow modifying the pod's DNS config
[07:38:01] basically when we add `dnsConfig` to an InferenceService resource, the config is validated by the kserve controller/webhook, which says "yes!". Then the controller calls knative to set up a new revision, which in turn says "nope, can't support dnsConfig"
[07:38:15] I see.
[07:38:23] very sad
[07:38:25] The only change I see is the updated package version.
[07:38:44] i.e. 835070
[07:38:46] klausman: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/834553
[07:39:04] ah, it was merged so the dashboard hid it
[07:39:07] I merged it earlier on, wanted to test this morning (didn't know if you were online)
[07:39:35] LGTM'd the puppet side
[07:39:47] the patch is very simple, hopefully it will work
[07:40:00] https://github.com/knative/serving/pull/12897
[07:40:46] we probably need to set `kubernetes.podspec-dnsconfig` too
[07:40:52] I'll test it in staging
[07:40:55] Nice that it applied so cleanly
[07:41:22] But then again, it's just passing through two values, really.
[07:41:48] I had to adapt it to our code, and ran unit tests to check basic sanity
[07:42:06] but it was very simple, in theory it should just set values indeed
[07:42:17] a more complicated one would be un-backportable for sure
[07:42:26] If you ever want help with backporting Go patches, I'm your man :)
[07:42:27] we really need to go to k8s 1.23
[07:42:38] and agreed, re: 1.23
[07:43:12] I also +1'd the docker-images change, for the archives.
[07:43:20] thanks!
[07:43:58] I think it is all connected; we have high latencies sometimes after deploys, etc. It should all be coredns-related
[07:46:17] but the ndots trick is only going to attenuate the issue
[07:46:30] we probably need something more, sigh
[07:47:18] yeah, it's a bit silly that the DNS timeouts are so short, as well. I get that it's an availability thing, but the issue screams to me that DNS is the wrong tool for the job.
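(Editor's note: for context on the pass-through discussed above, a minimal sketch, assuming the upstream shape of the knative change, of the value that an InferenceService's `dnsConfig` would carry into the revision's pod spec once the backport and the `kubernetes.podspec-dnsconfig` flag are in place. The types are from the Kubernetes core/v1 Go API; this is not the actual backported code.)

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// A dnsConfig like the one discussed above boils down to a core/v1
	// PodDNSConfig that kserve hands to knative, which (with the backport
	// plus the feature flag) copies it onto the revision's pod spec.
	ndots := "3"
	dnsConfig := &corev1.PodDNSConfig{
		Options: []corev1.PodDNSConfigOption{
			{Name: "ndots", Value: &ndots},
		},
	}
	fmt.Printf("option passed through to the pod spec: %s=%s\n",
		dnsConfig.Options[0].Name, *dnsConfig.Options[0].Value)
}
```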
[07:48:09] 5s TTL is really aggressive
[07:48:46] AIUI, it's so that a set of pods going down and being rescheduled is detected quickly by those that try to use them
[07:49:16] But with a 5s TTL, I'd expect coredns to be a lot better at handling that kind of load.
[07:49:43] Janis also told me that coredns is not shipped with k8s, and we have a relatively old version
[07:49:52] so maybe upgrading it could help as well
[07:49:52] Of coredns?
[07:49:57] yes
[07:50:06] it is another variable in the upgrade mess
[07:50:14] Hmm, interesting. I sorta thought that coredns was, well, a core part of k8s
[07:50:28] Do you know if there are any commonly-used alternatives?
[07:50:35] https://gerrit.wikimedia.org/r/admin/repos/operations/debs/coredns
[07:50:57] no idea
[07:51:16] Ah, so at least it's a CNCF project
[07:51:40] we have 1.5.2
[07:51:52] last release is 1.9.3
[07:52:00] oh wow
[07:52:34] the thing that I noticed from the kubernetes pod dashboard is that a single coredns pod is hammered, and the others are relatively less busy
[07:52:59] Hmm. Wonder if that is yet another k8s thing that was fixed later
[07:53:16] Does CoreDNS depend on minimum k8s versions?
[07:54:11] I know you need at least 1.9 to use coredns _at all_ for service discovery
[07:54:17] I found https://github.com/coredns/deployment/blob/master/kubernetes/CoreDNS-k8s_version.md
[07:54:30] ah nice
[07:58:02] coredns-5cd6d7449f-hgdgf 1/1 Running 22 25d
[07:58:02] coredns-5cd6d7449f-mp5bz 1/1 Running 21 25d
[07:58:02] coredns-5cd6d7449f-mqgtq 1/1 Running 25 25d
[07:58:02] coredns-5cd6d7449f-rhk6q 1/1 Running 24 25d
[07:58:13] the fourth col is restarts :D
[07:58:43] so one restart a day, roughly
[07:59:22] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10PageTriage: Possible ORES outage for PageCuration tags "vandalism", "spam", "attack" - https://phabricator.wikimedia.org/T312843 (10MPGuy2824) 05Open→03Invalid Closing, since this seemed to be a false alarm. We also had the [[https://en.wikipedia.org/wi...
[08:00:05] For k8s 1.16, the "default" coredns would be 1.6.2. Might be worth it to just use that, see if anything changes. AFAICT, all 1.6.x would be ok, since there are no deprecations until 1.7, and we likely could go further, if careful. Or try 1.9.3 directly, see what explodes :)
[08:01:02] One downside I see for 1.7.0 is that they changed a lot of Prom metrics names. So that would be some Grafana-side work
[08:01:34] The latest coredns used in k8s (default for v1.24) is 1.8.6
[08:02:05] https://thanos.wikimedia.org/graph?g0.expr=irate(coredns_dns_request_count_total%7Bprometheus%3D%22k8s-mlserve%22%2C%20site%3D%22codfw%22%7D%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[08:02:14] the behavior of rps to single pods is very strange
[08:05:30] It looks like they only ever use one IP for hours on end.
[08:06:14] I _think_ most DNS client implementations respect the "resolv.conf" order of servers listed, for historical reasons; that would explain (some of) this
[08:06:36] So they'd always use the first entry in the equivalent config for k8s pods.
[08:07:05] Which is not ideal, as we can see. I wonder if later k8s versions shuffle the file/do something different for discovery to avoid just that
[08:07:45] about the only other fix for that (if that _is_ the problem) would be a single-IP LVS-like setup
[08:07:45] we have the service ip in the resolv.conf though
[08:07:56] yeah. Weird.
[08:08:24] also 14k qps is insane
[08:08:37] (or am I reading that wrong?)
[08:09:02] yeah it is insane
[08:09:11] `sudo nsenter -t 2921346 -n tcpdump udp port 53` on ml-serve2004 to see the stream
[08:09:41] there are a ton of nxdomains
[08:09:54] so ndots:2 should really help
[08:10:18] Yeah, but we're just buying more runway.
[08:10:59] Oh wow, and that stream is already heavily clamped by the kernel
[08:11:09] (for tcpdump)
[08:11:28] 10k captured, ~300k dropped.
[08:29:12] still waiting for the last couple of docker images to be upgraded
[08:29:18] err published
[08:33:22] done! :)
[08:33:39] "Complaining more makes things go faster." :D
[08:34:40] will start from staging
[08:34:46] Remind me, was the DNS storm also what made us have a whole mountain of logs in logstash?
[08:35:16] not sure, I have to dig a little deeper on that, IIRC it was mostly knative spamming
[08:53:17] options ndots:3
[08:53:19] yesssssss
[08:53:21] \o/
[08:53:23] * elukey dances
[08:57:11] ok so basic testing worked in staging, going to add the support to the knative chart + basic settings for all the pods
[09:27:04] So in summary: we can reduce ndots, but we don't know yet how much of a difference it'll make?
[09:28:44] One thing I see is that in staging, the rate of DNS queries is much more evenly distributed than in prod
[09:28:52] yep
[09:29:01] But it's also very noisy, so that may be a red herring
[09:29:24] so setting something like ndots:2 or 3 should avoid all the queries with search domains, of which there are 4 IIRC
[09:29:43] I'd expect, once rolled out, to see 1/4th of the actual dns requests
[09:29:50] Ack. I think we should try 3, which is less likely to break, and may buy us enough runway
[09:30:20] I thought the same, but then we have domains like api-ro.discovery.wmnet
[09:30:33] that are configured in a lot of places
[09:32:13] Yeah, ndots:2 is likely too aggressive on the other end of that spectrum.
[09:32:32] why?
[09:32:53] ndots is a "less than" thing: a threshold for the number of dots which must appear in a name before an initial absolute query will be made. The default for n is 1, meaning that if there are any dots in a name, the name will be tried first as an absolute name before any search list elements are appended to it.
[09:33:14] I know yes
[09:33:29] I am not sure about what the typical query we make is, but api-ro.discovery.wmnet has two dots, so already <3
[09:33:59] ohwait, I got that logic wrong
[09:34:09] IIUC ndots:3 means that there must be 3 dots in the domain
[09:34:32] That text up there is straight from the manpage
[09:35:21] so for ndots:3, a two-dot name will go through the search domains approach
[09:36:12] (note that using anchored DNS names like api-ro.discovery.wmnet. (trailing dot) would prevent the whole thing as well, no matter how many internal dots)
[09:36:27] (but I feel that that is much more brittle and may be beyond our powers)
[09:36:31] exactly, but since we have some two-dot names configured in all istio-proxies (discovery records), we may still see a lot of queries
[09:36:36] with ndots 3 I mean
[09:36:44] Agreed. I was initially confused
[09:37:06] I'm game for either 3 or 2
[09:37:18] I've read that istio may be confused by canonical domains (ending with a trailing .)
[09:37:31] Ah, there is that, too, then.
[09:38:03] ok let's use 3 for now, and see where we go. Maybe canonical domains afterwards, and ndots:2 as final step
[09:38:07] does that sound good?
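(Editor's note: to make the ndots back-and-forth above concrete, here is a small standalone sketch — not the real glibc resolver code — of which queries a stub resolver tries for a two-dot name like api-ro.discovery.wmnet under different ndots values, using the kind of four-entry search list a pod gets, as shown later in the pasted resolv.conf. Names with fewer dots than ndots walk the whole search list, which is where the NXDOMAIN storm comes from.)

```go
package main

import (
	"fmt"
	"strings"
)

// expansion mimics the resolv.conf(5) rules quoted above: a name with fewer
// than ndots dots is tried with each search domain appended before the
// absolute name; a name with at least ndots dots is tried absolute first.
// Anchored names (trailing dot) skip the search list entirely.
func expansion(name string, ndots int, search []string) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}
	var tries []string
	absoluteFirst := strings.Count(name, ".") >= ndots
	if absoluteFirst {
		tries = append(tries, name+".")
	}
	for _, dom := range search {
		tries = append(tries, name+"."+dom+".")
	}
	if !absoluteFirst {
		tries = append(tries, name+".")
	}
	return tries
}

func main() {
	// Search list of the shape a pod in the revscoring-articlequality
	// namespace ends up with.
	search := []string{
		"revscoring-articlequality.svc.cluster.local",
		"svc.cluster.local",
		"cluster.local",
		"codfw.wmnet",
	}
	// api-ro.discovery.wmnet has two dots: with ndots:5 or ndots:3 it still
	// generates four doomed-to-NXDOMAIN lookups before the absolute query;
	// only ndots:2 (or an anchored trailing-dot name) avoids them.
	for _, n := range []int{5, 3, 2} {
		fmt.Printf("ndots:%d -> %v\n", n, expansion("api-ro.discovery.wmnet", n, search))
	}
}
```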
[09:38:16] I love that the very first search result I get on Google for `ndots` is a page explaining the "ndots:5 may hit your performance on k8s" problem
[09:39:04] Page by a guy named Marco Pracucci. Those clever Italians!
[09:42:26] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/835086 and next for the knative setting
[09:48:24] One nit on that change
[10:07:22] answered :)
[10:09:37] lgtm'd
[10:11:59] thanks!
[10:12:01] also trying https://istio.io/latest/docs/ops/configuration/traffic-management/dns-proxy/#getting-started
[10:13:15] ahahha https://grafana.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?orgId=1&var-dc=codfw%20prometheus%2Fk8s-mlserve
[10:13:18] klausman: --^
[10:13:40] Wouldya look at that
[10:14:42] I think DNS proxying is especially helpful once a lot of things talk to each other on the same machine
[10:15:15] https://thanos.wikimedia.org/graph?g0.expr=irate(coredns_dns_request_count_total%7Bprometheus%3D%22k8s-mlserve%22%2C%20site%3D%22codfw%22%7D%5B5m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D also looks a lot better
[10:15:24] _including_ better distribution
[10:16:16] still some awfully high peaks in there, but let's give it a few hours before we judge :)
[10:16:55] https://grafana.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?orgId=1&var-dc=codfw%20prometheus%2Fk8s-mlserve&viewPanel=36
[10:17:08] most of them are NXDomain answers, so I think that lowering ndots will help
[10:17:20] Ack.
[10:17:32] That now should be a relatively simple change we can experiment with
[10:19:14] sadly the nxdomain queries got back to their prev value, sigh
[10:19:51] Weird.
[10:20:25] yeah, on the Thanos page, the queries now also all land on a different instance than before. So much for better distribution :-/
[10:20:46] reverted the change..
[10:22:28] Should we try ndots:2 after lunch?
[10:24:56] 10Machine-Learning-Team, 10MediaWiki-extensions-ORES, 10Edit-Review-Improvements-RC-Page, 10Growth-Team, and 2 others: Expose ORES topics in recent changes filters - https://phabricator.wikimedia.org/T245906 (10Aklapper) a:05Chtnnh→03None Removing task assignee due to inactivity as this open task has b...
[10:25:08] yep yep
[10:25:33] My idea is to add the field to the inferenceservice resource config, in the kserve-inference chart
[10:25:57] after that we should be able to roll it out reliably across all pods (staging first etc...)
[10:26:09] lunch! ttl :)
[10:26:14] thanks for the brainbounce!
[10:26:40] 10Machine-Learning-Team, 10ORES, 10Growth-Team, 10Growth-Team-Filtering, 10PageTriage: Add ORES topic prediction to the NewPagesFeed - https://phabricator.wikimedia.org/T218132 (10Aklapper) a:05Chtnnh→03None Removing task assignee due to inactivity as this open task has been assigned for more than tw...
[10:26:47] np! Getting some Foccacia to celebrate :)
[10:26:55] Focaccia*
[12:55:08] Morning all!
[12:56:14] Sorry you lost your luggage Klausman
[12:56:29] Well, it's on the way to me now :)
[12:56:48] But one week of depending on the clothes of others was... interesting
[12:57:06] The ultimate team building activity
[13:00:49] Fortunately, we didn't have to extend it to underwear :)
[13:18:55] Did that meeting with Meta ever get scheduled? Enterprise offered us their support with AWS but we need to figure out what the “is” there is.
[13:22:40] Sorry, no.
But that's on my plate this week
[13:23:04] (I figured out on the Thu before the summit that I'd been using the wrong email address for Pau the whole time :(
[13:34:28] chrisalbon: o/
[13:39:19] Hey elukey!
[13:41:13] klausman: Cool, Tajh is interested in your review doc (cost, security, etc)
[13:51:30] there are other teams that have some interest in using public clouds, but mostly for public stuff
[13:51:52] another use case that I discussed with Research is to host public datasets to publish on S3
[13:52:40] for example, the embeddings from Commons that at the moment are on HDFS
[13:53:11] they need to be published to allow the research community to use them etc., so we may use S3 instead of our own infra
[13:53:14] etc..
[14:12:55] Like, AWS S3?
[14:13:19] Not like some on-prem file based data storage?
[14:14:21] exactly yes, it is an idea
[14:14:42] but things need to go through a specific approval from SRE/Security etc..
[14:14:49] Innnteresting. Okay, let's talk about it.
[14:15:09] Tomorrow
[14:15:12] Or we'd
[14:15:15] Weds.
[15:02:26] Definitely Wednesday because tomorrow is the tech dept meeting
[15:32:15] it took me a bit but this is the change for the ndots: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/835186
[15:32:20] klausman: --^ (if you have a min)
[15:32:33] on it
[15:36:26] One clarification question
[15:39:04] yep saw it thanks, answered
[15:42:12] in theory the only ndots:2 values should be on fixtures
[15:43:39] klausman: --^
[15:43:51] Roger
[15:44:23] LGTM'd
[15:44:36] thanks :)
[15:44:40] going to deploy in staging
[15:45:40] I think we should expect to see a reduction in NXDOMAIN there as well
[15:46:09] I expect it yes, fingers crossed
[15:46:53] fingers crossed!
[15:47:24] https://grafana.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?orgId=1&var-dc=codfw%20prometheus%2Fk8s-mlstaging&refresh=30s&viewPanel=36&from=now-15m&to=now for those who want to follow along :)
[15:47:42] it's me
[15:48:17] nameserver 10.194.62.3
[15:48:17] search revscoring-articlequality.svc.cluster.local svc.cluster.local cluster.local codfw.wmnet
[15:48:20] options ndots:3
[15:48:21] looks good
[15:51:03] Dropping below 400 NXDOMAIN/s in staging, which sort of was the lower limit in the last 3h
[15:51:09] Cautiously optimistic :)
[15:56:36] it seems to be working :)
[15:57:51] Yeah, the NXDOMAIN rate is still plummeting in staging
[15:58:05] Now approaching 250 and starting to flatten out a bit.
[15:58:30] at the same time, NOERROR is mildly increasing, which is also good verification that the rate drops for the right reasons :)
[16:01:16] the rps towards the coredns pods is about half of what it was now
[16:04:00] Yeah, the NXDOMAIN rate is <150 now, which is much closer to what I'd deem vaguely normal (with NOERROR being O(100))
[16:04:30] Do you want to wait 'til tomorrow with prod, or do it now?
[16:05:29] I think it is safe to test at least one namespace in ml-serve-codfw, what do you think?
[16:05:36] Sounds good.
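(Editor's note: a throwaway sketch, not part of the actual rollout tooling, of the kind of check behind the resolv.conf paste above: it reads a pod's resolv.conf and prints the nameserver, search list and ndots value, so the per-namespace rollouts can be verified the same way.)

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Default to the local resolv.conf; a copied file can be passed as an
	// argument when inspecting a pod's config from outside.
	path := "/etc/resolv.conf"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue
		}
		switch fields[0] {
		case "nameserver", "search":
			// Echo these lines as-is, like the paste above.
			fmt.Println(strings.Join(fields, " "))
		case "options":
			for _, opt := range fields[1:] {
				if strings.HasPrefix(opt, "ndots:") {
					fmt.Println("ndots:", strings.TrimPrefix(opt, "ndots:"))
				}
			}
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```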
[16:05:48] I'll be off tomorrow, you can continue the rollout if you want
[16:05:53] running that overnight will also give us useful insight into how stable this improvement is
[16:06:05] otherwise I'll pick it up on Wed
[16:06:20] Yeah, I can do that (unless the graphs tell me not to :))
[16:06:23] oh wow
[16:06:35] nice drop
[16:07:16] still no enough chrisalbon :(
[16:07:21] *not enough
[16:07:28] but we have other tests to do
[16:07:32] how much more do you need
[16:07:54] In an ideal world, NXDOMAIN is <10% of non-error qps
[16:08:13] so a lot less
[16:08:15] okay
[16:08:21] staging is a lot different from production, so we have to see in there. But basically at the moment all the pods are hammering the coredns ones every x seconds for service discovery (thanks to istio/envoy), causing latency issues
[16:08:26] the more pods, the more hammering
[16:08:49] and when dns is slow, everything slows down
[16:09:08] Getting the error rate down is reducing the overall qps as well (as every error means there will be another query), but the base request rate should also be lower than it is
[16:09:42] Still, the ndots approach looks like it will give us some breathing room as we look for further improvements.
[16:13:48] ah lovely, a rollout of knative causes all pods to be recreated
[16:13:50] sigh
[16:14:24] going to take a break and check later
[16:14:32] but it will probably time out or similar
[16:14:33] k8s was a mistake, we should just put all the models on one big computer and go get drinks
[16:14:40] definitely
[16:14:42] lol
[16:17:38] inb4 switching to DEC VMS
[16:18:22] deployment completed, but all pods in ml-serve-codfw are recreated, it will take a while..
[16:18:46] so I'll log off and check later, but the isvc deployment needs to wait until tomorrow/the day after
[16:20:18] have a good rest of the day folks!
[16:21:05] klausman: if you want to proceed tomorrow, just diff/sync the various ml-services in ml-serve-codfw (I'll do it on Wed in case you are busy, don't worry, only if you have time)
[16:21:27] Roger
[16:22:02] I am awaiting (with bated breath) the actual delivery of my luggage. Hopefully undamaged and unmolested.
[18:57:29] (quickly checked pods on ml-serve-codfw after the knative deploy, all good)