[00:08:39] Machine-Learning-Team, ORES, Growth-Team, MediaWiki-Recent-changes, Moderator-Tools-Team (Kanban): 'Highlight likely problem edits' preference doesn't select any filters in mobile web - https://phabricator.wikimedia.org/T318683 (eigyan) a: eigyan
[06:42:00] good morning :)
[06:42:35] deployed the knative changes to ml-serve-eqiad, pods are being refreshed
[06:53:06] we still have ~5k rps sadly, which is way too many
[06:54:01] for some reason one coredns pod was serving most of the nxdomain queries, now that the load has decreased the A records returned without nxdomain are more evenly spread
[06:54:44] atm I am seeing resolutions for stuff like
[06:54:45] frwikisourcewiki-articlequality-predictor-default-drfrm.revscoring-articlequality.svc.cluster.local.codfw.wmnet
[06:54:56] I don't even get why it tries to do it
[06:59:14] and also
[06:59:15] A? api-ro.discovery.wmnet.svc.cluster.local.
[06:59:17] etc..
[06:59:43] I tried to quickly modify the istio destination rule adding a '.' at the end of the 'host' field, didn't really change much
[07:23:50] Morning! The cluster.local. is likely a search: domain?
[07:27:49] The first one you mentioned is puzzling, but I suspect it was assembled from various bits (a search: domain should never show up in the middle)
[07:31:07] yes it is a search domain, but it shouldn't really try to do it now with ndots:3
[07:31:26] or maybe knative sets stuff like "frwikisourcewiki-articlequality-predictor-default-drfrm.revscoring-articlequality
[07:31:52] I suspect some part of k8s/istio trying to assemble hostnames a la paths.join(random, stuff, here)
[07:32:26] _or_ different parts are tacking on various bits, ignorant of each other
[07:32:52] mmm not sure, those seem to be search domains, istio doesn't know them
[07:35:08] I'd like to try https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/836099
[07:35:32] I've never heard of the libc resolver combining multiple search domains, but who knows
[07:35:59] what do you mean by "combining multiple search domains"?
[07:36:55] If you have multiple domains in search:, they should only be tried one by one, not random combinations
[07:37:10] +1'd 836099
[07:37:22] I think it does, it seems random but it tests all of them
[07:37:50] for api-ro I see
[07:37:52] 07:36:13.866061 IP 10.194.22.193.48847 > 10.194.18.202.domain: 42902+ A? api-ro.discovery.wmnet.svc.cluster.local. (58)
[07:37:55] 07:36:13.866819 IP 10.194.22.193.48847 > 10.194.18.202.domain: 39162+ A? api-ro.discovery.wmnet.cluster.local. (54)
[07:37:58] 07:36:13.867542 IP 10.194.22.193.48847 > 10.194.18.202.domain: 42320+ A? api-ro.discovery.wmnet.codfw.wmnet. (52)
[07:38:01] 07:36:13.868274 IP 10.194.22.193.48847 > 10.194.18.202.domain: 64830+ A? api-ro.discovery.wmnet. (40)
[07:38:04] 07:36:13.895697 IP 10.194.23.134.37842 > 10.194.18.202.domain: 17055+ A? api-ro.discovery.wmnet.revscoring-editquality-reverted.svc.cluster.local. (90)
[07:38:07] that falls out of the ndots: 3 setting of course
[07:39:29] and the above is fired by all pods every 5s
[07:41:03] sheesh
[07:41:51] trying both cluster.local and codfw.wmnet I can understand, but combining them is weird
[07:42:22] I hope the anchored name change you just sent works
[07:43:30] search revscoring-editquality-damaging.svc.cluster.local svc.cluster.local cluster.local codfw.wmnet
[07:43:40] it is not combining them, just going through them
[07:45:42] so in theory ndots:2 should reduce queries even more
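(A minimal sketch of how to confirm the search-list walk from inside one of the isvc pods, assuming a revscoring-editquality-damaging pod; the pod and container names below are placeholders, not taken from the log.)

    # Hypothetical check: dump the resolver config of one isvc pod.
    kubectl -n revscoring-editquality-damaging get pods
    kubectl -n revscoring-editquality-damaging exec <pod-name> -c kserve-container -- cat /etc/resolv.conf
    # Expected shape, matching the search line quoted at 07:43:30:
    #   search revscoring-editquality-damaging.svc.cluster.local svc.cluster.local cluster.local codfw.wmnet
    #   options ndots:3
    # With ndots:3, a name with fewer than three dots (api-ro.discovery.wmnet has two) is
    # appended to each search domain in turn before being tried as an absolute name, which
    # is exactly the query sequence captured above; with ndots:2 the same name is tried as
    # an absolute name first and, once it resolves, the search walk (and its NXDOMAINs) is skipped.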
[07:46:46] ah lovely, with the last patch istio is broken
[07:46:55] uff
[07:47:57] The anchoring one?
[07:48:29] the last one with the extra dot
[07:50:36] yeah, you mentioned that istio doesn't like those
[07:50:42] could be https://github.com/istio/istio/issues/28103
[07:51:24] revering..
[07:52:11] *reverting
[08:00:10] klausman: I'd be inclined to test ndots: 2 in staging
[08:00:29] Yeah, good idea. At worst, we'll find out it doesn't work.
[08:26:41] klausman: created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/836102
[08:27:00] I have also added a small tweak to allow basic transformer settings without rendering where not needed
[08:27:01] LGTM
[08:27:04] like in revscoring only etc..
[08:27:09] ah ok thanks :)
[08:27:31] The inference chart bump will restart everything, right?
[08:27:42] yep
[08:28:05] I mean, in this case, that's ok, but still a bit of a bummer in the general case
[08:28:35] I am not sure about the chart bump to be honest, but in this case we are touching all the isvcs so yes
[08:28:59] we can always use the helmfile hierarchy if needed
[08:29:19] ack
[08:29:40] In root's bash history, there is a for loop I used to check the diffs on everything, just in case you find it useful
[08:29:49] thanks
[08:29:55] everything==all ml-services
[08:34:56] https://grafana-rw.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?orgId=1&var-dc=codfw%20prometheus%2Fk8s-mlstaging - nxdomain seems to be going down
[08:34:58] nxdomain is decreasing already
[08:37:35] I really wish Grafana had per-user folders for dashboards, so I could save my crummy edits without ruining stuff for everybody else
[08:40:48] completed the rollout, but the dns graphs don't really show much of a gain
[08:41:52] need to run an errand, let's see if things improve as time passes
[08:42:00] ack, will keep an eye on things
[10:04:55] not a big result but it looks good
[10:06:08] I also found https://github.com/istio/istio/issues/35603 which could be useful
[10:08:55] now that I think about it I may have already tried it, but without focusing on nxdomain
[10:10:16] in the meantime, I am going to roll out the ndots:2 change to ml-serve-codfw
[10:16:29] ack
[10:32:26] rollout completed, pods are slowly getting refreshed
[10:33:24] hopefully we'll go below 4k rps
[10:34:33] It's almost there. 3k noerror, 1.5k nxdomain
[10:40:00] * elukey lunch!
[11:15:05] ditto
[12:58:56] we gained something with ndots:2, but nothing really significant
[12:59:39] given the status of the cluster, I am inclined to close the task related to the rollout of async ores preprocess(), since performance is now back to "comparable" between prod and staging
[12:59:51] and I'll open a new task to work on the remaining 5k rps
[13:00:57] going to complete the rollout in ml-serve-eqiad
[13:01:48] SGTM
[13:30:52] going to roll restart ORES for expat updates
[13:30:59] out for a bit for groceries and some errands.
[13:43:32] ok ml-serve-eqiad rollout completed, from my tests it seems faster than the codfw clusters (which are now similar to each other)
[13:44:03] for example, enwiki-goodfaith with 10 clients leads to ~40/43 rps in codfw (staging or ml-serve) and ~50/53 on ml-serve-eqiad
[13:49:11] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move revscoring isvcs to async architecture - https://phabricator.wikimedia.org/T313915 (elukey) The async code has been deployed to all clusters, and now we see better performances on all revscoring-based models. We have spent time...
[14:09:07] https://github.com/istio/istio/issues/33968
[14:09:17] indeed I see a lot of calls to coredns for zipkin
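(The zipkin lookups mentioned just above can be spotted the same way the api-ro queries were captured earlier; a rough sketch, where the interface, timeout, and filter are assumptions rather than something from the log.)

    # Hypothetical capture: watch DNS traffic towards coredns and pick out zipkin lookups.
    sudo tcpdump -l -n -i any port 53 2>/dev/null | grep -i zipkin
    # Optionally tally which names are asked for most often over a short window:
    sudo timeout 30 tcpdump -l -n -i any port 53 2>/dev/null | grep -oE 'A\? [^ ]+' | sort | uniq -c | sort -rn | head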
[14:12:33] the fix needs istio-proxy to be restarted sigh
[14:12:38] testing on ml-serve-codfw
[14:24:14] recycling pods works, fewer dns queries
[14:28:19] Lift-Wing, Machine-Learning-Team (Active Tasks): Reduce DNS queries from istio-proxies to coredns on ML clusters - https://phabricator.wikimedia.org/T318814 (elukey)
[14:59:04] very weird, at some point the pod recycling stopped lowering dns queries
[14:59:13] as if the setting was not picked up completely
[16:03:40] not sure why, but revscoring-editquality-damaging on ml-serve-codfw is a little broken atm, some pods stuck in Init state after the delete/recreate pod action
[16:03:48] it has happened in the past, I believe it is a bug/corner case of knative
[16:03:56] I am trying to clean up the ReplicaSet manually
[16:10:30] I'll leave it as it is for the moment, usually the controller is able to reconcile at some point, will check tomorrow morning to see its status!
[16:10:37] going afk for today folks, have a nice rest of the dya
[16:10:39] *day
[16:33:24] bye Luca! :)
[19:09:21] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Connect Outlink topic model to eventgate - https://phabricator.wikimedia.org/T315994 (achou) @Isaac thanks for answering. The reason why I was asking is because ORES articletopic has kind of the same output schema for the `revision-cr...
[19:26:41] Lift-Wing, Machine-Learning-Team (Active Tasks), Patch-For-Review: Support pre-transformed inputs for Outlink topic model - https://phabricator.wikimedia.org/T315998 (achou) > Maybe we pick this up in a week or two when I'm back from research offsite and coordinate directly (meeting where I kick off...
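(For the pods stuck in Init state mentioned around 16:03, a minimal inspection sketch along these lines could help when picking the issue back up; the namespace is the one from the chat, while the pod and ReplicaSet names are placeholders.)

    # Hypothetical follow-up on the stuck revision in ml-serve-codfw.
    kubectl -n revscoring-editquality-damaging get pods | grep -v Running
    kubectl -n revscoring-editquality-damaging get replicasets --sort-by=.metadata.creationTimestamp
    # Describe one stuck pod to see which init container is blocking:
    kubectl -n revscoring-editquality-damaging describe pod <stuck-pod-name>
    # If a stale ReplicaSet is the culprit, removing it should let the owning
    # controller recreate a fresh one on the next reconcile:
    kubectl -n revscoring-editquality-damaging delete replicaset <stale-replicaset-name>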