[07:01:27] good morning
[07:01:45] so I was reading https://github.com/istio-ecosystem/istio-coredns-plugin and https://istio.io/latest/blog/2020/dns-proxy/
[07:02:04] that may be related to the current istio deployment problem
[07:02:46] in theory we don't use the sidecar, plus I am not sure how it works for the gateway pods themselves
[07:03:00] but I can try
[07:03:00] ISTIO_META_DNS_CAPTURE: "true"
[07:03:01] ISTIO_META_PROXY_XDS_VIA_AGENT: "true"
[07:09:27] nope
[07:20:04] with nsenter attached to the istio gateway container, I see the following resolv.conf
[07:20:07] nameserver 10.64.77.3
[07:20:09] search istio-system.svc.cluster.local svc.cluster.local cluster.local eqiad.wmnet
[07:20:12] options ndots:5
[07:25:54] if I do dig istiod.istio-system.svc.cluster.local @10.64.77.3
[07:26:03] I get an IP
[07:26:35] so the base coredns seems to work fine
[07:32:14] tried to override the config in istioctl for the discovery address, not working
[07:50:12] and from any host, curl http://10.64.77.184:15014/debug/endpointz works fine
[08:00:05] as a test, I used https://istio.io/latest/docs/setup/install/istioctl/#uninstall-istio to clean up and applied a manifest with default gateway configs
[08:00:54] not working
[08:28:54] the weird thing is trying to parse
[08:28:55] 2021-07-14T08:21:35.033220Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, connection closed
[08:28:58] 2021-07-14T08:21:36.719286Z warn Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 0 successful, 0 rejected; lds updates: 0 successful, 0 rejected
[08:29:18] so the "stream closed ... 14" part seems to be fine, it says "connection closed"
[08:29:31] not errors like "timeout, unavailable, etc.."
[08:30:28] see https://github.com/envoyproxy/envoy/issues/14591
[08:33:14] so at this point it may be istiod not pushing the envoy config to the gateway
[08:33:17] Incremental push, service istio-ingressgateway.istio-system.svc.cluster.local has no endpoints
[08:34:52] and as Tobias pointed out yesterday
[08:34:53] kubectl get ep -n istio-system
[08:35:02] elukey@ml-serve-ctrl1001:~$ kubectl get ep -n istio-system
[08:35:02] NAME                   ENDPOINTS                                                               AGE
[08:35:05] istio-ingressgateway                                                                           35m
[08:35:08] istiod                 10.64.79.204:15014,10.64.79.204:15017,10.64.79.204:15010 + 1 more...   35m
[08:36:41] but the istio-ingressgateway pod is not healthy, so it may make sense
[08:38:41] (basically the health probe to the ingress gateway fails)
[08:39:28] that is what happens
[08:39:29] Warning  Unhealthy  4m54s (x1036 over 39m)  kubelet, ml-serve1001.eqiad.wmnet  Readiness probe failed: HTTP probe failed with statuscode: 503
[08:46:05] Does a manual fetch of the health endpoint work?
[08:46:54] nope, 503
[08:46:56] good morning :)
[08:47:50] I am now wondering if the docker image is wrong
[08:47:59] I am checking https://hub.docker.com/layers/istio/proxyv2/1.9.5/images/sha256-01fc76f6dd3665ff1dd4aa9e309e4b02380fa266763e172ca07c766e8d2fe2d7?context=explore to compare with ours
[08:50:10] Morning :)
[08:51:52] the container I saw at least did run pilot-agent and envoy (the latter started by the former, according to ps axf). So it's not that it's not running at all. But I dunno if the ingress gw would be another process
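
(For context on the failing probe above: the upstream istio-ingressgateway deployment points the kubelet readiness check at the pilot-agent health endpoint, roughly as sketched below. This is a reconstruction from the upstream chart defaults, not the exact manifest running on ml-serve.)

```yaml
# Sketch of the readiness probe on the istio-ingressgateway pod (upstream chart
# defaults; timing fields omitted). pilot-agent serves /healthz/ready on port 15021
# and only answers 200 once envoy has received its listener/cluster config from
# istiod, so "no config from istiod" surfaces both as the 503s seen by the kubelet
# and as an empty Endpoints object for the istio-ingressgateway service.
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 15021
    scheme: HTTP
```
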
[08:52:48] it seems that envoy on the ingress gw returns 503 to the health probe, failing to populate the endpoints
[08:53:07] and istiod fails to push configs to it
[08:53:35] but the istio ingress gw also complains that it doesn't get any config from istiod
[08:54:11] I assumed that envoy failing the health probe was due to the missing istiod config pushed to the pod, but at this point it might be something else
[08:55:05] of course I don't find anything in the logs pointing to a problem
[09:00:02] did the docker image comparison turn up anything?
[09:01:28] we are not copying something like stats-filter.compiled.wasm
[09:01:51] but they are not in the repo, trying to build istio to see if they pop up
[09:02:04] I would expect some indication from envoy that something is missing though
[09:04:11] I see that pilot-agent uses some values like .Values.global.proxy.componentLogLevel for setting envoy logging levels
[09:04:17] trying to tweak those
[09:12:25] ok I made it, the new ingress pod has info logging
[09:14:26] klausman: The .compiled files are generated by the build, I can try to create a new docker image for them
[09:17:05] The log still looks mostly the same to me :-/
[09:25:48] I am not confident that the problem is in the docker images since on minikube it worked fine
[09:25:59] so maybe it is a perf improvement or similar
[09:28:24] How do you mean?
[09:29:56] I would have expected the same issue on minikube when testing the istio images if it was something related to missing .compiled files
[09:30:18] so my theory is that the .compiled files are meant to help envoy speed up bootstrap time
[09:30:30] rather than being critical
[09:30:54] (envoy in theory should be able to compile wasm by itself)
[09:35:11] Ah, I misunderstood. I thought you implied that the perf stuff broke our lookup somehow
[09:35:29] ahhh nono
[09:35:40] I found out that on port 15000 envoy exposes a nice admin api
[09:35:42] root@ml-serve1004:/home/elukey# curl localhost:15000/ready -i
[09:35:42] HTTP/1.1 200 OK
[09:35:51] (I am attached to the namespace)
[09:36:37] so it is very confusing
[09:38:26] root@ml-serve1004:/home/elukey# curl localhost:15021/healthz/ready -i
[09:38:26] HTTP/1.1 503 Service Unavailable
[09:38:32] ahahhah
[09:38:35] * elukey cries in a corner
[09:38:52] So this "pilot not running" thing gives me pause
[09:39:10] Is this referring to pilot-agent or something else?
[09:40:59] pilot not running? Where do you see it?
[09:41:09] ahhh in the logs
[09:41:26] my assumption is that it refers to pilot discovery on istiod
[09:41:45] since it says that it doesn't have any config
[09:42:23] There is a command like /healthcheck/ok: cause the server to pass health checks
[09:42:35] in the admin interface, I am going to use it and see how it goes
[09:43:42] doesn't work
[09:47:28] I've also checked if maybe it's a memory limit thing killing a component. Nothing I could find
[09:47:29] the change for the docker images is https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/704507/1
[09:48:08] ah no it is wrong, fixing
[09:48:57] I'll also merge the one for kfserving/knative images
[09:51:17] ok fixed :)
[09:52:46] 'avin' a look-see
[09:53:23] +1'd
[09:55:00] <3, building and then trying
[09:59:40] same issue, nothing changed
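
(For reference, the log-level tweak from 09:04 that produced the info-level gateway logs is driven by Helm/operator values along these lines; a sketch only, since the actual values file used here isn't shown in the log.)

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        # pilot-agent forwards this to envoy's --component-log-level flag;
        # raising it from the default ("misc:error") yields info-level output
        componentLogLevel: "misc:info"
```
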
[10:04:26] could this be an authentication error of some sort?
[10:05:33] it could be anything
[10:05:47] but I am wondering if it is something envoy-related
[10:06:16] You mean pilot is working correctly, but Envoy is doing something wrong?
[10:06:51] yes, mainly because of this istiod log
[10:07:11] Incremental push, service istio-ingressgateway.istio-system.svc.cluster.local has no endpoints
[10:07:57] if it tries to push a config to it only via endpoints then there is no chance that envoy on the ingress will get a config
[10:08:12] but at the same time, envoy seems not ready due to missing configs from pilot
[10:08:33] and it seems to report no real issue connecting to the discovery endpoint
[10:10:55] Envoy says it has no certs, either. Should it have some?
[10:15:48] can you point me to the log?
[10:15:57] sec
[10:18:01] root@ml-serve1002:~# nsenter -n -t 13176
[10:18:02] root@ml-serve1002:~# curl http://localhost:15000/certs
[10:18:04] {
[10:18:06] "certificates": []
[10:18:08] }
[10:18:47] ah, I believe it should be ok
[10:19:45] I suspected so, but wasn't sure
[10:34:25] going to take a break for lunch!
[10:34:25] ttl
[10:44:35] Aye. I got some Tagliatelle to cook :)
[12:50:40] elukey: hi, can we turn off the ores precaching when the dc is not getting traffic? :D
[12:50:49] it's just using a lot of resources for no reason
[12:55:51] Amir1: hi! I wouldn't touch Ores if possible, it is using resources but not really a problem afaics :D
[12:56:57] yeah, good point, let's leave that thing
[12:57:04] I am worried that we forget about it when switching back or similar, ending up in people complaining etc..
[12:58:32] yeah
[13:47:04] back to istio
[13:47:29] I think that while we debug this it may be good to open a gh issue upstream to get advice
[13:47:39] worst case we close the issue with "PEBCAK sorry" :)
[13:54:45] Yeah, sounds like a good plan
[14:42:25] /quit
[14:58:36] klausman: it works now!
[14:59:03] while opening the bug report I noticed that I added a meshconfig option earlier on to debug why istiod wasn't working
[14:59:08] I removed it and now it works
[14:59:27] it was
[14:59:27] # meshConfig:
[14:59:27] #   defaultConfig:
[14:59:27] #     controlPlaneAuthPolicy: NONE
[14:59:34] (without the #)
[14:59:39] all pods up and running now!
[15:00:05] lemme try to apply our config now (with node ports etc..)
[15:06:22] works as well!
[15:06:27] * elukey dances
[15:08:04] klausman: apologies for the waste of time, didn't think about checking all the config again up to now :(
[15:08:17] but the good news is that we should be ready for knative!
[15:08:26] (even if the helm chart needs some love)
[15:11:42] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Install Istio on ml-serve cluster - https://phabricator.wikimedia.org/T278192 (elukey) Finally! ` elukey@ml-serve-ctrl1001:~$ kubectl get pods -A NAMESPACE NAME READY STATUS RESTARTS AGE istio-syst...
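
(The offending option from 14:59 in context: a minimal reconstruction of how it would have sat in the IstioOperator manifest, since the full manifest isn't pasted in the log.)

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      # added earlier while debugging istiod; with this set, istiod never pushed
      # config to the ingress gateway, and removing it made the gateway go Ready
      controlPlaneAuthPolicy: NONE
```
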
[15:17:46] ok now knative is the next one
[15:20:09] this is where some extra thought is needed, see https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/kfserving#Kfserving_stack
[15:20:40] so up to now, IIUC, the istio operator config (istioctl etc..) takes care of setting up two ingress gateways, but no L7 config
[15:21:26] namely, istio's Gateway (the CRD I mean) resources will be deployed by knative in its net-istio config
[15:21:57] so, in my mind, the TLS certificate for inference.wikimedia.org will need to be deployed as part of the knative chart
[15:22:07] in this way, we'll have
[15:22:36] LVS endpoint --> istio ingress gw (doing TLS termination) --> routing to pods via Host header
[15:23:15] in the Host header we'll have to set the InferenceService target pod svcs, like enwiki-goodfaith etc..
[15:23:24] (basically how it works when testing on minikube)
[15:30:36] commented in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/699380
[15:36:43] elukey: nice one! that is great news :)
[15:46:24] accraze: \o/
[15:46:25] Very nice!
[15:47:00] Actually, the moment I saw `controlPlaneAuthPolicy: NONE` was when I mentioned `could this be an authentication error of some sort?`
[15:47:02] :D
[15:47:09] Should've pestered more ;)
[15:53:04] yes it was the right question to ask, but it didn't occur to me that the error may have been in the config
[15:53:41] to fix the istiod deployment I tried a million settings before finding the right one :D
[15:54:58] The curse of our profession :)
[15:55:36] I don't like it though when it comes down to shots in the dark, the logs should tell more
[15:56:22] absolutely
[15:56:38] *Something* should've mumbled something about auth
[15:58:41] Nice!
[18:06:30] * elukey afk!
[23:13:51] Lift-Wing, Machine-Learning-Team (Active Tasks): Prepare 4 ORES English models for Lift Wing - https://phabricator.wikimedia.org/T272874 (ACraze)
[23:13:59] Lift-Wing, artificial-intelligence, articlequality-modeling, revscoring, Machine-Learning-Team (Active Tasks): Create a KFServing model server for articlequality models - https://phabricator.wikimedia.org/T284678 (ACraze) Open→Resolved We have the enwiki-articlequality inference servi...
[23:18:11] cool, i think that's all the model server images we need to host all of the ores models - editquality, articlequality, drafttopic, articletopic
[23:19:22] ah nvm, forgot about draftquality: https://github.com/wikimedia/draftquality
[23:20:11] will add a task to either make a custom model server or see if it runs using the base revscoring image
[23:28:51] Lift-Wing, ORES, draftquality-modeling, Machine-Learning-Team (Active Tasks): Create a KFServing model server for draftquality models - https://phabricator.wikimedia.org/T286686 (ACraze)
[23:29:55] Lift-Wing, ORES, draftquality-modeling, Machine-Learning-Team (Active Tasks): Create a KFServing model server for draftquality models - https://phabricator.wikimedia.org/T286686 (ACraze) p:Triage→Low
[23:30:21] ^set this as low since there are only two models in this class (enwiki, ptwiki)
[23:53:50] Lift-Wing, ORES, artificial-intelligence, draftquality-modeling, Machine-Learning-Team (Active Tasks): Create a KFServing model server for draftquality models - https://phabricator.wikimedia.org/T286686 (ACraze) I have uploaded the model file into our public bucket, here is the storage uri: s...
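
(To make the routing plan from 15:21-15:23 concrete: a hedged sketch of the kind of istio Gateway resource that knative's net-istio deploys, with TLS termination for inference.wikimedia.org handled on the knative side. The resource name, namespace, hosts and the credentialName secret are illustrative, not actual chart values.)

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: knative-ingress-gateway      # illustrative; net-istio ships a gateway along these lines
  namespace: knative-serving
spec:
  selector:
    istio: ingressgateway            # bind to the istio ingress gateway pods
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE                                # TLS terminated at the ingress gateway
        credentialName: inference-wikimedia-org-tls # hypothetical secret holding the cert/key
      hosts:
        - "*"   # routing to the InferenceService pods (e.g. enwiki-goodfaith) then happens via the Host header
```

With something like this bound to the gateways that the istio operator config already creates, the flow described above becomes: LVS endpoint -> istio ingress gateway (TLS termination) -> Host-header routing to the InferenceService services.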