[09:48:49] vgutierrez: Trying to DTRT with the ingressgateway LVS, I ended up with this https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/759749 - using a different port for monitor and monitor/ing/ is currently not supported in the new probes: stanza. Will work something out there as well
[09:49:17] err
[09:49:33] * jayme hides
[09:51:17] I'm not sure that we should allow that TBH
[09:52:50] I guess the concern is that one could end up checking something completely different?
[09:53:21] yeah, and keeping servers pooled that shouldn't be pooled
[09:53:25] (or the other way around)
[09:55:27] I do get that. It's just that checking the TLS port (via idleconnection) is not exactly the right thing to do
[09:55:39] in this specific case
[09:57:08] sure, but it's hard to consider checking another port "the right thing to do"
[09:57:08] in general we shouldn't allow it, I agree; for istio ingress it seems that its creators wanted to split health check traffic vs traffic.
[09:57:28] I get it's an edge case. I'm a bit afraid it bites someone at some point (the fact that it will stop accepting connections on the TLS port)
[09:57:44] *stop accepting in a "still healthy" way
[09:58:00] how can it be healthy and stop accepting traffic?
[09:58:32] that is in case it has no backend routes configured
[09:59:04] the reverse-proxy that ingressgateway is, is still fine in that case. Only there is nothing "behind"
[09:59:23] so why should it receive user traffic in that case?
[09:59:39] to return a 504?
[10:00:02] something else to consider is what's pybal/lvs (and puppet) view on the services hosted by/on the ingress, will those be available to pybal and thus checking tls will be the right thing to do?
[10:00:23] (my two cents, please excuse the drive-by comment)
[10:00:40] hm, I'm afraid it won't even do that vgutierrez
[10:00:55] [same] what are we trying to monitor here, istio itself or the reachability of its backends?
I assume istio does its own healthchecks of its backends
[10:01:42] my understanding was that pybal will alert in case all potential nodes are down
[10:02:05] it will alert sooner than that
[10:02:14] as soon as the depool threshold is reached
[10:02:28] volans: we are trying to monitor istio itself. The backends (e.g. services) will be monitored independently
[10:02:42] so 50% of the servers down (by default) will trigger an alert
[10:03:26] jayme: but you want to check that istio is "up" or that it can route incoming traffic to a specific backend and return that traffic to pybal?
[10:04:13] volans: Ideally I would like to check it's up/"ready" as of its own definition
[10:04:17] *specific backend as in a specific service, not actual k8s backend
[10:04:22] if checking that istio is "up" is enough then IdleConnection should suffice if istio can't provide a full L7 check
[10:04:35] it does provide it, but on a separate port
[10:04:37] (can't provide on the same port)
[10:04:56] because of the issue that Janis explained, that routes vs health checks are considered differently
[10:05:15] (one port for traffic towards backend services, one for the istio gateway's status itself)
[10:06:10] could we have a backend service that lives together with istio?
[10:06:50] yes, but then *all* istios would fail in case that one fails
[10:07:27] why?
[10:07:37] I meant each istio has a localhost backend basically
[10:08:12] volans: having a dedicated pod as backend service for health checks seems to be the same as health checking on a different port, in the istio use case
[10:08:26] I don't see a lot of differences..
[10:08:35] (failure use cases wise I mean)
[10:09:08] if that service doesn't work (maybe gets throttled by k8s etc..)
we'd be in trouble
[10:09:12] sure, I don't see a big issue having pybal check on a different port if the other port is managed by the same process that handles the traffic port
[10:09:26] and provides health status on that dedicated port
[10:09:38] that is exactly the case here
[10:09:43] it is the same pod, in theory same service daemon (the istio proxy), but Janis can confirm
[10:09:45] well.. socket related issues are going to be ignored
[10:09:46] either you trust that service to provide you a reliable health status
[10:10:07] vgutierrez: true, that's why I was asking before if we want to check istio is up or that the traffic can route back to pybal
[10:10:11] are two different things
[10:11:21] vgutierrez: at the same time we don't do checks on the VIPs on the backends with pybal, kinda the same thing
[10:15:24] just to be clear on this: IdleConnection will, AFAICT, also work. It will only fail in the (potentially rare) case when there is no backend/route configured for ingressgateway
[10:15:57] can we do a very basic tcp ping check + health on the other port?
[10:16:13] that potentially rare case won't trigger any user facing issue on its own, right?
[10:16:30] nope.
Just noise for SRE I guess
[10:18:02] (to "visualize" how this looks: https://paste.debian.net/1229879/ - the codfw cluster has a backend configured, the eqiad one has not)
[10:19:06] yeah that seems right to me, as in if there's nothing to route pybal shouldn't send traffic
[10:19:53] if all backends stop listening on their tls port then that's a problem for sure
[10:20:12] godog: that won't be cached here
[10:20:41] as soon as a backend is *registered* in ingressgateway, it will start to accept connections on 30443
[10:20:51] regardless of the state of the backend
[10:21:39] jayme: err I meant pybal backends, but thank you I didn't know about the registered thing
[10:21:55] ack
[10:22:06] all those overloaded terms :|
[10:22:10] that takes me back to the question re: services on top of istio, will pybal/lvs know about them in some shape or form?
[10:22:14] yeah they are :|
[10:23:15] pybal/lvs I think not. But they should remain in the service catalog
[10:23:34] so monitoring:/probes: should be there for them
[10:23:42] ack, thanks!
(I agree)
[10:25:41] So...first of all thanks for all the contributions :) - What we can take away is:
[10:25:45] 1) We can configure ingressgateway lvs with IdleConnection monitor (biting the bullet that this might cause alerts during setup of new clusters)
[10:26:29] 2) have monitoring/probes check a different port (to actually monitor the state of ingressgateway on application basis)
[10:28:43] 3) the more "correct" approach would potentially be to do IdleConnection && ProxyFetch(different port)
[10:30:02] as in that case we would not exclude socket errors while still checking the application's actual state
[10:31:24] yeah that seems correct to me
[12:41:41] Acme-chief, User-dcaro, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (aborrero)
[12:41:45] Acme-chief, Toolforge, cloud-services-team (Kanban): problem with let's encrypt cert for star.tools.wmflabs.org - https://phabricator.wikimedia.org/T298353 (aborrero)
[12:42:35] Acme-chief, User-dcaro, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (aborrero)
[12:46:19] Acme-chief, User-dcaro, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (aborrero)
[12:56:32] win 24
[12:56:35] er
[16:38:28] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (mpopov) Hello! I would prefer to not have an allowlist for external domains, but if the final decision is to have one t...
[18:17:59] Traffic, netops, Infrastructure-Foundations: Allocate range/IP and enable IPv6 on Wikidough hosts - https://phabricator.wikimedia.org/T301165 (cmooney) Open→In progress p:Triage→Medium
[18:20:11] Traffic, netops, Infrastructure-Foundations: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (cmooney)
[19:12:09] Traffic, netops, Infrastructure-Foundations, SRE: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (ssingh)
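
Note on takeaway 3) from the 10:28 discussion (IdleConnection && ProxyFetch on a different port): the idea of combining a socket-level check on the traffic port with an application-level check on the separate health port could be sketched roughly as below. This is not PyBal code and does not use PyBal's monitor API; it is a stdlib-only Python illustration. The function names, the timeouts, and the choice of "HTTP 200 means ready" are assumptions made for the example.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Socket-level check: can a TCP connection to host:port be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_health_ok(url: str, timeout: float = 2.0) -> bool:
    """Application-level check: does a GET on the health URL return 200?

    Treating only HTTP 200 as healthy is an assumption for this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Non-2xx answers, refused connections, timeouts, bad URLs.
        return False


def backend_is_healthy(host, traffic_port, health_url,
                       tcp_check=tcp_port_open,
                       http_check=http_health_ok) -> bool:
    """Healthy only if BOTH checks pass (the '&&' from the discussion).

    The TCP check keeps socket errors on the traffic port visible, while
    the HTTP check reflects the state reported on the health port.
    """
    return tcp_check(host, traffic_port) and http_check(health_url)
```

A call might look like `backend_is_healthy("k8s-node.example", 30443, "http://k8s-node.example:15021/healthz/ready")`. Port 30443 is the traffic port mentioned in the log; the hostname, the health port and the path are hypothetical placeholders and would need to match the actual ingressgateway setup.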