[09:48:49] <jayme>	 vgutierrez: Trying to DTRT with the ingressgateway LVS, I ended up with this https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/759749 - using a different port for monitor and monitor/ing/ is currently not supported in the new probes: stanza. Will work something out there as well
[09:49:17] <vgutierrez>	 err
[09:49:33] * jayme hides
[09:51:17] <vgutierrez>	 I'm not sure that we should allow that TBH
[09:52:50] <jayme>	 I guess the concern is that one could end up checking something completely different?
[09:53:21] <vgutierrez>	 yeah, and keeping servers pooled that shouldn't be pooled
[09:53:25] <vgutierrez>	 (or the other way around)
[09:55:27] <jayme>	 I do get that. It's just that checking the TLS port (via idleconnection) is not exactly the right thing to do
[09:55:39] <jayme>	 in this specific case
[09:57:08] <vgutierrez>	 sure, but it's hard to consider checking another port "the right thing to do"
[09:57:08] <elukey>	 in general we shouldn't allow it I agree, for istio ingress it seems that its creators wanted to split health check traffic vs traffic.
[09:57:28] <jayme>	 I get it's an edge case. I'm a bit afraid it bites someone at some point (the fact that it will stop accepting connections on the TLS port)
[09:57:44] <jayme>	 *stop accepting in a "still healthy" way
[09:58:00] <vgutierrez>	 how can it be healthy and stop accepting traffic?
[09:58:32] <jayme>	 that is in case it has no backend routes configured
[09:59:04] <jayme>	 the reverse-proxy that ingressgateway is, is still fine in that case. Only there is nothing "behind"
[09:59:23] <vgutierrez>	 so why should it receive user traffic in that case?
[09:59:39] <vgutierrez>	 to return a 504?
[10:00:02] <godog>	 something else to consider is what's pybal/lvs (and puppet) view on the services hosted by/on the ingress, will those be available to pybal and thus checking tls will be the right thing to do ?
[10:00:23] <godog>	 (my two cents, please excuse the drive-by comment)
[10:00:40] <jayme>	 hm, I'm afraid it won't even do that vgutierrez
[10:00:55] <volans>	 [same] what are we trying to monitor here, istio itself or the reachability of its backends? I assume istio does its own healthchecks of its backends
[10:01:42] <jayme>	 my understanding was that pybal will alert in case all potential nodes are down
[10:02:05] <vgutierrez>	 it will alert sooner than that
[10:02:14] <vgutierrez>	 as soon as the depool threshold is reached
[10:02:28] <jayme>	 volans: we are trying to monitor istio itself. The backends (e.g. services) will be monitored independently
[10:02:42] <vgutierrez>	 so 50% of the servers down  (by default) will trigger an alert
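The alerting rule described above (an alert once the depool threshold is reached, 50% by default) can be illustrated with a small model. This is only a sketch of the arithmetic as stated in the conversation, not PyBal's actual code; the function names and the `depool_threshold` parameter are hypothetical.

```python
def healthy_fraction(total, failing):
    """Fraction of servers still passing their checks."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failing <= total:
        raise ValueError("failing must be between 0 and total")
    return (total - failing) / total


def threshold_reached(total, failing, depool_threshold=0.5):
    """Simplified model: the alert condition holds once the healthy
    fraction drops to the depool threshold or below.

    Assumption: depool_threshold is the fraction of servers that should
    stay pooled, with 0.5 as the default mentioned in the chat.
    """
    return healthy_fraction(total, failing) <= depool_threshold
```

With four servers and the default threshold, one failing server stays below the alert condition and the second one reaches it, well before all nodes are down.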
[10:03:26] <volans>	 jayme: but you want to check that istio is "up" or that it can route incoming traffic to a specific backend and return that traffic to pybal?
[10:04:13] <jayme>	 volans: Ideally I would like to check it's up/"ready" by its own definition
[10:04:17] <volans>	 *specific backend as in a specific service, not actual k8s backend
[10:04:22] <vgutierrez>	 if checking that istio is "up" is enough then IdleConnection should suffice if istio can't provide a full L7 check
[10:04:35] <elukey>	 it does provide it, but on a separate port
[10:04:37] <vgutierrez>	 (can't provide on the same port)
[10:04:56] <elukey>	 because of the issue that Janis explained, that routes vs health checks are considered differently
[10:05:15] <elukey>	 (one port for traffic towards backend services, one for the istio gateway's status itself)
[10:06:10] <volans>	 could we have a backend service that lives together with istio?
[10:06:50] <jayme>	 yes, but then *all* istios would fail in case that one fails
[10:07:27] <volans>	 why?
[10:07:37] <volans>	 I meant each istio would have a localhost backend basically
[10:08:12] <elukey>	 volans: having a dedicated pod as backend service for health checks seems to be the same as health checking on a different port, in the istio use case
[10:08:26] <elukey>	 I don't see a lot of differences..
[10:08:35] <elukey>	 (failure use cases wise I mean)
[10:09:08] <elukey>	 if that service doesn't work (maybe gets throttled by k8s etc..) we'd be in trouble 
[10:09:12] <volans>	 sure, I don't see a big issue having pybal check on a different port if the other port is managed by the same process that handles the traffic port
[10:09:26] <volans>	 and provides health status on that dedicated port
[10:09:38] <jayme>	 that is exactly the case here
[10:09:43] <elukey>	 it is the same pod, in theory same service daemon (the istio proxy), but Janis can confirm 
[10:09:45] <vgutierrez>	 well.. socket related issues are going to be ignored
[10:09:46] <volans>	 either you trust that service to provide you a reliable health status
[10:10:07] <volans>	 vgutierrez: true that's why I was asking before if we want to check istio is up or that the traffic can route back to pybal
[10:10:11] <volans>	 are two different things
[10:11:21] <volans>	 vgutierrez: at the same time we don't do checks on the VIPs on the backends with pybal, kinda the same thing
[10:15:24] <jayme>	 just to be clear on this: IdleConnection will, AFAICT, also work. It will only fail in the (potentially rare) case when there is no backend/route configured for ingressgateway
[10:15:57] <volans>	 can we do a very basic tcp ping check + health on the other port?
[10:16:13] <vgutierrez>	 that potentially rare case won't trigger any user facing issue on its own, right?
[10:16:30] <jayme>	 nope. Just noise for SRE I guess
[10:18:02] <jayme>	 (to "visualize" how this looks: https://paste.debian.net/1229879/ - the codfw cluster has a backend configured, the eqiad one has not)
[10:19:06] <godog>	 yeah that seems right to me, as in if there's nothing to route pybal shouldn't send traffic
[10:19:53] <godog>	 if all backends stop listening on their tls port then that's a problem for sure
[10:20:12] <jayme>	 godog: that won't be caught here
[10:20:41] <jayme>	 as soon as a backend is *registered* in ingressgateway, it will start to accept connections on 30443
[10:20:51] <jayme>	 regardless of the state of the backend
[10:21:39] <godog>	 jayme: err I meant pybal backends, but thank you I didn't know about the registered thing
[10:21:55] <jayme>	 ack
[10:22:06] <jayme>	 all those overloaded terms :|
[10:22:10] <godog>	 that takes me back to the question re: services on top of istio, will pybal/lvs know about them in some shape or form?
[10:22:14] <godog>	 yeah they are :|
[10:23:15] <jayme>	 pybal/lvs I think not. But they should remain in the service catalog
[10:23:34] <jayme>	 so monitoring:/probes: should be there for them
[10:23:42] <godog>	 ack, thanks! (I agree)
[10:25:41] <jayme>	 So...first of all thanks for all the contributions :) - What we can take away is:
[10:25:45] <jayme>	 1) We can configure ingressgateway lvs with IdleConnection monitor (biting the bullet that this might cause alerts during setup of new clusters)
[10:26:29] <jayme>	 2) have monitoring/probes check a different port (to actually monitor the state of ingressgateway at the application level)
[10:28:43] <jayme>	 3) the more "correct" approach would potentially be to do IdleConnection && ProxyFetch(different port)
[10:30:02] <jayme>	 as in that case we would not exclude socket errors while still checking the application's actual state
[10:31:24] <godog>	 yeah that seems correct to me
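Option 3 above, a connection check on the traffic port combined with an HTTP health fetch on a separate port, can be sketched in plain Python. This is a standalone illustration using only the standard library, not PyBal's IdleConnection or ProxyFetch monitors. The function names, the health path `/healthz/ready`, and the health port are assumptions for illustration; only the traffic port 30443 comes from the discussion.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_health_ok(host, port, path="/healthz/ready", timeout=2.0):
    """Return True if GET http://host:port/path answers with a 2xx status.

    The default path is an assumption, not taken from the discussion.
    """
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Non-2xx responses raise HTTPError (a URLError subclass).
        return False


def combined_check(host, traffic_port, health_port,
                   path="/healthz/ready", timeout=2.0):
    """Healthy only if the traffic port accepts connections AND the
    health endpoint on the other port reports ready."""
    return (tcp_port_open(host, traffic_port, timeout)
            and http_health_ok(host, health_port, path, timeout))
```

A call such as `combined_check("ingress.example", 30443, 15021)` (hypothetical host and health port) would fail on socket-level problems on the traffic port while still reflecting the gateway's own readiness status.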
[12:41:41] <wikibugs>	 Acme-chief, User-dcaro, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (aborrero)
[12:41:45] <wikibugs>	 Acme-chief, Toolforge, cloud-services-team (Kanban): problem with let's encrypt cert for star.tools.wmflabs.org - https://phabricator.wikimedia.org/T298353 (aborrero)
[12:42:35] <wikibugs>	 Acme-chief, User-dcaro, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (aborrero)
[12:46:19] <wikibugs>	 Acme-chief, User-dcaro, cloud-services-team (Kanban): acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP - https://phabricator.wikimedia.org/T273956 (aborrero)
[12:56:32] <sukhe>	 win 24
[12:56:35] <sukhe>	 er
[16:38:28] <wikibugs>	 netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (mpopov) Hello! I would prefer to not have an allowlist for external domains, but if the final decision is to have one t...
[18:17:59] <wikibugs>	 Traffic, netops, Infrastructure-Foundations: Allocate range/IP and enable IPv6 on Wikidough hosts - https://phabricator.wikimedia.org/T301165 (cmooney) Open→In progress p:Triage→Medium
[18:20:11] <wikibugs>	 Traffic, netops, Infrastructure-Foundations: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (cmooney)
[19:12:09] <wikibugs>	 Traffic, netops, Infrastructure-Foundations, SRE: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (ssingh)