[09:23:56] (VarnishTrafficDrop) firing: 59% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[09:48:56] (VarnishTrafficDrop) resolved: 67% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[10:55:49] hello! Just to give a heads-up, we'll be doing a bulk data transfer from aqs1007 and aqs1004 to aqs1010 and aqs1011. There's a fairly large amount of data to be transferred and it will probably cause some noise on your end - is there anything we can downtime/ack before it happens?
[10:57:30] XioNoX, topranks ^^
[10:59:21] Noted: thanks for the heads up.
[11:00:12] fwiw, for the pure network part of it the #wikimedia-sre-foundations channel is probably better ;)
[11:00:25] I don't believe there is anything we need to do pro-actively; the "port utilization" alarms might fire, but we can just ack them. We don't want to adjust the (global) thresholds for those in case something else triggers them (which isn't expected).
[11:00:46] I'm open to correction there, XioNoX has more experience here, but I think we're ok.
[11:00:49] thanks!
[11:00:53] yeah, and as they're within the same DC we might not even hit any of those
[11:01:19] we're around and will let you know if it causes any issues
[11:08:18] they're also 1G hosts, so it will be all fine on the network infra side
[11:11:01] thanks!
[11:11:09] starting transfer now-ish
[12:39:35] hello folks
[12:39:57] I have a question about pybal - I am configuring a new k8s-based svc for the ML cluster
[12:40:26] the health check port for it is not the same as the one defined for the service (where http traffic flows)
[12:40:37] it is the same set of pods that answer on both ports
[12:41:16] I am configuring service.yaml and I don't recall what to specify in
[12:41:17] ProxyFetch:
[12:41:17] url:
[12:41:17] - http://localhost/healthz/ready
[12:41:27] since no other service adds a port in there
[12:41:43] (I am also checking pybal configs)
[12:42:05] is ProxyFetch assuming that the health check port is the same as the service port?
[12:47:30] if it is https://github.com/wikimedia/PyBal/blob/b331a4a4cd62b2ec519b07a69a3cc8dd7b6711d5/pybal/monitors/proxyfetch.py#L127 it seems so
[12:54:59] elukey: why not put the healthcheck on the same port? The reason they're the same on the pybal side is that the healthcheck should indicate whether the port the requests are flowing to is actually working (e.g. firewall metadata problems and a host of other things could make the healthcheck port work while the real port does not)
[12:57:16] bblack: hi! It is a pre-baked config of Istio, an ingress gateway for kubernetes that we are testing on the ML cluster. It will be our L7 proxy to the backend pods hosting ML models; we will target them using the Host: header.
[12:58:16] we could think about health checking one of the backend services, should be feasible
[12:59:10] yeah, or even making some kind of default/noop backend service that exists just for this purpose, so that healthchecking doesn't fail because the service you picked gets de-provisioned a year later or whatever.
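
[Editor's sketch] A minimal illustration of the behaviour discussed above - not PyBal's actual code. In pybal's proxyfetch monitor, the configured URL supplies the scheme and path, but the connection is made to the pooled server on the LVS service port, which is why a separate health-check port cannot be expressed in the ProxyFetch stanza. Hostnames, ports, and the helper name below are hypothetical.

    # Illustrative only; mirrors the "check URL path + service port" behaviour.
    from http.client import HTTPConnection
    from urllib.parse import urlparse

    def proxyfetch_check(server_host, service_port, check_url, timeout=5):
        """Fetch check_url's path from server_host:service_port; healthy on HTTP 200."""
        parsed = urlparse(check_url)
        path = parsed.path or "/"
        conn = HTTPConnection(server_host, service_port, timeout=timeout)
        try:
            conn.request("GET", path, headers={"Host": parsed.hostname or server_host})
            return conn.getresponse().status == 200
        finally:
            conn.close()

    # Hypothetical usage:
    # proxyfetch_check("10.64.0.1", 80, "http://localhost/healthz/ready")
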
[13:00:30] in general, I'm hoping our k8s-based services eventually don't use the generic L4LB/pybal services and talk BGP directly, but I think that's probably way outside the context of this one project's immediate needs
[13:02:05] no idea if there is this capability, maybe with some calico magic that I am not aware of. We are kinda breaking the current k8s LVS standard setup, since our stack introduces a mandatory L7 proxy in the middle that handles the routing to all pods/models.
[13:02:58] so my initial idea was to health check the L7 proxy, since the service will be based on it (and here comes the double port issue)
[13:03:29] since it health checks all the backend pods by itself, etc.
[13:03:49] anyway, I got my answer, will try to come up with something. Thanks :)
[15:23:34] Traffic, Observability-Alerting, SRE: Prometheus Varnish exporter alert: add runbook and link to dashboard - https://phabricator.wikimedia.org/T289974 (lmata)
[19:02:57] (VarnishTrafficDrop) firing: 51% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[19:07:57] (VarnishTrafficDrop) resolved: 54% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
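
[Editor's sketch] One way to read bblack's "default/noop backend service" suggestion: a tiny backend whose only job is to answer the readiness URL that ProxyFetch would poll through the Istio ingress, so the health-check target can't disappear when a real model service is deprovisioned. The port and path below are assumptions for illustration, not the ML cluster's actual configuration.

    # Minimal sketch of a dedicated readiness backend (stdlib only).
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class ReadyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz/ready":
                body = b"ok\n"
                self.send_response(200)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        # Hypothetical listen port; the ingress would route the check URL here.
        HTTPServer(("0.0.0.0", 8080), ReadyHandler).serve_forever()
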