[07:34:32] !incidents
[07:34:32] 5931 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule)
[07:34:33] 5930 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad)
[07:34:33] 5929 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad)
[07:34:33] 5927 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-d-eqiad.mgmt.eqiad.wmnet)
[07:34:33] 5926 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org)
[07:34:33] 5925 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet)
[07:34:34] 5923 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet)
[07:34:34] 5924 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org)
[07:34:35] 5922 (RESOLVED) ProbeDown sre (10.64.0.107 ip4 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip4 eqiad)
[14:05:37] hello folks, we in fundraising tech are seeing some odd network issues when trying to get to payments.wikimedia.org, we are seeing ~50% packet loss
[14:06:36] this is to eqiad, we are not seeing this issue with connections through codfw
[14:07:40] https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue ;P
[14:09:16] thanks.
[14:10:44] sorry, in this case we are at our offsite and we've verified it across various paths (hotel wifi, tethering) and just wanted to check and see if it was known more widely.
[14:11:33] i'll reach out with a phab task and reassure the masses. :)
[14:15:29] I can reproduce packet loss of between 20 and 50% from my home connection to payments.wikimedia.org
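A minimal sketch, assuming mtr is available on the reporting host, of one way to quantify the ~20-50% loss reported above per hop (the kind of report the connectivity-issue page linked at 14:07 typically asks for):

```
# Sketch only: report mode, wide hostnames, 100 probes, one summary line per hop.
# The Loss% column is what would show the ~20-50% drop described above.
mtr --report --report-wide -c 100 payments.wikimedia.org
```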
[14:17:42] yeah, jeff has reached out to arzhel and i'm collecting data. thanks!
[14:17:56] great
[14:20:48] yeah something is wrong with pfw1-eqiad, can't reach it over mgmt either, but I'm in through console, looking (cc topranks)
[14:21:36] ack, so you don't need my report too :)
[14:22:02] xe-0/2/0 up down Core: cr1-eqiad:xe-3/1/7 {#4026}
[14:24:04] * topranks that's not good
[14:24:44] good light on the cr1 side
[14:25:03] it's receiving remote-fault from the pfw
[14:28:18] just as a follow up, we pulled down the current fundraising banners when we noticed the issue, so there should be minimal donor impact.
[14:32:28] one uplink between pfw1 and cr1 is down because pfw1 is not receiving enough light
[14:32:49] the other uplink, between pfw1 and cr2, is flapping, causing the 50% packet loss
[14:33:11] DCops is on their way to fix it, they know it's urgent, but not sure we can do anything else until they get there
[14:33:26] why both failed at the same time we will need to investigate
[14:33:35] yeah that part seems very surprising
[14:33:49] any idea why the second link is flapping?
[14:35:01] it has a lower than expected light level, -10dBm
[14:35:22] thanks
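A hedged sketch of the Junos operational commands usually used for the kind of light-level check described above; the interface names (xe-0/2/0 on pfw1-eqiad, xe-3/1/7 on cr1-eqiad) are taken from the log, the rest is generic Junos CLI and not a transcript of what was actually run:

```
show interfaces xe-0/2/0 terse                            # admin/link status (the "up down" pasted above)
show interfaces diagnostics optics xe-0/2/0               # TX/RX optical power in dBm for the local optic
show interfaces xe-0/2/0 extensive | match "flap|error"   # carrier transitions and error counters for the flapping link
# "Receiver signal average optical power" around -1 dBm matches the healthy reading
# quoted later; -10 dBm or lower usually points at a failing optic or a dirty/damaged fibre.
```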
[14:39:27] the moments before each link died (thanks Cathal for finding them):
[14:39:27] https://librenms.wikimedia.org/graphs/id=57148/type=sensor_dbm/from=1743582900/to=1743604500/
[14:39:27] https://librenms.wikimedia.org/graphs/id=57147/type=sensor_dbm/from=1743582900/to=1743604500/
[14:39:46] it's a very weird coincidence
[14:41:38] XioNoX: we have no task do we? I'll open one if not
[14:42:05] We should also do an incident response document, I'll start on that
[14:42:06] topranks: Jeff was telling me Dallas was opening a task
[14:42:16] ok
[14:43:21] i did: https://phabricator.wikimedia.org/T390872
[14:44:11] it's "nice" this happened at the offsite since we can explain a lot to many people, but there are also a lot of people wanting to know all the details. :)
[14:44:38] yeah, "look at all that redundancy" :)
[14:45:47] I see them back up
[14:45:57] wha.....
[14:46:07] yep, -1dBm on cr2 now
[14:46:19] john must have replaced the optics
[14:46:40] https://www.irccloud.com/pastebin/ozAh4B3c/
[14:46:42] yeah, just got some resolution pages. but we are happy to wait for any additional work/research you need before reenabling banners or anything. don't rush.
[14:46:54] yeah absolutely
[14:52:02] dwisehaupt: I'm going to add some info to the task description if that is ok
[14:52:38] sure thing. i was just blatting things out.
[15:04:08] topranks / XioNoX / jhathaway https://docs.google.com/document/d/1xjpEAKTpe1mjBJ3-bZQ-EMAJuoRKRxcn2Elidsj_X1Y/edit?tab=t.0#heading=h.95p2g5d67t9q
[15:04:32] hello oncallers - if you are ok with it, I'd like to upgrade changeprop-codfw to a new docker image version carrying nodejs-20
[15:04:43] nothing major expected
[15:05:37] sounds good
[15:06:46] ack thanks!
[15:07:18] Hi wikimedia-sre! I'd like to ask for help with testing my AQS staging service against the staging data-gateway in Kubernetes. My current issue is that the staging AQS service can only connect to the production data-gateway. The staging data-gateway already exists, but isn't accessible from Kubernetes it seems. I've discussed with Balthazar and it seems I "need a data-gateway-staging service to be added in puppet, configured in LVS, etc, and then added to the service mesh". I'm happy to work on these if I can, but I'd need some pointers. Thanks a lot!
[15:15:28] hola mforns!
[15:15:47] cccccbukvgbcdncjdbneifdndbjefihfknurcrhhklig
[15:15:51] ahahahhaah
[15:15:54] i give up and unplug the damn thing
[15:16:11] mforns: the canonical guide is https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
[15:16:49] but I'd ask a data platform SRE to do it for you, since we may or may not use Ingress etc. (which simplifies the overall picture)
[15:16:54] do you have a task for it?
[15:17:22] also, there may be other quicker workarounds without the LVS service, let's follow up in pvt
[15:19:05] OK! thanks elukey!
[15:20:24] mforns: <3
[15:23:35] sukhe :-)
[16:21:49] !oncall
[16:21:57] !oncall-now
[16:21:57] Oncall now for team SRE, rotation business_hours:
[16:21:58] a.rnoldokoth, j.hathaway
[18:00:04] Earlier today when `sre.loadbalancer.restart-pybal` was run, search reported ~12k connection failures over 30s. Is that expected?
[18:15:40] ebernhardson: do you have the full text or a snippet?
[18:16:17] and do you mean the run in the morning by the Traffic team or someone else?
[18:17:44] sukhe: the error message we get is `Status code 503; upstream connect error or disconnect/reset before headers. reset reason: connection failure`, the spike lasted for ~30s at 7:47 and matches a SAL entry for running the same cookbook at the same time. It's not a major issue, i only brought it up in case it's unexpected and supposed to be more graceful
[18:18:57] matching log is 07:46 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs2013.*,lvs1019.*} and A:lvs
[18:19:27] (i've also put up a patch that will auto-retry 503's once at the envoy level, although not sure whether that would help here)
[18:20:12] ebernhardson: sorry, I misread. so you believe that after the cookbook was run, errors increased?
[18:21:41] sukhe: yes. The cookbook was run from 7:46 to 7:47, and at 7:47:20-7:47:40 we have ~12k errors: https://logstash.wikimedia.org/goto/b81f8cac51eff3ad3b811ca9d00e1580
[18:22:07] the only way this could have happened was if there was another pybal change that was pending and pybal was not restarted, and when we did restart it, it picked up that change along with the change we thought we were merging
[18:22:22] so yes it's possible, whether related or not, let's find out
[18:22:33] what is the service name or the backend server?
[18:22:59] the service names are search-{psi,chi,omega}-{eqiad,codfw}
[18:23:33] should be mostly eqiad since that's where most traffic is right now
[18:23:50] ok thanks, that helps
[18:23:50] chi is typically the busiest
[18:23:53] let me dig in
[18:27:17] I don't see chi, is it just search*?
[18:27:54] hmm, i suppose search-chi-eqiad is the envoy service name, it talks to search.svc.eqiad.wmnet, but that is actually 3 separate things on ports 9243, 9443 and 9643; 9243 is the busy port
[18:28:59] it might be called production-search-eqiad
[18:29:20] I don't see anything having changed anyway
[18:29:31] or search-https
[18:30:11] interesting, so perhaps it's just luck that they line up together?
[18:31:22] I mean it's quite the coincidence, but what I meant was that I don't see any change on pybal that could have resulted in a restart picking up some config change resulting in this
[18:31:32] I am now looking at the cluster/services
[18:37:11] not related, but note that name=cloudelastic1008.eqiad.wmnet is not pooeld for any of the services
[18:37:18] *pooled
[18:37:37] yea that has a hardware issue, it failed a reimage
[18:39:13] looks like it got fixed on the 31st but we haven't pulled it back in yet
[18:42:07] sukhe: thanks for the reminder... Erik's correct, that one was broken for a while, I need to make sure it reimaged properly before we add it back
[18:42:52] no worries, it can't have caused this anyway, but just pointing it out while looking at the list
[18:44:47] hmm, how does a restart of a load balancer trigger 503s?
[18:45:02] just repooled
[18:45:17] as soon as pybal goes off, traffic switches to the secondary load balancer
[18:45:56] are those 503s triggered on a service that uses another service behind the low-traffic LB?
[18:47:42] vgutierrez: a quick restart shouldn't result in these severe errors though?
[18:49:42] pybal wipes the ipvs state so persistent connections could get RSTd
[18:49:45] per service.yaml these are all on the low-traffic class
[18:50:08] where did those 503s originate?
[18:50:16] which service issued them?
[18:50:26] mostly k8s, mw-api-ext
[18:50:47] 11k on kube-mw-api-ext, 1k on kube-mw-web
[18:51:14] mw-api-ext keeps persistent connections against the search cluster?
[18:51:20] envoy does, yes
[18:51:47] and if the connection goes away, does it immediately return a 503 without trying to connect again?
[18:53:01] i'm not entirely sure what envoy does there, it's not currently configured with retries. I've added a patch today that will retry 503's once at the envoy level (not yet merged)
[18:54:34] it seems most envoy stuff isn't configured to retry, only some specific things
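A hypothetical sketch of what a route-level Envoy retry policy along the lines of the unmerged patch mentioned above could look like; the cluster name search-chi-eqiad is taken from the discussion, and the field values are illustrative rather than the actual patch:

```yaml
# Sketch only: retry once when the upstream returns a 503 or the connection
# is reset / fails before headers arrive.
route:
  cluster: search-chi-eqiad              # assumed cluster name, from the log above
  retry_policy:
    retry_on: "retriable-status-codes,reset,connect-failure"
    retriable_status_codes: [503]        # retry specifically on upstream 503s
    num_retries: 1                       # a single retry, as described above
    per_try_timeout: 1s                  # bound each attempt so a retry doesn't stack full timeouts
```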
[18:55:26] ebernhardson: anyway, I guess I don't see anything personally. the pybal restart and the RST-then-reconnect should be pretty unimpactful for low-traffic anyway, which all of these services are.
[18:56:52] sukhe: thanks for looking! I'll be optimistic that the auto-retry on 503 will solve similar issues in the future
[18:57:05] also, I guess, we do this quite frequently and if I zoom out on the logstash link, it doesn't match up with previous restarts of lvs1019 (eqiad low-traffic)
[18:57:10] but if you see it again, let us know please :)
[18:57:15] certainly
[19:15:09] I'm curious about the connection retry against search, because 12k 503s suggest that the search cluster was unreachable for some time
[19:15:34] I'll try to take a look tomorrow EU morning
[19:15:56] ebernhardson: do we have some logs task dashboard to stare at?
[19:16:05] *logstash
[19:27:58] vgutierrez: hmm, the main logstash link would be the one posted above, https://logstash.wikimedia.org/goto/b81f8cac51eff3ad3b811ca9d00e1580
[19:31:07] 12k is over ~20s, typical query rate between kube-mw-api-ext and search-chi-eqiad is 1k-1.5k/s
[19:32:01] not sure if it helps, but here's a panel showing QPS, it doesn't seem to change during the restart: https://grafana.wikimedia.org/goto/W3-RUqTHg?orgId=1
[19:33:03] i don't know how relevant it would be, but there is one oddity right now: we are routing all traffic from codfw application servers to the eqiad search clusters while a platform migration is being done in codfw
[20:17:45] cool, thanks
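For the EU-morning follow-up mentioned above, a hedged sketch of pulling the relevant per-cluster counters from a local Envoy admin endpoint on one of the callers: 9901 is Envoy's stock admin port, and the actual port, cluster name, and exposed stats in this deployment may differ.

```
# Sketch only: standard Envoy cluster counters for 503s, retries, and torn-down connections.
curl -s http://127.0.0.1:9901/stats | \
  grep -E 'cluster\.search-chi-eqiad\.(upstream_rq_503|upstream_rq_retry|upstream_cx_destroy)'
```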