[08:34:35] o/ looking for advice/suggestions on how to troubleshoot a connectivity issue between MW, envoy, LVS, elastic@eqiad (search cluster); since Apr 24 21:30 we saw a noticeable increase in errors "Status code 503; upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 113"
[08:34:51] errors: https://logstash.wikimedia.org/goto/640223cb5c9ef769d118a97a57ea1ea9
[08:36:23] seems to have started after merging this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023937
[08:41:45] this "delayed connect error: 113" error is something I don't think I've ever seen before
[09:58:38] GitLab needs maintenance for around 15-20 minutes in an hour. The downtime will be a bit longer because we also have to restart the host.
[11:18:35] GitLab maintenance done
[13:02:52] dcausse: could you put that into a task or add the serviceops tag to an existing one if you have one already?
[13:03:45] jayme: sure
[13:20:01] dcausse: this shows up here: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=search-omega-eqiad&var-destination=search-psi-eqiad&var-destination=search-chi-eqiad&viewPanel=30&from=1713960000000&to=now
[13:24:59] so it seems to be happening more on chi than on omega
[13:25:06] are you sure the new nodes behave well?
[13:25:49] and how do the elasticsearch nodes terminate TLS?
[13:26:32] jayme thanks...I just got in and I'm taking a look w/ dcausse
[13:26:38] jayme: it used to be nginx but it's perhaps envoy now, inflatador might know
[13:26:45] inflatador: good morning :)
[13:27:17] dropping the ball then. lmk if you need something
[13:27:39] jayme good morning indeed ;) Nodes still using nginx, will keep troubleshooting
[13:28:09] I tested these nodes this morning and they behaved properly, ebernhardson could trigger a few errors with curl IIUC
[13:28:27] chi is our large cluster so it's getting a lot more traffic
[13:29:02] but yes, this dashboard matches perfectly what I see in logstash
[13:30:00] Elastic hosts don't use envoy, I guess that is looking at the k8s side?
[13:33:44] I've just depooled that nodes in https://phabricator.wikimedia.org/T361268#9747942 and we no longer see errors in that logstash dashboard
[13:33:48] those*
[13:34:56] interesting
[13:35:33] sounds like we need better health checks at the very least
[13:36:55] I think this might actually be a recent nginx change
[13:39:04] Or maybe the pools are incorrect, so psi traffic is going to chi, etc.
[13:43:04] my bet would be on incorrect configuration of pools or nodes.
[13:45:57] or maybe firewall rules need to be set up on the k8s nodes for the new hosts?
[13:46:02] or vice versa?
[13:46:24] nah, they hit the VIP, right?
[13:49:21] can we compare k8s vs baremetal MW nodes? I still think it's VIP related but just curious
[13:52:38] I think that mwmaint1002 exhibited the same behavior per the task
[13:52:46] and that's a baremetal mw host.
[13:52:59] there aren't many baremetal mw hosts left btw
[13:53:08] we are above 70% in the migration
[13:53:41] but yes, the pods should hit the VIP, via the service mesh (the envoys).
[13:53:59] I can paste the service mesh configuration for the elasticsearch clusters in the task if you want
[13:54:06] sure
[14:02:13] done in https://phabricator.wikimedia.org/T363521#9748016
[14:02:34] and yes, it uses the VIP everywhere.
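(The depool at 13:33:44 and the later repool at 15:12:36 would normally be done with confctl; a minimal sketch, assuming cumin access, with the hostname and the service tag as illustrative placeholders rather than the values actually used:)

    # Inspect current conftool state for one of the new hosts
    # (hostname is a placeholder, not necessarily one of the affected nodes).
    sudo confctl select 'name=elastic1108.eqiad.wmnet' get

    # Depool it from the LVS service; the service tag here is an
    # assumption -- check conftool-data for the real service name.
    sudo confctl select 'name=elastic1108.eqiad.wmnet,service=elasticsearch-ssl' set/pooled=no

    # Repool it again once testing is done.
    sudo confctl select 'name=elastic1108.eqiad.wmnet,service=elasticsearch-ssl' set/pooled=yes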
[14:04:37] akosiaris thanks, where is the best place to get the full pybal and confctl config including ports/healthchecks etc?
[14:05:01] I'm digging through puppet hiera but the live config would be more useful
[14:05:33] inflatador: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml
[14:05:48] that's the service catalog, it does have both ports and healthchecks
[14:06:48] conftool data is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/conftool-data/node/eqiad.yaml but this is admin state. Operational state can be fetched on cumin1002, via conftool select 'name=elastic10.*eqiad.wmnet' get
[14:07:15] note that it's a regular expression, so you can fetch data with whatever re matches your needs
[14:07:51] and there are other keys aside from name, like dc, cluster and service
[14:10:02] for getting operational state there's also per-service-zone exports as plain files at https://config-master.wikimedia.org/pybal/
[14:10:19] yeah, I tend to look at config-master
[14:12:25] I thiiiiiink maybe Amir1 had made a cool js UI for config-master data at some point?
[14:12:45] fault tolerance I think
[14:12:56] https://fault-tolerance.toolforge.org/pools
[14:13:02] https://fault-tolerance.toolforge.org/pools/table
[14:13:09] oh very nice
[14:13:53] also this: https://fault-tolerance.toolforge.org/map?cluster=elasticsearch
[14:14:00] that's it, thanks
[14:14:03] that's scary 😱
[14:31:01] dcausse do you have any example queries similar to the ones that are failing? I can cobble something together if not
[14:31:45] curl commands, that is
[14:33:18] inflatador: I think anything could fail so just asking for the headers might be enough, e.g. curl https://search.svc.eqiad.wmnet:9243/ or via envoy from a mw host: curl http://localhost:6102/
[14:36:08] I could not repro the issue with curl myself this morning but reading backscroll I think Erik was able to reproduce a couple of errors running curl in a loop over 1 minute
[14:36:25] dcausse I'm trying to curl locally from the bad hosts, let me try a loop there
[15:00:58] https://wiki.gentoo.org/wiki/Project:Council/AI_policy is interesting
[15:12:36] heads-up, I'm repooling one of the "bad" elastic hosts to see what happens...elastic discussion will continue in #search if interested
[18:40:00] gl/hf
[18:40:13] sonnofa.... I didn't see the timestamp...
[18:40:19] * brett is bad with computers
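(Putting the state-checking pointers from 14:06:48 and 14:10:02 together, a rough sketch: the conftool regex is the one quoted in the discussion, while the per-service path under /pybal/ and the grep are assumptions to show the idea:)

    # Operational state (pooled/weight) via the confctl CLI from conftool,
    # run on cumin1002; the name key is matched as a regular expression.
    sudo confctl select 'name=elastic10.*eqiad.wmnet' get

    # The same data as plain-text pybal exports on config-master; the
    # exact per-service file name under /pybal/eqiad/ is an assumption.
    curl -s https://config-master.wikimedia.org/pybal/eqiad/search | grep -i elastic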
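(And a rough version of the curl-in-a-loop check mentioned at 14:33:18/14:36:08: the two URLs are the ones given above, while the one-minute duration and one-request-per-second pacing are guesses:)

    # Hit the search VIP once per second for a minute and log anything
    # that is not an HTTP 200. Swap in http://localhost:6102/ (and drop
    # the -k) to go through the local envoy listener on a MW host instead.
    for i in $(seq 1 60); do
      code=$(curl -sk -o /dev/null -w '%{http_code}' https://search.svc.eqiad.wmnet:9243/)
      [ "$code" != 200 ] && echo "$(date -u '+%H:%M:%S') HTTP $code"
      sleep 1
    done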