[08:34:35] o/ looking for advice/suggestions on how to troubleshoot a connectivity issue between MW, envoy, LVS, elastic@eqiad (search cluster); since Apr 24 21:30 we saw a noticeable increase in errors "Status code 503; upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 113"
[08:34:51] errors: https://logstash.wikimedia.org/goto/640223cb5c9ef769d118a97a57ea1ea9
[08:36:23] seems to have started after merging this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023937
[08:41:45] this "delayed connect error: 113" error is something I don't think I've ever seen before
[09:58:38] GitLab needs maintenance for around 15-20 minutes in an hour. The downtime will be a bit longer because we also have to restart the host.
[11:18:35] GitLab maintenance done
[13:02:52] dcausse: could you put that into a task or add the serviceops tag to an existing one if you have one already?
[13:03:45] jayme: sure
[13:20:01] dcausse: this shows up here: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=search-omega-eqiad&var-destination=search-psi-eqiad&var-destination=search-chi-eqiad&viewPanel=30&from=1713960000000&to=now
[13:24:59] so it seems to be happening more on chi than on omega
[13:25:06] are you sure the new nodes behave well?
[13:25:49] and how do the elasticsearch nodes terminate TLS?
[13:26:32] jayme thanks...I just got in and I'm taking a look w/ dcausse
[13:26:38] jayme: it used to be nginx but it's perhaps envoy now, inflatador might know
[13:26:45] inflatador: good morning :)
[13:27:17] dropping the ball then. lmk if you need something
[13:27:39] jayme good morning indeed ;) Nodes still using nginx, will keep troubleshooting
[13:28:09] I tested these nodes this morning and they behaved properly, ebernhardson could trigger a few errors with curl IIUC
[13:28:27] chi is our large cluster so it's getting a lot more traffic
[13:29:02] but yes, this dashboard matches perfectly what I see in logstash
[13:30:00] Elastic hosts don't use envoy, I guess that is looking at the k8s side?
[13:33:44] I've just depooled that nodes in https://phabricator.wikimedia.org/T361268#9747942 and we no longer see errors in that logstash dashboard
[13:33:48] those*
[13:34:56] interesting
[13:35:33] sounds like we need better health checks at the very least
[13:36:55] I think this might actually be a recent nginx change
[13:39:04] Or maybe the pools are incorrect, so psi traffic is going to chi, etc.
[13:43:04] my bet would be on incorrect configuration of pools or nodes.
[13:45:57] or maybe firewall rules need to be set up on the k8s nodes for the new hosts?
[13:46:02] or vice versa?
[13:46:24] nah, they hit the VIP, right?
[13:49:21] can we compare k8s vs baremetal MW nodes? I still think it's VIP related but just curious
[13:52:38] I think that mwmaint1002 exhibited the same behavior per the task
[13:52:46] and that's a baremetal mw host.
[13:52:59] there aren't many baremetal mw hosts left btw
[13:53:08] we are above 70% in the migration
[13:53:41] but yes, the pods should hit the VIP, via the service mesh (the envoys).
[13:53:59] I can paste the service mesh configuration for the elasticsearch clusters in the task if you want
[13:54:06] sure
[14:02:13] done in https://phabricator.wikimedia.org/T363521#9748016
[14:02:34] and yes, it uses the VIP everywhere.
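(The depool at 13:33:44 and the later repool at 15:12:36 would normally be done with confctl; a minimal sketch, assuming cumin access, with the hostname and the service tag as illustrative placeholders rather than the values actually used:)

    # Inspect current conftool state for one of the new hosts
    # (hostname is a placeholder, not necessarily one of the affected nodes).
    sudo confctl select 'name=elastic1108.eqiad.wmnet' get

    # Depool it from the LVS service; the service tag here is an
    # assumption -- check conftool-data for the real service name.
    sudo confctl select 'name=elastic1108.eqiad.wmnet,service=elasticsearch-ssl' set/pooled=no

    # Repool it again once testing is done.
    sudo confctl select 'name=elastic1108.eqiad.wmnet,service=elasticsearch-ssl' set/pooled=yes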
[14:04:37] akosiaris thanks, where is the best place to get the full pybal and confctl config including ports/healthchecks etc?
[14:05:01] I'm digging through puppet hiera but the live config would be more useful
[14:05:33] inflatador: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml
[14:05:48] that's the service catalog, it does have both ports and healthchecks
[14:06:48] conftool data is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/conftool-data/node/eqiad.yaml but this is admin state. Operational state can be fetched on cumin1002, via conftool select 'name=elastic10.*eqiad.wmnet' get
[14:07:15] note that it's a regular expression, so you can fetch data with whatever re matches your needs
[14:07:51] and there are other keys aside from name, like dc, cluster and service
[14:10:02] for getting operational state there's also per-service-zone exports as plain files at https://config-master.wikimedia.org/pybal/
[14:10:19] yeah, I tend to look at config-master
[14:12:25] I thiiiiiink maybe Amir1 had made a cool js UI for config-master data at some point?
[14:12:45] fault tolerance I think
[14:12:56] https://fault-tolerance.toolforge.org/pools
[14:13:02] https://fault-tolerance.toolforge.org/pools/table
[14:13:09] oh very nice
[14:13:53] also this: https://fault-tolerance.toolforge.org/map?cluster=elasticsearch
[14:14:00] that's it, thanks
[14:14:03] that's scary 😱
[14:31:01] dcausse do you have any example queries similar to the ones that are failing? I can cobble something together if not
[14:31:45] curl commands, that is
[14:33:18] inflatador: I think anything could fail so just asking for the headers might be enough, e.g. curl https://search.svc.eqiad.wmnet:9243/ or via envoy from a mw host: curl http://localhost:6102/
[14:36:08] I could not repro the issue with curl myself this morning but reading backscroll I think Erik was able to reproduce a couple of errors running curl in a loop over 1 minute
[14:36:25] dcausse I'm trying to curl locally from the bad hosts, let me try a loop there
[15:00:58] https://wiki.gentoo.org/wiki/Project:Council/AI_policy is interesting
[15:12:36] heads-up, I'm repooling one of the "bad" elastic hosts to see what happens...elastic discussion will continue in #search if interested
[18:40:00] gl/hf
[18:40:13] sonnofa.... I didn't see the timestamp...
[18:40:19] * brett is bad with computers
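(Putting the state-checking pointers from 14:06:48 and 14:10:02 together, a rough sketch: the conftool regex is the one quoted in the discussion, while the per-service path under /pybal/ and the grep are assumptions to show the idea:)

    # Operational state (pooled/weight) via the confctl CLI from conftool,
    # run on cumin1002; the name key is matched as a regular expression.
    sudo confctl select 'name=elastic10.*eqiad.wmnet' get

    # The same data as plain-text pybal exports on config-master; the
    # exact per-service file name under /pybal/eqiad/ is an assumption.
    curl -s https://config-master.wikimedia.org/pybal/eqiad/search | grep -i elastic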
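(And a rough version of the curl-in-a-loop check mentioned at 14:33:18/14:36:08: the two URLs are the ones given above, while the one-minute duration and one-request-per-second pacing are guesses:)

    # Hit the search VIP once per second for a minute and log anything
    # that is not an HTTP 200. Swap in http://localhost:6102/ (and drop
    # the -k) to go through the local envoy listener on a MW host instead.
    for i in $(seq 1 60); do
      code=$(curl -sk -o /dev/null -w '%{http_code}' https://search.svc.eqiad.wmnet:9243/)
      [ "$code" != 200 ] && echo "$(date -u '+%H:%M:%S') HTTP $code"
      sleep 1
    done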