[00:50:05] <wikibugs>	 serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (Papaul)
[06:57:10] <wikibugs>	 serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (JMeybohm) >>! In T260663#6924279, @JMeybohm wrote: > The cookbook does not seem to work (tried d...
[09:04:57] <wikibugs>	 serviceops, Traffic: Set per-request timeout on ATS-BE - https://phabricator.wikimedia.org/T315533 (Vgutierrez)
[09:05:23] <wikibugs>	 serviceops, Traffic: Set per-request timeout on ATS-BE - https://phabricator.wikimedia.org/T315533 (Vgutierrez) p:Triage→Medium
[09:07:47] <vgutierrez>	 hi team, we've detected some connectivity issues between eqsin and eqiad, specifically an increased latency as recorded in T315429
[09:08:32] <vgutierrez>	 ats-be@eqsin seems to be experiencing more connect errors than the average ats-be instance.. and I'm wondering if a connect timeout of 1.0 seconds could be too tight for eqsin
[09:19:17] <vgutierrez>	 per https://www.envoyproxy.io/docs/envoy/v1.18.3/faq/configuration/timeouts.html?highlight=transport_socket_connect_timeout#tcp I understand that if transport_socket_connect_timeout isn't set, the connect_timeout applies to the TLS handshake as well
[09:25:26] <vgutierrez>	 poor man's benchmark
[09:25:35] <vgutierrez>	 https://www.irccloud.com/pastebin/cPqqcNDr/
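[editor's note] The question above is whether a 1.0 second connect timeout that also covers the TLS handshake is too tight for a high-latency link. As a rough illustration only, the sketch below estimates the idealized connect time as one round trip for the TCP handshake plus one round trip for a full TLS 1.3 handshake, ignoring packet loss, retransmits and server processing time. The function names, the default values and the `measure_connect` helper are assumptions made for this example; they are not the contents of the pastebin and not how ats-be or Envoy measure anything.

```python
import socket
import ssl
import time


def estimated_connect_time(rtt_s, tcp_rtts=1, tls_rtts=1):
    """Idealized seconds until the TLS handshake completes.

    Assumes 1 RTT for the TCP handshake and 1 RTT for a full TLS 1.3
    handshake, with no loss, retransmits or server processing delay.
    """
    return rtt_s * (tcp_rtts + tls_rtts)


def fits_connect_timeout(rtt_s, connect_timeout_s=1.0):
    """True if the idealized connect time stays under the timeout."""
    return estimated_connect_time(rtt_s) < connect_timeout_s


def measure_connect(host, port=443, timeout_s=5.0):
    """Hypothetical helper: time the TCP connect and TLS handshake separately.

    Returns (tcp_seconds, tls_seconds, negotiated_tls_version).
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        tcp_done = time.perf_counter()
        ctx = ssl.create_default_context()
        # wrap_socket performs the TLS handshake before returning.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls_done = time.perf_counter()
            return tcp_done - start, tls_done - tcp_done, tls.version()
```

Under these assumptions a 250 ms round trip leaves half of the 1.0 second budget unused, while anything at or above 500 ms cannot complete in time even on a loss-free path. Real measurements, such as the ones linked above, will be worse than this lower bound.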
[09:28:55] <_joe_>	 vgutierrez: if we're having such latencies to eqsin, is it even worth keeping pooled?
[09:29:33] <_joe_>	 I mean a 1s to connect seems unacceptable to me, but that's just coarse intuition
[09:30:01] <_joe_>	 also doesn't ats use TLS 1.3?
[09:30:06] <vgutierrez>	 yep
[09:30:22] <_joe_>	 not sure what s_client would do though
[09:30:27] <vgutierrez>	 TLSv1.3
[09:30:31] <_joe_>	 ok
[09:30:33] <_joe_>	 wow
[09:30:34] <vgutierrez>	 "New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384"
[09:30:38] <_joe_>	 those numbers are quite bad
[09:31:15] <_joe_>	 vgutierrez: to be clear, changing that timeout can be done, but it will need a deploy of all the involved services
[09:31:17] <vgutierrez>	 indeed... as a comparison... from esams: https://www.irccloud.com/pastebin/nijBGK2m/
[09:31:31] <_joe_>	 IME that... never goes without hiccups
[09:31:58] <_joe_>	 so, we can do it, totally, but it will take some time
[09:32:35] <vgutierrez>	 considering it's a fragile matter, I'll open a subtask of T315429 gathering this data
[09:33:13] <_joe_>	 yeah my suggestion would be to depool eqsin for now if the problems are serious enough
[09:34:47] <vgutierrez>	 it isn't that bad considering that ats-be doesn't open a lot of connections per second
[09:35:57] <vgutierrez>	 but as you can spot in https://grafana.wikimedia.org/goto/VvB7C_mVz?orgId=1, T315429 heavily impacts TTFB for miss requests originating in eqsin, and in general eqsin performance is quite bad compared to esams or drmrs
[09:38:42] <vgutierrez>	 per https://grafana.wikimedia.org/goto/2BndC_i4k?orgId=1 ats-be@eqsin-text is reusing connections for 96-99% of the requests, depending on the backend
[09:52:38] <wikibugs>	 serviceops, SRE, SRE-Access-Requests: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (Joe)
[09:52:53] <wikibugs>	 serviceops, SRE, SRE-Access-Requests: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (Joe)
[09:53:03] <wikibugs>	 serviceops, SRE, SRE-Access-Requests: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (Joe) p:Triage→Medium
[09:54:02] <wikibugs>	 serviceops, SRE, SRE-Access-Requests: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (LSobanski) Approved.
[10:25:07] <hnowlan>	 the thumbor blubber file is ready for review if anyone has a few minutes - there's a helm chart to go with it too but I'll wait until we have images built and some testing done before bringing that up https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/813613
[10:28:52] <_joe_>	 hnowlan: i won't have time today, my afternoon is meeting after meeting
[13:54:57] <wikibugs>	 serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm)
[13:58:22] <wikibugs>	 serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm) Reboot of staging clusters and codfw (batchsize 1, took ~3.25 hours) went smoothly without any a...
[13:58:34] <wikibugs>	 serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm) Open→Resolved
[14:54:32] <wikibugs>	 serviceops, Thumbor, Thumbor Migration, Performance-Team (Radar), User-jijiki: Terminate Thumbor with SSL - https://phabricator.wikimedia.org/T180696 (hnowlan) a:hnowlan