[00:50:05] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (Papaul)
[06:57:10] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (JMeybohm) >>! In T260663#6924279, @JMeybohm wrote: > The cookbook does not seem to work (tried d...
[09:04:57] serviceops, Traffic: Set per-request timeout on ATS-BE - https://phabricator.wikimedia.org/T315533 (Vgutierrez)
[09:05:23] serviceops, Traffic: Set per-request timeout on ATS-BE - https://phabricator.wikimedia.org/T315533 (Vgutierrez) p:Triage→Medium
[09:07:47] hi team, we've detected some connectivity issues between eqsin and eqiad, specifically an increased latency as recorded in T315429
[09:08:32] ats-be@eqsin seems to be experiencing more connect errors than the average ats-be instance.. and I'm wondering if a connect timeout of 1.0 seconds could be too tight for eqsin
[09:19:17] per https://www.envoyproxy.io/docs/envoy/v1.18.3/faq/configuration/timeouts.html?highlight=transport_socket_connect_timeout#tcp I understand that if transport_socket_connect_timeout isn't set, the connect_timeout applies to the TLS handshake as well
[09:25:26] poor man's benchmark
[09:25:35] https://www.irccloud.com/pastebin/cPqqcNDr/
[09:28:55] <_joe_> vgutierrez: if we're having such latencies to eqsin, is it even worth keeping pooled?
[09:29:33] <_joe_> I mean a 1s to connect seems unacceptable to me, but that's just coarse intuition
[09:30:01] <_joe_> also doesn't ats use TLS 1.3?
[09:30:06] yep
[09:30:22] <_joe_> not sure what s_client would do though
[09:30:27] TLSv1.3
[09:30:31] <_joe_> ok
[09:30:33] <_joe_> wow
[09:30:34] "New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384"
[09:30:38] <_joe_> those numbers are quite bad
[09:31:15] <_joe_> vgutierrez: to be clear, changing that timeout can be done, but it will need a deploy of all the involved services
[09:31:17] indeed... as a comparison... from esams: https://www.irccloud.com/pastebin/nijBGK2m/
[09:31:31] <_joe_> IME that... never goes without hiccups
[09:31:58] <_joe_> so, we can do it, totally, but it will take some time
[09:32:35] considering it's a fragile matter, I'll open a subtask of T315429 gathering this data
[09:33:13] <_joe_> yeah my suggestion would be to depool eqsin for now if the problems are serious enough
[09:34:47] it isn't that bad considering that ats-be doesn't open a lot of connections per second
[09:35:57] but as you can spot in https://grafana.wikimedia.org/goto/VvB7C_mVz?orgId=1, T315429 heavily impacts TTFB for miss requests originating in eqsin, and in general eqsin performance is quite bad compared to esams or drmrs
[09:38:42] per https://grafana.wikimedia.org/goto/2BndC_i4k?orgId=1 ats-be@eqsin-text is reusing connections for 96-99% of the requests depending on the backend
[09:52:38] serviceops, SRE, SRE-Access-Requests: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (Joe)
[09:52:53] serviceops, SRE, SRE-Access-Requests: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (Joe)
[09:53:03] serviceops, SRE, SRE-Access-Requests: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (Joe) p:Triage→Medium
[09:54:02] serviceops, SRE, SRE-Access-Requests: Move Clement Goubert to ops - https://phabricator.wikimedia.org/T315538 (LSobanski) Approved.
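Note on the 09:08-09:38 exchange: the question was whether a 1.0 s connect timeout is too tight for eqsin when, per the reading of the Envoy docs quoted at 09:19:17, that timeout also covers the TLS handshake. The pastebin contents are not in this log, so the following is only a minimal sketch of one way to time a fresh connection, not the benchmark that was actually run. The function names, the default 1.0 s budget, and the `covers_tls` flag are assumptions made for illustration.

```python
import socket
import ssl
import time


def measure_connect(host, port=443, timeout=5.0):
    """Time one fresh connection: returns (tcp_seconds, tls_seconds, tls_version)."""
    ctx = ssl.create_default_context()
    start = time.monotonic()
    raw = socket.create_connection((host, port), timeout=timeout)
    tcp_done = time.monotonic()
    try:
        # wrap_socket() performs the TLS handshake before returning
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls_done = time.monotonic()
            version = tls.version()  # e.g. "TLSv1.3"
    except Exception:
        raw.close()
        raise
    return tcp_done - start, tls_done - tcp_done, version


def fits_connect_budget(tcp_s, tls_s, connect_timeout_s=1.0, covers_tls=True):
    """Check measured timings against a connect timeout.

    covers_tls=True models the assumption from the chat that the timeout
    also spans the TLS handshake; set it to False to count TCP only.
    """
    elapsed = tcp_s + tls_s if covers_tls else tcp_s
    return elapsed <= connect_timeout_s
```

Calling `measure_connect()` against a backend hostname from a host in each site and feeding the result to `fits_connect_budget()` would give a rough per-site comparison. Because ats-be reuses connections for most requests, as noted at 09:38:42, only the small share of requests that open a new connection would pay this cost.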
[10:25:07] the thumbor blubber file is ready for review if anyone has a few minutes - there's a helm chart to go with it too but I'll wait until we have images built and some testing done before bringing that up https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/813613
[10:28:52] <_joe_> hnowlan: i won't have time today, my afternoon is meeting after meeting
[13:54:57] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm)
[13:58:22] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm) Reboot of staging clusters and codfw (batchsize 1, took ~3.25 hours) went smoothly without any a...
[13:58:34] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (JMeybohm) Open→Resolved
[14:54:32] serviceops, Thumbor, Thumbor Migration, Performance-Team (Radar), User-jijiki: Terminate Thumbor with SSL - https://phabricator.wikimedia.org/T180696 (hnowlan) a:hnowlan