[06:58:20] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[07:36:50] Acme-chief, Traffic: Provide second acmechief server configured for Puppet 7 in eqiad - https://phabricator.wikimedia.org/T352242 (Vgutierrez) I don't think it's a problem of load as our puppetization doesn't balance Puppet API requests between different acme-chief hosts but as @MoritzMuehlenhoff mention...
[08:27:06] Acme-chief, Traffic: Provide second acmechief server configured for Puppet 7 in eqiad - https://phabricator.wikimedia.org/T352242 (MoritzMuehlenhoff) >>! In T352242#9381786, @Vgutierrez wrote: > Given that we have some acme-chief clients running Buster (alert[1001,2001].wikimedia.org,apt[1001,2001].wikim...
[10:06:55] Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (Vgutierrez)
[10:07:14] Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (Vgutierrez) p:Triage→High
[10:08:18] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (Vgutierrez)
[10:12:59] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (ayounsi) Open→Resolved Deleted.
[10:19:32] Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (MoritzMuehlenhoff) JFTR, if these turn out to be relevant for our use of haproxy as well, we also have the option to move to a dual library approach. I already did this in the past with Debian jessie which only had openssl 1...
[12:02:45] Traffic, SRE, Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (CodeReviewBot) fabfur merged https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/1 Basic retry mechanism for specific kafka errors
[12:10:46] Traffic, netops, Infrastructure-Foundations, SRE, ops-codfw: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney)
[12:11:07] Traffic, netops, Infrastructure-Foundations, SRE, ops-codfw: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney) p:Triage→Medium
[12:31:16] Hi! I have a CR in which I'm switching the state of a new LVS service from `service_setup` to `lvs_setup` (https://gerrit.wikimedia.org/r/c/operations/puppet/+/980368/) If someone has already gone through the steps listed at https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers and has a bit of time, I wouldn't mind pairing on it.
[12:31:17] Thank you!
[12:32:06] the whole 'you can take down LVS for everyone' keeps me on my toes
[13:32:36] Traffic, netops, Infrastructure-Foundations, SRE, ops-codfw: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney)
[14:21:13] Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (BBlack) The perf issues are definitely relevant for traffic's use of haproxy (in a couple of different roles). Your option (making a libssl1.1-dev for bookworm that tracks the sec fixes that are still done for the bullseye...
[14:36:54] vgutierrez: Are you around? I can pair with brouberol to prod his service, but I'd like backup if something goes wrong
[14:37:40] ack
[14:37:46] claime: you can go ahead, some of us are around
[14:37:56] nice thanks
[14:38:14] eqiad lvs are 'lvs1018' and 'lvs1020', and the cookbook works?
[14:38:42] low-traffic lvs are indeed 1018 and 1020
[14:38:56] if you are referring to the pybal restart cookbook.. yes, it works
[14:39:07] Needs the query trick right?
[14:39:52] Add the actual cumin P{lvs[1018,1020].eqiad.wmnet} query
[14:43:35] yep
[14:44:02] TIL https://github.com/haproxy/wiki/wiki/SSL-Libraries-Support-Status#openssl
[14:44:06] really interesting
[14:44:19] (I got it from Valentin's comment about bookworm on cp nodes)
[14:45:29] godog: I'd like your take on https://gerrit.wikimedia.org/r/c/operations/alerts/+/980280.. especially on the query, have I overdone it? there is a simpler way?
[14:50:09] Traffic, netops, Infrastructure-Foundations, SRE, and 2 others: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney)
[14:53:07] vgutierrez: ack, I'll take a look in a little bit
[15:07:27] Traffic, SRE, Patch-For-Review: Add version flag to purged - https://phabricator.wikimedia.org/T347839 (CodeReviewBot) fabfur closed https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/2 Draft: Add version print option
[15:12:11] It was 1019 and 1020
[15:12:56] yeah
[15:12:58] see modules/profile/manifests/lvs/configuration.pp
[15:12:58] Hmm the Pybal CRITs are blocking the cookbook execution
[15:14:40] claime: oh wow
[15:14:47] that's an interesting side effect
[15:15:06] claime: I'd proceed manually then
[15:15:12] ok just restart pybal then
[15:15:16] proceeding
[15:15:17] claime: +1
[15:15:24] lvs1020 first :)
[15:15:30] ack
[15:22:51] claime: all good?
let us know if you need anything
[15:23:32] I'll merge the discovery records and do further checks
[15:23:52] Both pybals have been restarted, the service ip is mounted, but I see no backends, will check back if we can't find out why
[15:28:32] ok, just a config mistake in the service definition, the conftool cluster was wrong, fixing it
[15:54:14] Traffic, netops, Infrastructure-Foundations, SRE, ops-codfw: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 (cmooney)
[15:56:28] sukhe: hmm it would appear the backend service isn't responding on the port, so obviously pybal can't connect to it
[15:56:36] I suppose we should revert
[15:56:40] brouberol: ^
[15:57:22] claime: I see some confd errors as well, can that be the cause?
[15:57:31] 10:55:47 <+jinxer-wm> (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd
[15:57:35] this
[15:57:49] They're fixed now
[15:58:04] that was the config error I was talking about earlier
[16:01:09] oh col
[16:03:03] cool
[16:08:46] Traffic, netops, Infrastructure-Foundations, SRE, and 2 others: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 (cmooney)
[16:09:09] sukhe: We're going to rollback
[16:09:18] claime: ok thakns
[16:10:19] sukhe: If we go back to service_setup that should be fine, right?
[16:11:36] then remove the ip:port from ipvsadm
[16:13:09] claime: yeah should be fine
[16:13:13] sounds about right
[16:13:27] if not, we will see :)
[16:21:17] sorry about the deploy/rollback.
Somehow, the ingress gateway (downstream to pybal) isn't working properly even if config looks right, I wrongly assumed it was working as should
[16:22:15] sukhe: looks like I don't need to remove the ipvs
[16:22:28] 1020 is green after a puppet run and pybal restart
[16:22:40] waiting another minute before doing 1019
[16:22:50] claime: thanks again for your time
[16:24:28] claime: I will check once after the meeting
[16:24:34] PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker1006.eqiad.wmnet, dse-k8s-worker1007.eqiad.wmnet, dse-k8s-worker1005.eqiad.wmnet, dse-k8s-worker1008.eqiad.wmnet are marked down but pooled
[16:24:53] sukhe: yep, that's because the ingress isn't actually responding
[16:24:56] yep
[16:25:06] Hence rolling back because we don't have time to fix it in a reasonable time frame
[16:25:42] sukhe: both 1019 and 1020 are green now
[16:25:50] thanks claime
[16:26:21] no problem, can I leave it to you to check everything if it does need additional action?
[16:26:30] I would probably do more harm than good x)
[16:26:32] yes please
[16:26:45] icinga looking green is a good start, we will check later after our team meeting
[16:26:52] Awesome, thank you <3
[16:26:59] Sorry for the additional work
[16:27:18] np at all
[16:50:25] Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (Joe) Almost anything relevant internally uses envoy to mediate TLS both client and server side, so it's probably useful to list the oddballs. Off the top of my head, I am sure changeprop doesn't use envoy, and I'm not sure...
[17:00:51] ryankemper: give us a few seconds.. sorry :)
[17:01:00] vgutierrez: nw
[17:11:45] Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (MatthewVernon) I think `ms-*` swift will fall foul of this too, via the wmf-rewrite middleware (which is using python's `urllib.request.build_opener` to talk to e.g.
`thumbor.svc.codfw.wmnet:8800` ) [I'm not 100% sure, that...
[17:12:18] Traffic, SRE-swift-storage: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (MatthewVernon)
[17:14:18] Traffic, SRE-swift-storage: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (hnowlan) >>! In T352744#9383998, @MatthewVernon wrote: > I think `ms-*` swift will fall foul of this too, via the wmf-rewrite middleware (which is using python's `urllib.request.build_opener` to talk t...
[17:18:50] claime: hosts look good, thanks!
[17:50:02] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[18:25:23] bblack: Wrote up a little summary of our meeting here: https://phabricator.wikimedia.org/T351650#9384234 Would you be able to read over that comment and see if I butchered anything? :D I'll be bringing this to the rest of Search team but want to make sure I've got the right fundamental understanding in place first
[20:56:04] Traffic, Data-Engineering, Movement-Insights: Identify and label prefetch proxy data in our traffic - https://phabricator.wikimedia.org/T346463 (WDoranWMF)