[08:57:06] Hi, we are currently designing a solution to let Spark jobs running on the K8S DSE cluster access HDFS and Hive (kerberized secured services).
[08:57:07] For retrieving the credentials to be used by the Spark jobs we have some good solutions, like relying on a Hadoop Delegation Token or a Kerberos ticket cache.
[08:57:07] But our main challenge is how to share these credentials with the Spark jobs running on K8S, as the cluster will be used by multiple users (human and service accounts).
[08:57:07] We have a proposal, but it is based on Vault (https://www.vaultproject.io/), which is not an available tool at Wikimedia.
[08:57:07] Some other approaches have been tested/designed. One of them that could work relies on the only mechanism I know of to ensure multi-tenancy on K8S, which is based on multiple namespaces (meaning one namespace per user running jobs on the cluster). I'm not a big fan of that solution due to the number of namespaces we could end up with.
[08:57:07] Could you please take a look at/review our proposal in this doc https://docs.google.com/document/d/1Aub7lUr1nPGN3MXz8FI7CCCZ5a5Y1BRpY3poVmui6AM/edit# and let us know if you have other ideas or strong preferences for alternative solutions?
[09:46:07] <_joe_> nfraison_: I think this is a discussion for the #wikimedia-k8s-sig channel
[09:46:32] ack
[09:46:45] <_joe_> nfraison_: I am personally generally wary of hashicorp products
[09:46:55] <_joe_> vault being the possible exception
[09:48:25] <_joe_> but I'd like us to choose a secrets repository that works across the board - so also for puppet-based stuff
[09:58:21] we were using vault at my previous company to secure secrets deployed with chef. I'm quite sure that it could be done with puppet too, but that is probably also the case for other secrets repositories
[10:02:25] <_joe_> yeah I think there was a deliberate choice NOT to use vault for our internal pkis though - I will share the doc with the person who did the evaluation back then
[11:14:39] serviceops, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (akosiaris) ` kosiaris@re0.cr1-codfw> show route receive-protocol bgp 10.192.0.195 detail inet.0: 904595 destinations, 1757164 routes (904...
[12:08:08] serviceops, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (MatthewVernon)
[13:34:19] serviceops, DC-Ops, SRE, ops-codfw: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (Clement_Goubert) a:Papaul
[13:34:57] serviceops, DC-Ops, ops-codfw: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (Clement_Goubert)
[14:52:31] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) These hosts are delicate, they run the MediaWiki job queues :) We can take down a node but it is very important to preserve the /srv partition to avoid kafka having to get all the data back from other brokers. From netb...
[15:01:16] hi folks
[15:01:18] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (elukey) Moreover it would be really great to couple this task with T319372, if possible, so that every new reimage will start from PKI directly.
[15:01:26] I added some thoughts to https://phabricator.wikimedia.org/T332013 for next week
[15:01:43] so ideally we could end up with kafka main on bullseye and PKI
[15:02:17] it should be ok but I have only one doubt, namely whether there are clients still using the puppet ca bundle (only) to validate TLS connections to kafka brokers
[15:02:24] especially on wikikube
[15:02:25] <_joe_> elukey: ideally NOT next week
[15:02:42] <_joe_> I don't think it's a great idea to upgrade a cluster that delicate during a hectic week of work
[15:02:55] _joe_ it is work that is part of the sprint week
[15:02:57] <_joe_> without any of the service owners in attendance of the process
[15:03:10] <_joe_> elukey: I'm saying it is a bad idea
[15:03:19] <_joe_> and I'm about to express the opinion on the task
[15:03:59] I don't really think that reimaging a kafka node in a 5-node cluster is a problem, we can surely avoid pki, but the reimages are totally fine
[15:04:13] if you prefer we can skip but I don't really see the problems that you mentioned
[15:04:20] anyway, I was just offering some help
[15:05:14] serviceops: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 (Joe) I don't think it's advisable to upgrade such clusters in a hurry during a sprint week, FWIW. If anything goes wrong with upgrading these hosts, we would have catastrophic failures to the functionality of the whole we...
[15:05:16] <_joe_> elukey: 1 node, no
[15:05:25] <_joe_> 5 nodes in a week, it is definitely
[15:05:43] <_joe_> and say something goes wrong - who's on the hook for the upcoming issues? you, most probably :)
[15:05:46] <_joe_> and me
[15:06:12] of course I'd be on the hook, but I haven't really said the whole cluster :)
[15:06:26] we can try one node only, the rest to be done later on
[15:10:49] <_joe_> ok ok, sorry I didn't know you'd be involved :)
[15:10:58] <_joe_> I thought you were giving instructions to a third-party
[15:30:26] serviceops, DC-Ops, ops-codfw: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (Papaul) a:Papaul→Jhancock.wm
[15:47:43] serviceops, DC-Ops, ops-codfw: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (Jhancock.wm) a:Jhancock.wm→Clement_Goubert the physical PSU is showing as up and the server does not have an amber warning light on. replaced PSU from decommed server. alert...
[18:36:20] serviceops, Commons, MediaWiki-File-management, SRE, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (doctaxon) I get this 429 error with <50 thumbnails, nearly every day. Is this a part of solving this...
[19:03:47] serviceops, DC-Ops, ops-codfw: hw troubleshooting: Broken PSU on parse2004 - https://phabricator.wikimedia.org/T332119 (Papaul) Open→Resolved Icinga checks for the PSUs are all green. We can resolve the task.
[21:16:20] serviceops, Commons, MediaWiki-File-management, SRE, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (TheDJ) @doctaxon There are basically 2 main causes for 429 errors, but both have the same meaning: Th...