[00:26:47] serviceops, Patch-For-Review: Productionise thumbor1005, thumbor1006, thumbor2005 and thumbor2006 - https://phabricator.wikimedia.org/T285477 (Legoktm) thumbor1006 is ready now, will pool it tomorrow.
[01:29:23] serviceops, Wikipedia-Android-App-Backlog (Android Release FY2021-22): Create and host assetlinks.json file. (Android 12 deeplinking support) - https://phabricator.wikimedia.org/T294776 (cooltey) Open→Resolved
[08:17:32] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (Gehel) >>! In T280485#7506072, @akosiaris wrote: > Is T280485#7275149 related to blazegraph and not flink ? I am not sure wha...
[09:26:35] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (akosiaris) >>! In T280485#7509094, @Gehel wrote: >>>! In T280485#7506072, @akosiaris wrote: >> Is T280485#7275149 related to...
[10:14:08] hello folks, in https://phabricator.wikimedia.org/T291579 the dcops team is racking the worker nodes for the ML K8s cluster for training
[10:14:47] after some chats we thought we'd expand the scope of the cluster a little to allow other DSE-related use cases (like Airflow for Data Engineering, etc.)
[10:15:02] most of them will share the same pain point, like authenticating via Kerberos to Hadoop
[10:15:20] (this is how LinkedIn solved it https://engineering.linkedin.com/blog/2020/open-sourcing-kube2hadoop)
[10:15:57] if you agree, I'd change the name of the hosts to kube-dse100X and assign them to the Analytics VLAN
[10:16:45] the only doubt that I have is whether we can reserve enough IPs in the analytics vlan for pods, but in theory it should be fine
[10:16:51] (will check later on)
[10:17:54] the alternative is to keep the kubernetes cluster in the private vlan(s), and change the firewall rules in analytics land (both ingress and egress) to allow these new IPs
[10:32:24] elukey: what does the analytics VLAN have to do with the IPs of the pods?
[10:33:08] actually, for this conversation let's decouple the nodes from the actual workloads
[10:33:43] do we strictly need the k8s nodes to be in the analytics VLAN? will they be accessing data themselves?
[10:34:02] if yes, then the analytics vlan it is. If not, it becomes more interesting
[10:34:46] the reason it becomes more interesting is that we could have the pod IPs in something that is similar to the analytics VLAN firewall-wise, so that the workloads can fetch and store the data they want kinda unhindered
[10:36:19] however, that does open up a hole. Suppose some malicious workload somehow manages to exploit some kernel vuln and escapes to the node (while clutching the data). At that point it can probably just exfiltrate it way more easily, as it has inadvertently passed 1 of the security mechanisms
[10:38:09] which strengthens the argument for the nodes being in the analytics VLAN. On the other side, punching the various needed holes for the nodes to communicate with cr*, datastores, prometheus etc is going to be a small pain
[10:39:13] (no suggestion yet btw, just weighing pros and cons here)
[10:41:23] one interesting thing is the pod IPs assignment and how we treat that.
To keep the status quo we currently have (and thus use calico and not have to re-think networking just for this cluster) we'll need dedicated ipv{4,6} prefixes for which the nodes will be the destination ones. We can't really reuse analytics IP space (as we haven't really used
[10:41:24] private1-* IP ranges for the production one either)
[10:42:28] I think we can just update the firewalling rules and add those prefixes to the analytics cr filters, but I might be missing something
[10:43:49] ahhhh sorry I just realized that I had a wrong underlying assumption, namely the subnets of the pod IPs
[10:44:02] I was naively assuming that those were private ones
[10:44:19] going to answer the questions above
[10:44:28] > do we strictly need the k8s nodes to be in the analytics VLAN? will they be accessing data themselves?
[10:45:30] In theory yes to the latter - for example, kubeflow pods will need to fetch data from the feature store for training (and so far we'd like to re-use something like Hive or Spark in Data Engineering land rather than having a new datastore, but we don't have clear ideas yet)
[10:46:27] and Airflow pods will need to kick off jobs on Hadoop, or run some workloads themselves to fetch data etc.
[10:47:06] but now I realized that you asked about 'k8s nodes', not pods
[10:47:13] * elukey brain crash, rebooting
[10:47:27] so no, the nodes themselves will not access data :)
[10:47:33] the pods will need to
[10:48:20] so to recap
[10:48:36] 1) k8s nodes can stay in private land
[10:50:03] 2) pod ip prefixes will be allocated as usual, but we'll need to allow them in the Analytics ferm firewall, and possibly in the juniper filters too (but in this case it would be only if hosts within the analytics vlan need to contact k8s pods right?)
[14:23:09] akosiaris: o/ whenever you have a moment, is my summary above ( 1) and 2) ) correct?
(I need to add some inputs to the task that dcops is handling today :D)
[14:26:53] <_joe_> elukey: I'm not sure about 1)
[14:28:23] <_joe_> I would feel that if the pods need to allocate IPs in a vlan that has access to analytics
[14:28:34] <_joe_> then the nodes should probably be in the analytics vlan
[14:30:23] elukey: for 2) you are correct. ferm+junipers for whatever needs the pods might have.
[14:30:57] for 1), I am still ambivalent. As I pointed out above, by having the nodes in the private1-* land, we do shed a layer of protection (but we do gain some ease of mind)
[14:31:13] <_joe_> I'm not sure what would change in terms of firewalling to the analytics vlan
[14:31:24] <_joe_> I think we already allow prometheus to scrape things there
[14:31:33] cr* access
[14:31:41] bgp to them that is
[14:31:52] that's 1 thing that comes to mind pretty easily. I am sure there is more.
[14:32:49] ah, docker-registry access I guess too
[14:33:13] there will probably be a couple more.
[14:34:20] bgp is a good point, didn't think about it
[14:35:05] 1 issue that will crop up if the nodes are in a different uhm, segment of the network? (VLAN isn't a good term for this anymore, there is no VLAN for the pod IPs) from the pods is that traffic between the nodes and the pods will be more difficult
[14:35:13] and IIRC, there's plenty of that
[14:35:21] thanks to istio that is. webhooks and so on
[14:35:27] poor istio
[14:35:28] :D
[14:36:06] in my mind, after Alex's explanation, if the pod subnets are outside analytics (like they will be) I see few reasons to have the underlying worker nodes in analytics too
[14:36:11] compelling reasons I mean
[14:36:53] I'd like to avoid adding holes to the juniper filters for all corner cases (bgp, etc.) until we have something working
[14:37:30] the pod subnets will need to be included in the list of allowed IPs for the services that they will use
[14:38:09] ah but that's the thing.
the pod subnets are going to have to be in the analytics cr* filters, otherwise they won't be able to access hadoop and so on
[14:38:09] If we can avoid modifying the vlan filters as well I'd be very happy :D
[14:38:41] and the moment they are in that filter... well they are on the 1 side of the fence immediately (which is how we designed this back then anyway)
[14:38:42] akosiaris: the filters are for traffic from analytics towards production (this is my understanding)
[14:39:14] so if pods are outside, we'll just need to tweak ferm rules (pods -> analytics services/hosts)
[14:39:22] is it the wrong picture?
[14:39:56] ah yes, that rings a bell, they only block the path from analytics -> production, not vice versa
[14:40:30] IIRC if the connection is initiated on the private1-* side, it isn't blocked ?
[14:40:40] exactly
[14:40:45] this is my understanding as well
[14:40:55] ferm takes care of filtering
[14:41:58] if that remains true, then yeah, the pod ranges don't need to be in cr*
[14:42:09] which actually.... makes me want to ask
[14:42:31] what stops you currently from reaching out to analytics hosts from the current ml cluster and fetching the data you need? just ferm ?
[14:43:39] yes exactly
[14:55:02] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (Gehel) >>>>! In T280485#7506072, @akosiaris wrote: >>> Is T280485#7275149 related to blazegraph and not flink ? I am not sure...
[15:00:09] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (akosiaris) >>! In T280485#7510193, @Gehel wrote: >>>>>! In T280485#7506072, @akosiaris wrote: >>>> Is T280485#7275149 related...
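The filtering model settled on in the 14:38-14:43 exchange above can be sketched in Python. This is a minimal illustration under stated assumptions, not the actual Juniper or ferm configuration: the prefixes and every function name below are hypothetical, and the real cr* filters may carry exceptions that this model ignores. It shows why, if the router filters only block connections initiated from analytics, opening ferm on the analytics hosts is enough for pods to reach them.

```python
import ipaddress

# Hypothetical prefixes, for illustration only (not real allocations).
ANALYTICS_NETS = [ipaddress.ip_network("10.64.100.0/24")]
POD_NETS = [ipaddress.ip_network("10.67.0.0/21")]


def in_any(addr, nets):
    """Return True if addr falls inside any of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in nets)


def router_filter_allows(initiator, responder):
    """Model of the cr* filter as described in the chat: it only blocks
    connections initiated inside analytics towards the rest of production."""
    leaving_analytics = in_any(initiator, ANALYTICS_NETS) and not in_any(
        responder, ANALYTICS_NETS
    )
    return not leaving_analytics


def ferm_allows(initiator, allowed_sources):
    """Model of a host-level ferm rule: accept only listed source prefixes."""
    return in_any(initiator, allowed_sources)


def connection_allowed(initiator, responder, ferm_allowed_sources):
    """A connection must pass the router filter and the destination's ferm rule."""
    return router_filter_allows(initiator, responder) and ferm_allows(
        initiator, ferm_allowed_sources
    )


pod, hadoop_host = "10.67.0.12", "10.64.100.20"
# pod -> analytics host, pod prefixes added to ferm
print(connection_allowed(pod, hadoop_host, POD_NETS))
# pod -> analytics host, ferm left untouched
print(connection_allowed(pod, hadoop_host, []))
# analytics host -> pod, blocked at the router regardless of ferm
print(connection_allowed(hadoop_host, pod, POD_NETS + ANALYTICS_NETS))
```

Under these assumptions the three calls print True, False and False: only the ferm change matters for pod-initiated traffic, while anything initiated from analytics towards the pods would still need a hole in the router filter.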
[15:07:58] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (dcausse) small precision: If we reuse the same cluster (same k8s namespace): - it's 3 more pods at 2.1G ram, cpu: 1000m each...
[15:09:48] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (Zbyszko) Sorry all for the confusion my typo caused, different name for that magnitude in my native language is confusing me...
[15:26:50] serviceops, Parsoid: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (ssastry)
[15:27:18] serviceops, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (ssastry)
[15:28:05] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (akosiaris) >>! In T280485#7510247, @dcausse wrote: > small precision: > If we reuse the same cluster (same k8s namespace): >...
[15:29:35] serviceops, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (ssastry) This is currently blocking @ihurbain on T295837 and so would appreciate a quickish turnaround on this.
[15:30:19] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Joe)
[15:37:12] serviceops, Parsoid-Tests, SRE, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (ssastry) Ping!
This would also make accessing test results less cumbersome without needing to set up ssh tunnels.
[15:37:14] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (JMeybohm) I'd opt for "reuse the same [flink] cluster" from the perspective that we treat this snowflaky-ish in the k8s clust...
[15:38:07] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (akosiaris) >>! In T280485#7510249, @Zbyszko wrote: > Sorry all for the confusion my typo caused, different name for that magn...
[16:25:58] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Dzahn) There are actually 2 levels of access, parsoid-test-admins and parsoid-test-roots. test-admins has these sudo privs: ` 654...
[17:11:57] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (ihurbain) For the record: I don't seem to have access to bastions either - it asks for a password instead of doing a key auth, on what...
[18:42:17] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Dzahn) Hi @ihurbain, welcome to WMF. You can't login on those bastions because you don't actually have a shell account yet (in produc...
[18:45:10] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Dzahn) @ihurbain Also read L3 and sign it, please.
[19:03:12] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Dzahn) p:Triage→High
[19:13:30] serviceops, GitLab, Security-Team, Release-Engineering-Team (Radar), SecTeam-Processed: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 (hashar) After talking with @Brennen today, a while back we got budget to add some hardware to the Ganeti cluster. The...
[19:30:06] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (MSantos) @Dzahn and @ssastry I can't access `scandium.eqiad.wmnet` and `testreduce1001.eqiad.wmnet`. My shell user is `mbsantos`.
[19:33:24] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Urbanecm) >>! In T295900#7511306, @MSantos wrote: > @Dzahn and @ssastry I can't access `scandium.eqiad.wmnet` and `testreduce1001.eqi...
[19:36:15] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (MSantos) @Urbanecm I had the impression this task is for all [[ https://www.mediawiki.org/wiki/Content_Transform_Team | Content Transf...
[19:36:58] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Urbanecm) >>! In T295900#7511332, @MSantos wrote: > @Urbanecm I had the impression this task is for all [[ https://www.mediawiki.org/w...
[20:31:42] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Jgiannelos) I also don't have access to `scandium` and `testreduce1001`. Similar with Mateus, I am new to the content transform team.
[20:44:15] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking), Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Dzahn) @MSantos @Jgiannelos Try again now:) Since you had existing shell users it was in this case just adding...
[20:48:56] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking), Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Dzahn) This group gives you the following sudo privileges: ` 654 privileges: ['ALL = NOPASSWD: /usr/sbin/...
[20:53:27] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking), Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Dzahn)
[20:54:18] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking), Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Dzahn) added the team members from https://www.mediawiki.org/wiki/Content_Transform_Team to have their own check...
[21:04:13] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking), Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (ihurbain) @Dzahn I read and agreed and signed the L3. Here's my brand new public key: `ssh-ed25519 AAAAC3NzaC1l...
[21:13:58] serviceops, SRE, SRE-Access-Requests, Parsoid (Tracking), Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (Dzahn) @ihurbain Hey Isabelle, thank you. confirmed signature:) And yes, that is a public key and looks good to...
[22:37:48] serviceops, GitLab, Security-Team, Release-Engineering-Team (Radar), SecTeam-Processed: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 (Dzahn) @Jelto I [[ https://wikitech.wikimedia.org/wiki/Ganeti#Verify_cluster_resource_availability | checked capacity...
[23:57:39] serviceops, decommission-hardware: decommission thumbor1003.eqiad.wmnet - https://phabricator.wikimedia.org/T285479 (AntiCompositeNumber)
[23:57:41] serviceops, decommission-hardware: decommission thumbor100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T273137 (AntiCompositeNumber)
[23:59:30] serviceops, Patch-For-Review: Productionise thumbor1005, thumbor1006, thumbor2005 and thumbor2006 - https://phabricator.wikimedia.org/T285477 (Legoktm) thumbor1005 is fully pooled, thumbor1006 is pooled at weight=5 and thumbor1003 is depooled but not removed from the memcache/nutcracker pool yet. thumbo...