[07:05:16] dcausse: o/ - I am not following 100% what you wrote re: dse-k8s and rdf-streaming-updater.. Afaics you have two files, values.yaml and a dse-k8s specific one, in the helmfile.d config already dedicated to dse-k8s-services. The two will get merged by helmfile, so effectively they are like a single config (with the caveat that it is confusing to figure out what values are really picked up from a
[07:05:22] quick glance).
[07:18:38] elukey: correct, in this particular state it makes little sense, but this service will have other apps and other envs. In reality we have multiple envs (in wikikube) and multiple apps, wdqs vs wcqs. I started to split this file early to discover any issues with how the flink-app chart can be configured with stacked helmfiles to re-use some shared config
[07:22:14] more concretely we'll have a follow-up with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/896362 that declares a second app sharing a non-negligible number of config bits that I'd like to re-use in both apps; values.yaml sounded like the right place to share those
[07:29:46] dcausse: mmm but how do you share configs in this setting? Afaics you have separate helmfile.d directories, and the config picks up the various yaml files from there
[07:30:27] ah I see the wdqs yaml
[07:40:33] something I'm not sure is a practice we want to follow: I've been told (by Andrew, I believe?) that it's generally good to avoid prod settings in values.yaml so that it can be easily re-used in local testing
[07:41:23] I don't plan to do local testing from these helmfiles (the app has too many dependencies) so I've put prod settings in values.yaml
[08:15:59] dcausse: maybe he meant the values.yaml in the chart, not the one in helmfile.d?
[08:16:44] oh, perhaps? sadly I don't remember that well
[08:19:49] that would make sense :)
[08:20:17] in helmfile.d it wouldn't, as that is where we put everything together to run in our environments...
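[editor's note] The "two files get merged" behaviour discussed above is a key-by-key deep merge of the stacked values files, with later files winning. A minimal Python sketch of that semantics (the dictionaries and keys are illustrative, not the actual rdf-streaming-updater config):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, override winning on conflicts.
    An approximation of how helmfile stacks values.yaml plus an
    environment-specific values file."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# shared values.yaml vs. a hypothetical values-dse-k8s-eqiad.yaml override
shared = {"app": {"image": "flink-app", "config": {"parallelism": 1}}}
env = {"app": {"config": {"parallelism": 4}}}
print(deep_merge(shared, env))
# → {'app': {'image': 'flink-app', 'config': {'parallelism': 4}}}
```

Keys absent from the override (like `image` here) are inherited from the shared file, which is why it is hard to tell at a glance which file a final value came from.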
[08:28:54] dcausse: The approach we've taken for mediawiki is to create a _mediawiki-common_ folder with some global.yaml files, and symlink them in the deployments that need them
[08:30:35] claime: interesting, did not know this was possible, thanks! I guess we could use that if we can't/don't want to share the same service folder
[08:32:17] dcausse: yep, you just need to reference them properly in helmfile.yaml (like any other values file)
[09:57:05] dcausse: o/ when you have a min, do you mind reading my last updates in https://phabricator.wikimedia.org/T344614 and letting me know your thoughts? I am very confused :)
[10:12:48] elukey: I don't know why Brian thought that zookeeper was not used, everything I found in the logs suggested that it's being used properly
[10:13:49] and sorry for re-using s3://rdf-streaming-updater-staging in dse-k8s... very confusing, but I thought this experiment in dse-k8s would not last this long :(
[10:15:43] we'll do some more testing (restarting the flink jobmanager) where the data stored in zookeeper is necessary
[10:16:51] but tbh it's already necessary: flink requires a leader-election mechanism for many of its components, and I don't see any of the configmaps it created when we relied on its k8s h/a mechanism
[10:17:13] dcausse: yeah I was wondering what is being used between the s3 metadata and the zookeeper stuff, I assumed that some znodes were stored but probably not? Or maybe it uses zookeeper only for brief moments?
[10:18:12] no, it should use it quite often, to verify which jobmanager is the leader for instance
[10:18:41] there's absolutely no traffic between k8s and zk?
[10:19:48] I would assume that some data would be stored too; zk is totally empty?
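[editor's note] The shared-folder approach above amounts to listing the symlinked file alongside the per-service values in the release definition. A hypothetical helmfile.yaml sketch (paths, folder name, and release name are illustrative, not the actual deployment-charts layout):

```yaml
# helmfile.yaml for one deployment; the _common_ folder is symlinked in
releases:
  - name: rdf-streaming-updater
    chart: wmf-stable/flink-app
    values:
      - common/global.yaml            # shared values, symlinked from a _common_ folder
      - values.yaml                   # service-wide values
      - values-dse-k8s-eqiad.yaml     # environment-specific overrides, merged last
```

Files are merged top to bottom, so the environment-specific file can override anything from the shared or service-wide layers.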
[10:20:39] exactly, yes, empty
[10:20:53] :/
[10:24:05] flink should fail if it's not able to connect to zk, unless flink H/A got disabled somehow
[10:24:34] I don't see connections to port 2181, nor traffic to that port via tcpdump
[10:24:39] this is why I am puzzled
[10:27:51] we might have messed up the config and disabled h/a, will double-check all the config options
[10:28:22] from https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/ha/zookeeper_ha/#example-configuration I see that we are missing the cluster-id
[10:29:12] high-availability.type is flink 1.17, not 1.16
[10:29:25] that might explain it
[10:29:45] we use 1.16
[10:30:23] yes, cluster-id seems like something important to set up
[10:31:28] I have a pairing session with Brian this afternoon, we'll fix this
[10:32:20] ah yes https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/ha/zookeeper_ha/#example-configuration
[10:33:01] this is definitely not right: "high-availability.type": ZOOKEEPER
[10:34:54] I'll update the task
[10:34:58] yes, this probably disabled h/a completely
[10:35:00] thanks!
[10:38:13] https://phabricator.wikimedia.org/T344614#9163070
[10:38:15] dcausse: --^
[10:38:26] thank you for the brainbounce!
[13:14:04] elukey re: FW rules, looks like the dse-k8s hosts are in 10.64.0.0/22, is there some kind of proxying going on or something?
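[editor's note] For reference, the Flink 1.16 ZooKeeper HA example configuration linked above looks roughly like the following. Note the legacy key `high-availability: zookeeper` (the `high-availability.type` key only exists from Flink 1.17 on, which is why setting it on 1.16 silently left HA disabled), and the per-cluster `high-availability.cluster-id`. Host and storage values here are the docs' placeholders, not the production settings:

```yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
high-availability.zookeeper.path.root: /flink
# important: customize per cluster, otherwise clusters sharing the same
# ZooKeeper quorum will collide on the default id
high-availability.cluster-id: /cluster_one
high-availability.storageDir: hdfs:///flink/recovery
```

With HA correctly enabled, Flink stores leader-election znodes and job metadata pointers under the ZooKeeper root/cluster-id path, so an empty ZooKeeper tree (and no traffic to port 2181) is a strong sign HA is off.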
[13:18:03] also, I'll get a patch up to fix the flink HA config, thanks for finding that
[13:18:50] inflatador: o/ so the pods are in a different subnet than production, this is why you see a different range
[13:20:00] if you check the FW rules they say "DSE_KUBEPODS_NETWORKS"
[13:20:40] yeah, I saw that, but was confused because 10.64.0.0/22 is not open on the zk hosts
[13:21:01] I couldn't get bidirectional comms going until I one-offed an iptables rule to allow it
[13:21:17] but that's from the worker as opposed to the container
[13:23:15] inflatador: sorry, I am not following
[13:23:34] if you check calico-values.yaml for dse you'll see that the pod subnet is 10.67.24.0/21
[13:23:43] that is allowed by iptables on zk nodes
[13:23:47] elukey OK, that is what I was missing
[13:23:58] those IPs are managed by calico
[13:24:03] I was expecting to see the range of the worker hosts
[13:24:08] ah okok
[13:24:34] thanks again. sorry for the confusion
[13:26:01] nono, please, this channel is here to ask questions :)
[13:27:16] if you are curious, jump on a dse worker node and exec "ip route"
[13:30:46] * inflatador really needs to get back into SDN
[13:31:12] one day I'll tell you a sad tale about STT ;P
[13:49:41] inflatador: let us know if flink works with ZK with the new settings, now I am really curious :D
[13:49:57] np, getting a patch up now
[13:54:08] .30
[13:56:01] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/957298/
[13:56:29] think I'm gonna make a small change... sec
[13:57:20] OK
[14:23:09] inflatador / dcausse: as I see it from the sidelines, there is a huge amount of repetition in all that flink-app config. It would be super nice if you could plan on refactoring that a bit (in the chart, probably) so we come to a proper standard.
[14:24:15] like all those s3 paths, zookeeper paths, cluster-ids... all of that looks like it could be hidden from the operator by just defining useful defaults
[14:25:07] jayme: sure, good idea
[14:26:33] cool!
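[editor's note] One way the chart could "hide" the repeated s3 and zookeeper paths that jayme mentions is to derive them from the release name in a template helper, so per-release values files only set a base location. A purely hypothetical `_helpers.tpl` sketch (the helper names, value keys, and path layout are illustrative, not the actual flink-app chart API):

```
{{/* Derive HA paths from one base setting plus the release name,
     so values files don't need to repeat them per app/env. */}}
{{- define "flink-app.ha.storageDir" -}}
{{ .Values.app.objectStoreBase }}/{{ .Release.Name }}/ha
{{- end -}}

{{- define "flink-app.ha.clusterId" -}}
/{{ .Release.Name }}
{{- end -}}
```

The defaults would still need an override hook for cases like the shared s3://rdf-streaming-updater-staging bucket mentioned earlier.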
[14:28:40] please subscribe me to the task when you have one :)
[14:32:25] sure, will do
[14:36:10] jayme looks like the flink operator (as opposed to the app) can't talk to zk. Do we need to set egress rules somewhere?
[14:36:42] I recall that the operator talks to ZK as well, yes
[14:37:06] not 100% sure though... there might be some knowledge in the phab tasks
[14:37:22] yeah, it definitely talks to ZK, just not sure where to set the network policies
[14:37:54] nm, think dcausse found it
[14:47:19] jayme just to confirm, we're adding egress rules for the operator in helmfile.d/admin_ng/flink-operator/values.yaml identical to https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/dse-k8s-services/rdf-streaming-updater/values-dse-k8s-eqiad.yaml#L33 , LMK if that will cause problems
[14:47:42] looking at the chart code, I believe that it might be able to pull networkpolicies.egress config, but not 100% sure
[14:48:02] no idea if that's implemented in the chart...
[14:50:34] maybe we can just pull in the new network policy version per Erik's recent patch
[16:11:37] jayme: it does not seem to be able to pull fixtures for staging-codfw
[16:14:54] seems like staging-codfw is skipped in the Rakefile
[16:17:17] probably here https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/Rakefile#394 where it skips loading .fixtures/general-#{env_name}.yaml
[16:27:12] dcausse: in a meeting, sorry
[16:27:33] but yeah, totally plausible. No other thing in admin_ng uses those defaults
[17:15:13] ended up re-adding the fixtures; not sure I understand how to fix the build for admin_ng without them :/
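[editor's note] The egress rules being copied for the operator are an allow-list of destination networks and ports in the chart's values. A hypothetical sketch of the shape such a values fragment might take (the key names and CIDR are placeholders, not the actual chart schema or the production ZooKeeper addresses — see the linked values-dse-k8s-eqiad.yaml for the real ones):

```yaml
networkpolicy:
  egress:
    enabled: true
    dst_nets:
      # placeholder ZooKeeper host; repeat per quorum member
      - cidr: 10.0.0.1/32
        ports:
          - protocol: tcp
            port: 2181
```

Whether the flink-operator chart actually renders this into a NetworkPolicy was exactly the open question in the chat ("no idea if that's implemented in the chart").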