[07:58:58] hi folks
[07:59:32] I have rolled out the new version of istio to ml-serve/dse/aux, Janis is going to roll out to the wikikube clusters
[09:11:57] inflatador | dcausse: were you able to figure out your oomk issues from yesterday or do you need some help?
[09:13:51] as to memory QoS: IIRC that is still alpha stage even with k8s 1.27 - so I doubt we'll see that anytime soon in our clusters
[09:28:00] jayme: no, still trying to figure things out by tuning some params, what's not clear to me is why it gets oomkilled, the app uses all the mem it is given and does not seem to "struggle" when the oomkill happens
[09:28:28] in prod we tune requests < limits, with the k8s operator requests=limits
[09:28:56] so it's flink I suppose?
[09:29:01] yes
[09:29:22] the updated version of the rdf-streaming-updater for the operator?
[09:29:30] yes exactly
[09:30:20] we get an oomkill of one container every 12 to 24 hours
[09:35:01] can you share some more details? Like which container, is it the same one every time or is it "random"?
[09:35:48] I do see a bunch of taskmanager pods and one that's just called flink-app-wqds - that one seems to be running for quite some time already
[09:37:33] (if you want me to take a look that is ofc. :-))
[09:37:44] of course I want! :)
[09:38:05] so there's this dashboard: https://grafana-rw.wikimedia.org/d/gCFgfpG7k/flink-cluster?from=now-2d&to=now&var-datasource=eqiad%20prometheus%2Fk8s-dse&var-namespace=rdf-streaming-updater
[09:38:39] uh, colorful :)
[09:38:51] the jobmanager pod is fine, it's the taskmanager ones that are causing the oom
[09:38:54] yes ...
[09:39:09] ah, so the jobmanager is the one without suffix, right?
[09:39:29] sorry, I keep forgetting the flink terms over and over again
[09:40:34] I think the jobmanager one gets a random suffix assigned by something (not sure what, helm or k8s?)
[09:41:11] the taskmanager ones are created by flink itself and get a suffix like -1-4
[09:43:50] as a comparison this is how it looks on wikikube for the production job: https://grafana-rw.wikimedia.org/d/gCFgfpG7k/flink-cluster?orgId=1&from=now-24h&to=now
[09:43:52] if anybody has time today/tomorrow I'd need a brain bounce for https://github.com/RadeonOpenCompute/k8s-device-plugin#prerequisites. They seem to offer only the daemonset deployment strategy, very simple but it requires --allow-privileged=true on the kubelets (which afaics we don't use). Is it something that we can allow, or not?
[09:46:29] elukey: --allow-privileged is a switch for the apiserver and it's true for all clusters (because of calico)
[09:49:49] for reference: looking at this helps with flink terms :) https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/concepts/flink-architecture/
[09:50:20] jayme: when looking at https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-dse&var-namespace=rdf-streaming-updater&var-pod=All cpu throttling seems bad no?
[09:53:39] jayme: is it only for kubeapi? They mention it for kubelet too, so not needed at all right?
[09:53:56] (so basically the gpu plugin will run as calico runs its daemonset more or less)
[09:54:41] dcausse: yeah, does not look good. For my understanding: you ask the flink operator to run 3 task-managers, correct? And the jobmanager distributes work to them? I wonder why one of the 3 seems more loaded than the others
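For context on the requests=limits point above: with the flink-kubernetes-operator, CPU and memory are declared per component on the FlinkDeployment custom resource, and the operator applies the same value as both request and limit on the pods. A minimal sketch, assuming the operator's v1beta1 API; the name, jar path, sizes and parallelism are illustrative, not the actual rdf-streaming-updater configuration:

# Illustrative FlinkDeployment sketch (hypothetical values throughout).
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-app-example            # hypothetical name
  namespace: rdf-streaming-updater
spec:
  flinkVersion: v1_16
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"
  jobManager:
    resource:
      cpu: 1
      memory: "1536m"                # applied as both request and limit on the pod
  taskManager:
    resource:
      cpu: 2
      memory: "4096m"                # same here: request = limit
  job:
    jarURI: local:///opt/flink/usrlib/streaming-updater.jar   # hypothetical path
    parallelism: 3                   # with one slot per taskmanager this yields three taskmanager pods
    upgradeMode: savepoint

Because request and limit are the same value, CPU throttling kicks in exactly at the declared cpu figure; the container cannot burst the way a requests<limits pod in wikikube can.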
[09:55:32] elukey: I don't think the kubelet has that flag so I'd say it will "just work"
[09:55:51] if a proper PSP applies ofc
[09:56:35] jayme: yes it's not great but there's always some data skew and that's hard to avoid, we shuffle the data by page_id so in theory it "should" be evenly distributed
[09:58:28] ack
[09:59:26] jayme: in mem usage what's the difference between "Working" and "Used"?
[10:00:35] (I'm piggybacking some learning on the operator on this so please bear with me)
[10:02:05] sure, sorry about that, will read up on that on my own
[10:03:04] did not mean that to be a response to your question, sorry.
[10:04:05] AIUI usage may contain things that can be evicted (like fs cache) and working set does not
[10:04:51] thanks! :)
[10:10:05] dcausse: the production version is running a different flink version, right?
[10:10:25] yes
[10:11:12] and it is allowed to use way more resources...I would start there and see how it behaves if you reduce throttling
[10:11:37] might be that garbage collection is not able to complete (in time) because of that for example
[10:11:55] very true, we allow 4 cpus per taskmanager pod in prod
[10:12:20] I'm assuming you're using an up-to-date jvm base image (cgroupv2 aware)?
[10:12:26] yes
[10:14:03] aiui the jobmanager does not really run "your code", right?
[10:14:38] and even that one uses like 500M more memory with the new flink version
[10:15:10] no it's just there to ack that all taskmanagers are running and does a final validation of the checkpoints
[10:15:26] issue is that java will use everything you allow it to use :)
[10:16:19] but it does not do that in prod (for the jobmanager at least: there is a 1.5G limit and it does not seem to go past ~700M)
[10:17:08] and it's constantly >1.2G in dse
[10:18:41] yes, lemme check mem settings there but I think I can constrain the jobmanager mem usage on the dse cluster to be similar to wikikube
[10:19:36] there might be no need, it's just a difference that struck me
[10:20:26] yes me too at first but with the operator the way to tune the various mem settings is widely different and I've been struggling to tune all this
[10:27:54] all the options are indeed a bit overwhelming
[10:31:10] yes...
[10:32:09] still - I'd say run with the resource requests and limits from prod first to see if it looks any different
[10:32:51] I'll go for lunch real quick and will continue when I'm back
[10:35:10] sure, thanks for the help!!
[12:44:31] dcausse: FYI, I have a newer WIP flink dashboard here: https://grafana.wikimedia.org/d/K9x0c4aVk/flink-otto-wip?from=1681130554455&orgId=1&to=1681303354456&var-datasource=eqiad+prometheus%2Fk8s-dse&var-namespace=rdf-streaming-updater
[12:44:37] some more details on memory, etc.
[12:44:50] Some kafka metrics too
[12:45:11] ottomata: thanks!
[12:47:28] something I just realized is that it's the first time we run flink with jemalloc
[12:48:18] I might try to disable it to see if mem usage drops to something closer to what we've seen with flink 1.12
[13:03:45] interesting! cc gmodena
[13:46:02] dcausse: should we bring the DISABLE_JEMALLOC env var conditional back into the docker-entrypoint?
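If that conditional were restored in the image's entrypoint, the variable could presumably be toggled per deployment through the operator's pod template rather than by rebuilding the image. A sketch under that assumption; the container name follows the operator's convention for the main container:

# Hedged sketch: opting out of jemalloc via the FlinkDeployment pod template,
# assuming the entrypoint honours DISABLE_JEMALLOC like the upstream flink-docker one.
spec:
  podTemplate:
    spec:
      containers:
        - name: flink-main-container   # merged with the operator-generated main container
          env:
            - name: DISABLE_JEMALLOC
              value: "true"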
[13:46:03] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/production-images/+/refs/heads/master/images/flink/flink/Dockerfile.template#129
[13:46:06] https://github.com/apache/flink-docker/blob/master/1.16/scala_2.12-java11-ubuntu/docker-entrypoint.sh#L92
[13:46:22] we always use jemalloc in our flink production image now
[13:48:41] ottomata: oh indeed, in its current shape I'll either have to completely override the entrypoint from my image or re-introduce this var
[13:53:44] why do I only see rdf-streaming-updater in the flink dashboards btw.? Shouldn't there be the enrichment thing as well on dse?
[14:11:51] dcausse: I saw this link in a flink discussion about whether jemalloc should be the default, https://stackoverflow.com/a/33993215/1236063
[14:12:08] seems like at least on some workloads jemalloc uses quite a bit more memory
[14:14:17] jhathaway: yes exactly, thanks for the pointer, I'd like to test this hypothesis to see if jemalloc is the cause
[14:29:35] jayme: mw enrichment is not deployed right now. We are doing some tuning/profiling on yarn
[14:29:54] ah, okay. Thanks
[14:33:47] jayme: np. I'd like to run tests on k8s again this week. The tuning is tracked in https://phabricator.wikimedia.org/T332948
[15:49:04] dcausse: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/908256
[15:54:15] ottomata: thanks! looking
[16:30:58] elukey, jayme: we need a place to store flink checkpoints. So far we've been (ab)using thanos swift. Data persistence has asked us to investigate the possibility of using k8s PersistentVolumes instead. What are our chances of using them in DSE and wikikube? And, if we could, what are the backing storage options? Just local filesystem I suppose (unless DP provides some other storage for this?)
[16:31:11] re https://phabricator.wikimedia.org/T330693
[16:34:20] ottomata: we're not going to support persistent storage in our k8s clusters in the foreseeable future, sorry. IIUC there are some experiments(?) (btullis maybe) but those will probably not reach wikikube
[16:37:31] thanks, good to know
[16:39:40] for reference: https://www.mediawiki.org/wiki/Kubernetes_SIG/Meetings/2023-02-28
[16:40:39] hm, how do I get invited to these meetings :)
[16:42:27] No chance :-p
[16:43:49] (I sent you an invite)
[16:44:21] ty!
[16:44:27] next one will be on the 25th, continuing the current discussion about the update process and work sharing
[16:44:36] great, looking forward to it
[19:34:34] dcausse: gmodena FYI 1.16.0-wmf6 available (I did not test it :o )
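On the checkpoint-storage question: whether the backend is thanos swift or a PersistentVolume mostly changes what state.checkpoints.dir points at; with an S3-compatible frontend such as the one Flink's s3 filesystem plugins can talk to, it stays a pure flinkConfiguration concern. A hedged sketch of the object-store variant; bucket, endpoint and credential handling are placeholders, not the real thanos-swift setup:

# Illustrative object-store checkpointing via Flink's s3 filesystem plugin.
spec:
  flinkConfiguration:
    state.checkpoints.dir: "s3://example-bucket/checkpoints"   # placeholder bucket
    state.savepoints.dir: "s3://example-bucket/savepoints"
    s3.endpoint: "https://object-store.example.org"            # placeholder endpoint
    s3.path.style.access: "true"
    s3.access-key: "<from a secret, not inlined>"
    s3.secret-key: "<from a secret, not inlined>"

A PersistentVolume-backed variant would instead point state.checkpoints.dir at a mounted filesystem path, which is what the task above is weighing against the swift approach.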