[04:16:07] i.nflatador / g.ehel: I wonder if we can rip out https://gerrit.wikimedia.org/g/operations/puppet/+/ebfb23ce343a9d8adc6abbcb730f437ca65fa3dd/modules/profile/manifests/query_service/common.pp#79 since we're not using the netcat method to xfer journal files anymore
[04:55:35] If we do still want it, then we need to add wdqs20[13-22] here, which we missed: https://gerrit.wikimedia.org/g/operations/puppet/+/ebfb23ce343a9d8adc6abbcb730f437ca65fa3dd/hieradata/role/common/wdqs/public.yaml#33 (but I think we can prob just tear this stuff out)
[09:10:56] o/ dcausse: Following up on yesterday's meeting, I looked into the S3 request for the enrichment application (https://phabricator.wikimedia.org/T330693) to sort out the requirements for zookeeper (ZK) and a file system (S3). IIUC, ZK is for coordination amongst the flink coordinators (job managers & task managers) and S3 is needed to persist the state of stateful operators, aka checkpoints. Would kafka offsets be such state of the source operator?
[09:13:28] Would ZK (or ConfigMaps, as of now) be used to hold a reference to such a checkpoint?
[09:26:01] pfischer: yes, swift will store: the job artifacts (jars), the flink state (kafka offsets, the operator state, which is mainly the dedup window)
[09:26:21] so mainly we have to estimate the size of the dedup window
[09:30:33] this will also store savepoints in addition to checkpoints; savepoints can be taken manually or automatically by the flink-k8s-operator, where you can tell it e.g.: take a savepoint every 10 minutes and keep the last 10 savepoints
[09:34:58] ZK/configmaps only stores pointers to checkpoints and does leader election for flink components (e.g. which jobmanager a taskmanager should use)
[09:49:41] Alright, thanks! I already tried to research how to estimate the snapshot size. As expected, this is highly dependent on the application. Do you think running a local experiment with avg.-sized revision-based events at 30 events/s + non-revision events at 300 events/s (with a 7.5% deduplication rate) would give us an estimate when looking at the locally saved checkpoints?
[09:52:39] pfischer: I think we could even take the (enriched) event size and do the math instead of running an experiment?
[09:54:19] window_length_in_sec * rate * avg_event_size
[10:01:44] doing quick math and assuming a 50k enriched event size: enriching everything, guesstimating something around 5G; enriching only rev_based, we'd be at 500m; enriching nothing, somewhere in the 100m range
[10:05:55] So we checkpoint every 5s?
[10:08:28] 5s seems low but possible?
[10:08:42] haven't thought too much about the checkpoint rate
[10:10:15] but this should not affect the storage needs much, unless we start to have multiple checkpoints taken in parallel (which might happen if we take a checkpoint every 5s)
[10:10:45] I was just doing the reverse math: 5000000 kb / 50 kb / 307 (ops/s) -> 5s
[10:11:08] ah, you mean the window size
[10:11:16] hm, my math is wrong then :)
[10:11:53] Yes, sorry, that would be the window size.
[10:13:34] lunch
[10:14:31] so 50kb * 330 ops/s * 300 sec = 4950000 kb (ignoring dedup for now, for simplicity)
[10:15:44] https://docs.google.com/spreadsheets/d/1Fp44MdLxUVlxi03MBD_64m0zQErny-9jUD5C6RGf_bU/edit#gid=2037151092
[10:16:11] thanks, will be way easier in a spreadsheet indeed! :)
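
For reference, the sizing math above as a runnable Java sketch (Java since the updater is a Flink application); the window length, event rate, and event size are the guesstimates from the discussion, not measured values:

  // Back-of-the-envelope estimate of the dedup-window state size:
  // window_length_in_sec * rate * avg_event_size.
  public class DedupWindowStateEstimate {
      public static void main(String[] args) {
          long windowLengthSec = 300; // assumed 5-minute dedup window
          long eventsPerSec = 330;    // ~30 rev-based + ~300 other events/s
          long avgEventSizeKb = 50;   // guesstimated enriched event size
          long stateKb = windowLengthSec * eventsPerSec * avgEventSizeKb;
          // Prints ~4950000 kB (~4.95 GB), matching the ~5G guesstimate above.
          System.out.printf("~%d kB (~%.2f GB), ignoring dedup%n",
                  stateKb, stateKb / 1_000_000.0);
      }
  }
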
[10:16:35] But state would pile up inside the dedupe window, since we only flush at the end
[10:26:27] pfischer: not sure I understand?
[10:28:35] So if we use full (not diff) checkpoints, with a min pause of 1 s, we would write 300 checkpoints within a dedupe window. Each checkpoint would be the size of its predecessor + its own size.
[10:29:18] If we use diff checkpoints, the math should be as you suggested above.
[10:33:30] for full checkpoints (and allowing parallel checkpoints) it depends on how long a checkpoint takes to be taken, so no, it's unlikely that 300 checkpoints will be pending
[10:34:12] if we don't allow parallel checkpoints, only one full checkpoint + the pending one will take up storage
[10:34:43] checkpoint rate and window size are not closely related
[10:37:09] by parallel I mean tuning the "number of concurrent checkpoints" from https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
[10:37:38] lunch
[11:15:32] Yes, I saw that. I think I got it now. What confused me was checkpoint retention. Since checkpoints can be dropped once a new one has been written successfully, our max storage requirement would be determined by the maximum parallel checkpoints * the number of retained checkpoints.
[11:24:54] I created https://phabricator.wikimedia.org/T342620; I'll tag SRE-swift-storage once you've had a chance to look at it.
[11:57:45] Lunch + errands
[13:26:10] pfischer: thanks! made some adjustments, but all this is up for debate (i.e. the # of retained savepoints and the headroom %); I just picked these numbers based on my experience operating flink with WDQS
[13:41:30] Sure, thanks dcausse!
[13:44:23] pfischer I added tags to the phab task, thanks for creating it
[14:02:23] inflatador: are we doing the DPE K8s check-in?
[14:04:22] nvm, re-reading emails I believe we discussed removing it
[14:10:06] dcausse I thought I had deleted it already, sorry!
[14:10:12] np!
[14:13:53] But... if dcausse and pfischer do have time to meet, I have a quick question on https://phabricator.wikimedia.org/T341625, as it seems that kafka/swift usage are interdependent. I wanted to respond to the ticket with that, but just want to make sure my response is accurate
[14:14:19] draft response here: https://etherpad.wikimedia.org/p/kafkaswift . If you can QC/edit and get back to me, it would be appreciated
[14:14:59] Just a sec.
[14:15:09] inflatador: happy to jump into a meet to discuss this
[14:16:12] dcausse cool, up at https://meet.google.com/oom-cfms-yar
[16:57:05] pfischer: was about to merge https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/5 but sonar is complaining; is this something we can ignore?
[16:58:40] relatedly, we should try to keep https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/856507 up to date with the flink code
[16:59:38] and when you have a moment to review it, I'd like to deploy https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/465 to unblock a broken job
[17:00:04] dinner
[17:23:12] lunch, back in ~45
[17:59:18] dcausse: the failure seems to be with the build itself, not with sonar. I wonder why it is passing on the main build. Maybe a different JDK?
[18:11:52] back
[18:33:01] ryankemper we're in SRE pairing
[18:33:06] ryankemper: https://meet.google.com/eki-rafx-cxi
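
For reference, the checkpointing knobs discussed above (min pause, number of concurrent checkpoints, checkpoint storage) map onto Flink's CheckpointConfig roughly as in this minimal Java sketch; the interval, pause, and storage URI are placeholders, not the settings the team settled on:

  import org.apache.flink.streaming.api.environment.CheckpointConfig;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class CheckpointConfigSketch {
      public static void main(String[] args) {
          StreamExecutionEnvironment env =
                  StreamExecutionEnvironment.getExecutionEnvironment();
          // Checkpoint interval in ms (placeholder; the rate was left open above).
          env.enableCheckpointing(300_000);
          CheckpointConfig cfg = env.getCheckpointConfig();
          // Minimum pause between the end of one checkpoint and the start of
          // the next, in ms (the "min pause of 1 s" from the discussion).
          cfg.setMinPauseBetweenCheckpoints(1_000);
          // With no parallel checkpoints, at most one full checkpoint plus the
          // pending one takes up storage at any time.
          cfg.setMaxConcurrentCheckpoints(1);
          // Checkpoint storage, e.g. the S3-compatible swift endpoint
          // (placeholder URI, not the real bucket).
          cfg.setCheckpointStorage("s3://example-bucket/checkpoints");
      }
  }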