[08:38:34] consumer-cloudelastic failed during the night with an OOM (https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2024.14?id=dAGrq44BX0U9mJhKrM7i), might be time to increase the k8s mem limits?
[08:56:15] dentist appointment, back soon-ish
[10:13:03] lunch
[12:56:57] Trey314159: I'm reading the standup notes late. The test refactoring seems like a good next task!
[13:14:04] o/
[13:17:29] o/
[13:18:32] dcausse: I’ll look into the OOM, thanks!
[13:22:29] pfischer dcausse I can get a patch up for the memory limits if no one else is on it
[13:23:41] I might be wrong, but scanning the logs I feel that the job did not even try to restart
[13:23:52] Recovery is suppressed by FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=10, backoffTimeMS=180)
[13:26:10] cannot find the 10 restart attempts...
[13:28:29] I see them now, but they're spread over a long period of time, wondering if that restart counter is ever reset
[13:31:41] ah no, my bad, they happen over 10 minutes: https://logstash.wikimedia.org/app/dashboards#/view/7b67aa70-7e57-11ee-93ea-a57242f792cd?_g=h@75fb871&_a=h@193af5e
[13:32:04] sorry, wrong link: https://logstash.wikimedia.org/goto/5c461278a9a816d4442eaafc2a55d38e
[13:32:43] ryankemper, inflatador: I'm moving T354670 to in progress. Can you please make sure that tickets that are being worked on show up on our board?
[13:32:44] T354670: cleanup apifeatureusage indices on the Cirrus elasticsearch cluster (fix curator) - https://phabricator.wikimedia.org/T354670
[13:34:18] looking closer, filtering on "RUNNING to RESTARTING", I only see 5 restarts
[13:53:28] gehel ACK, sorry that one got lost
[14:35:16] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1017296 patch up for adding memory to cloudelastic taskManagers
[14:38:32] inflatador: thanks! But I did not realize that peter increased the replica count to 5; 5*6gb might be above what we can request at the namespace level
[14:40:24] dcausse yeah, I was a bit confused on that, `limits.memory 5500Mi 100Gi` is what I see in the namespace
[14:40:48] if I'm reading that correctly, we have a 100Gi quota?
[14:40:55] seems huge
[14:42:27] True, but like I always say, I'm OK with throwing HW at the problem. Happy to adjust down if you had another number in mind, though
[14:44:53] here's the k8s capacity dashboard for context: https://grafana.wikimedia.org/d/pz5A-vASz/kubernetes-resources?orgId=1
[14:45:57] I'll let pfischer ponder this, dunno if by increasing the replica count he also wanted to address the OOM issues, increase throughput, or simply both :)
[14:47:50] sure, we can wait
[14:55:40] 100Gi seems to be about .007 of available capacity FWIW
[14:56:49] :)
[14:58:02] I feel that 2g for the flink process is indeed too small, this leaves only 700m for the heap, I would perhaps bump to 3g first and see
[15:01:11] and the containers do seem to rarely use more than 1.5G, so we could fine-tune this to win a few hundred MB for the heap
[15:01:30] i.e. reduce the overhead-fraction
[15:03:29] \o
[15:05:26] o/
[15:05:32] for the memory issue, if we can simply throw memory at it that's probably fine. I suspect something else is happening here though. taskmanager heap memory was all reporting ~500MB, memory saturation was 55%
[15:05:48] it feels like something tried to take hundreds of MB of memory in a few ms
[15:06:07] from the flink-app dashboard
[15:06:17] yeah, I wonder what is triggering that
[15:07:19] not sure, our batch sizes are down at 25mb so even a couple of copies while flushing shouldn't be that bad
[15:07:37] yes, I think flink is aggressively limiting itself to stay under the requested 2Gb, I'm suspecting something like the overhead-fraction (defaults to 10%)
[15:08:26] I suspect that it's buffering more than we think
[15:08:51] hmm, perhaps need to poke around and see if we can see how full various buffers are in real time
[15:08:52] the checkpoint sizes increased with the new sink, suggesting that there are more docs pending
[15:09:31] could be related to the replica increase tho
[15:09:37] i suppose on the other hand, if giving it an extra gb per taskmanager makes it stop failing...that's easy
[15:09:54] it's not like we have a budget :P
[15:10:01] yes I would agree
[15:11:03] i wonder how saneitizer will affect this, it's going to increase the indexing load
[15:11:04] inflatador: so perhaps 3g first instead of 6g to see how it goes?
[15:11:18] dcausse sure, will work on the patch now
[15:12:23] ebernhardson: no clue, might require some experimentation to find the right settings I suppose?
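As a rough illustration of the "2g leaves only ~700m for the heap" estimate above: Flink carves taskmanager.memory.process.size into JVM overhead, metaspace, network buffers, managed memory, and framework memory before the task heap gets anything. The sketch below assumes the commonly documented defaults (overhead fraction 0.1 bounded to [192m, 1g], 256m metaspace, network fraction 0.1, managed fraction 0.4, 128m framework heap and off-heap); the actual chart may override some of these, so treat the numbers as approximations rather than what the deployment really does.

```python
# Back-of-the-envelope Flink TaskManager memory breakdown.
# Assumes documented defaults; a real deployment may override any of them.

def taskmanager_breakdown(process_mb: float,
                          overhead_fraction: float = 0.1,
                          metaspace_mb: float = 256,
                          network_fraction: float = 0.1,
                          managed_fraction: float = 0.4,
                          framework_heap_mb: float = 128,
                          framework_offheap_mb: float = 128) -> dict:
    """Estimate where taskmanager.memory.process.size goes, in MB."""
    jvm_overhead = min(max(process_mb * overhead_fraction, 192), 1024)
    total_flink = process_mb - jvm_overhead - metaspace_mb
    network = min(max(total_flink * network_fraction, 64), 1024)
    managed = total_flink * managed_fraction
    task_heap = (total_flink - network - managed
                 - framework_heap_mb - framework_offheap_mb)
    return {
        "jvm_overhead": round(jvm_overhead),
        "metaspace": round(metaspace_mb),
        "network": round(network),
        "managed": round(managed),
        "jvm_heap": round(task_heap + framework_heap_mb),  # task + framework heap
    }

if __name__ == "__main__":
    for total in (2048, 3072):
        b = taskmanager_breakdown(total)
        print(f"{total}m process -> ~{b['jvm_heap']}m JVM heap, "
              f"{b['managed']}m managed, {b['jvm_overhead']}m overhead")
```

With these defaults a 2g process ends up with roughly 650-700m of JVM heap, and 3g lands a bit above 1.1g; lowering taskmanager.memory.jvm-overhead.fraction (the "overhead-fraction" mentioned above) shifts part of that overhead budget back toward the heap, within the configured min/max bounds.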
[15:14:05] perhaps. The defaults are actually super low, rerender defaults to 16 (= 32 weeks)
[15:15:56] i also decided to take some 10% time yesterday, wrote most of a thing that sits above the reindexer and does all wikis. General idea is to track reindexing by recording the indices attached to the prod aliases, and considering reindexing done when they've changed.
[15:16:25] allows writing a single state file at the beginning, and referencing it later instead of tracking state updates through the process
[15:17:05] nice! the aliases are what matters indeed, so they should hold the truth
[15:22:10] downside is it's 500 lines :P and another 300 for the base reindex script...this is all quite complicated
[15:22:48] it brings parallel reindexing limited by concurrent shard counts and concurrent wiki counts, and batches the backfills
[15:23:02] and i suppose there are 0 tests
[15:24:24] these things are super hard to test, but perhaps big enough to think about moving it into a repo that has some python support, not sure where though?
[15:24:59] well perhaps not now, another repo means a way to deploy...
[15:25:21] yea, perhaps it should live somewhere with proper support, it's a bit complicated to just sit next to related bits
[15:25:30] yes...
[15:26:23] but overall this should greatly speed up the process, batching multiple wikis for the backfill is going to help a lot I suppose
[15:26:30] it could actually also get the other cirrus scripts like check_indices.py
[15:27:09] i suppose the lazy option would be to put it in the Cirrus repo and consider adding a tox test runner there. We could call it a micro-mono-repo :P
[15:27:19] :)
[15:27:34] personally I'm fine either way :)
[15:27:59] will ponder, but i agree it should live somewhere. testing currently amounts to me running flake8 on it, and running the script
[15:40:25] quick break, back in ~20
[16:14:54] back
[16:15:14] heading out, have a nice weekend
[16:15:28] Merged/deployed the SUP patch...should have 3GB RAM for consumer-cloudelastic.yaml now, up from the default 2GB
[17:54:26] taking a break, back in ~40
[20:07:41] sorry, been back
[21:08:40] not feeling so hot...going to take off a little early. Happy Friday!
[21:08:58] aww, hope the weekend helps out!
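For context on the reindexer idea discussed around 15:15-15:27 (snapshot the concrete indices behind each wiki's production aliases up front, then treat reindexing as done once the aliases point somewhere else), a minimal sketch of that tracking logic might look like the following. The endpoint, alias names, and state-file layout are illustrative assumptions, not taken from the actual script.

```python
# Hypothetical sketch of alias-based reindex completion tracking.
# Endpoint, alias names and file layout are illustrative only.
import json
import requests


def snapshot_aliases(es_url, aliases):
    """Map each production alias to the concrete indices currently behind it."""
    state = {}
    for alias in aliases:
        resp = requests.get(f"{es_url}/_alias/{alias}", timeout=10)
        resp.raise_for_status()
        # The response is keyed by concrete index name,
        # e.g. {"testwiki_content_1712345678": {"aliases": {...}}}
        state[alias] = sorted(resp.json().keys())
    return state


def reindex_complete(es_url, initial):
    """True once every alias points at a different set of indices than at the start."""
    current = snapshot_aliases(es_url, list(initial))
    return all(current[alias] != indices for alias, indices in initial.items())


if __name__ == "__main__":
    es = "https://cloudelastic.wikimedia.org:9243"  # assumed endpoint
    initial = snapshot_aliases(es, ["testwiki_content", "testwiki_general"])
    with open("reindex_state.json", "w") as f:  # the single up-front state file
        json.dump(initial, f, indent=2)
    print("done" if reindex_complete(es, initial) else "still reindexing")
```

Comparing later polls against a snapshot written once at the start is what avoids threading state updates through every step, at the cost of trusting the aliases as the source of truth, which, per the discussion, they are.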