[01:18:20] meh, i don't like setting this backoff value directly, but apparently to set it the way i want i need to solve for x, where x is inside a summation. Back to remedial math for me :P
[02:08:08] * ebernhardson cheats, uses desmos to choose a plausible value, and calls it a day :P
[10:18:05] lunch
[10:19:15] ejoseph: I just saw your meeting in 40'. I'm not going to be back from lunch yet. Any chance we could move it to 1h later?
[10:20:32] lunch
[10:25:17] Sure
[11:07:20] lunch
[12:54:32] pfischer: hey, Antoine pointed me at https://github.com/moabualruz/docker-arm-wikimedia-dev-images
[14:08:58] gehel: can i get these patches reviewed: https://gerrit.wikimedia.org/r/c/mediawiki/libs/metrics-platform/+/836208, https://gerrit.wikimedia.org/r/c/mediawiki/libs/metrics-platform/+/838230
[14:09:26] I need them to fix the other patches I have open
[14:57:26] ouch, a burst of 16k failed errors/min just now
[14:57:32] dcausse: ahh, yea that one we are seeing none of for some reason :S it's not showing up in the graphite data
[14:57:52] there was an issue during a network maintenance
[14:57:52] is being rolled back
[14:57:57] ah
[14:58:01] we also depooled eqiad from the CDN
[14:58:05] right now
[14:58:08] see -ops
[14:58:12] volans: thanks for the heads up, might be related
[14:58:32] ok, i was also starting to worry because i also saw a spike of eqiad.failed of 15k/min, that sounds plausible :)
[14:58:56] should be recovering now, if it doesn't let us know
[14:59:01] or could be unrelated
[14:59:02] sure
[15:01:27] seems to recover indeed
[15:01:27] so now the promql query works, it certainly shows tripping breakers, minimally in omega and mostly in psi
[15:01:45] and chi is fine (which makes sense, the graphs show tons of heap available in the primary cluster as well)
[15:02:06] but we don't really know the source... i wonder if envoy records anything about the 429's
[15:05:55] or maybe we shouldn't spend so much time investigating, and simply give omega/psi another GB of memory
[15:06:28] did we deploy new jvm options?
[15:06:34] not yet
[15:06:46] was intending more memory as the next step i suppose
[15:06:52] ok
[15:10:06] hmm, i suppose i could allow this dump more parallelism, each query targets a single shard and will thus only use a single search pool thread in the cluster
[15:10:42] have to start it back up anyways
[15:11:35] will try the batch size we use in cirrus as well and see how it goes, although this one has 10 retries with backoff (up to 2 minutes on the later retries)
[15:17:19] oh that might explain why it's slow?
[15:17:40] yea the backoff could explain it, we sadly don't get any metrics out about that. Will have to ponder how
[15:17:40] if it always falls into the worst case scenario
[15:18:07] we have the prometheus push gateway now so can in theory push out metrics from batch jobs, not sure how that works if we had many executors all submitting the same metrics though
[15:19:25] but based on cluster stats, only psi and omega are giving circuit breaker errors, i suppose i was expecting the long jobs to be the giant shards in chi
[15:22:31] me too at first but it seems more likely that it's caused by retries on circuit breaker errors
[15:31:42] gehel I dropped a comment on this patch https://gerrit.wikimedia.org/r/c/mediawiki/libs/metrics-platform/+/838230
[15:31:49] yea seems possible... will have to ponder on instrumenting these so we can get some actual stats out. It will be a similar problem with moving mjolnir reads into the yarn side, we will want metrics like we currently get
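The "solve for x inside a summation" problem from 01:18:20 and the worry at 15:17:19 that every batch hits the worst-case retry path both come down to a small geometric-series calculation. The sketch below is only an illustration: the 10 retries and the 2-minute cap come from the chat, while the 1s base delay and doubling multiplier are assumed, not the job's real settings.

```python
def total_backoff_seconds(retries=10, base=1.0, multiplier=2.0, cap=120.0):
    """Worst-case time a single batch spends sleeping if every attempt fails.

    retries and cap are taken from the chat (10 retries, backoff capped at
    2 minutes); base and multiplier are assumptions for illustration only.
    """
    return sum(min(base * multiplier ** attempt, cap) for attempt in range(retries))


if __name__ == "__main__":
    # 1 + 2 + 4 + 8 + 16 + 32 + 64 + 120 + 120 + 120 = 487 seconds, so a batch
    # that keeps tripping the circuit breaker can burn ~8 minutes just sleeping,
    # which would make even a modest dump look very slow.
    print(total_backoff_seconds())
```

The uncapped part of that sum has the closed form base * (multiplier**n - 1) / (multiplier - 1), which is the summation one would have to solve for x when picking the backoff value directly.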
[15:33:19] indeed
[15:38:56] the difficult part with pushgateway is we need to delete the metrics when done, or we need to use the same metrics (no machine-specific labels) every time. I'm not sure how to square those though... because we would emit the same metrics from multiple machines, which will overwrite each other without a machine-specific label... will think of something :)
[15:39:26] because the push gateway essentially remembers every metric ever sent and always publishes the last value sent to it when scraped, until the metric is deleted
[15:40:28] there is a textfile exporter that works around that: basically write a metrics file to a specific directory and the default node-exporter will report them and then delete the file. But only root and the prometheus group can write there
[15:44:07] can we attach a job_name (always the same) and a kind of partition id (0 to X) to the metric?
[15:44:56] hmm, if we drop down to rdd's we can get a partition id, but doing this in spark dataframes we don't get to know. Or at least not obviously as a function argument; can poke around to see if that metadata is still somehow in the python executor somewhere
[15:56:14] also got reminded by Joseph that spark3 is available and they're hoping to remove spark2 march next year (but more realistically june)
[16:12:17] oh nice, last time i checked spark3 wasn't yet available
[16:14:08] running the job with a batch size of 500, perhaps it was simply the tiny batches slowing it down a lot; the largest shard collected so far is 4GB/437k docs in 19 min, which is 23k per min and 380/s
[16:25:31] it has memory difficulties though :S I've bumped the memory limits a few times but trying to run it inside the cluster keeps getting killed with exit code 143, which should be due to SIGTERM and is commonly because the nodemanager saw excessive memory usage. but the nodemanager doesn't report killing it, only that it died. always fun :P
[16:32:56] dcausse: pentest meeting is starting: https://www.google.com/url?q=https://teams.microsoft.com/l/meetup-join/19%253ameeting_NTM4MGRhZGEtNjg1Yi00YTA0LTkwZDItNWY0MDUyMTM5MmM0%2540thread.v2/0?context%3D%257b%2522Tid%2522%253a%252211081781-e6e1-4303-bf68-46fae514bbe2%2522%252c%2522Oid%2522%253a%2522d6129995-9517-4ba7-81c3-ad17dacb488e%2522%257d&sa=D&source=calendar&ust=1665473392851040&usg=AOvVaw3Cpt1BYuT68_4MKcfY-4X7
[16:34:50] ejoseph: sorry for the delay, I've replied inline. Looks like we forgot to reimplement metrics sampling
[17:31:50] 8gb executors and still runs out of memory, guess i'll have to split fetching a shard into multiple spark partitions
[18:02:10] doc sizes are wildly variable :S 38k docs = 2.1GB. Also 254k docs = 1.5GB
[19:55:20] hmm, about 30 minutes ago eqiad-chi search thread pools went from a typical ~700 up to ~1k, but no similar increase in qps. I stopped the dump i was running from yarn, but it should have only been able to use at most 96 entries in the search thread pool at a time
[19:55:36] (similarly it's reporting increased latency, and high load on many nodes)
[20:05:31] values are declining now, but have been for ~20 minutes. They didn't drop immediately upon stopping my thing so that probably wasn't the cause
[20:21:42] ebernhardson: we restarted 3 row E hosts (increasing # of masters to 5 in eqiad), so that might be why
[20:29:02] hmm, perhaps. There was a variety of shard recovery happening at a similar time (~19:37), but the increase started earlier at 19:20
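Going back to the pushgateway discussion (15:38-15:44): one way to square "delete the metrics when done" with "no machine-specific labels" is the idea from 15:44:07, grouping the pushed series by a stable (job, partition) key and deleting those groups once the job finishes. The sketch below is only an illustration of that shape; the gateway address, job name, and metric names are made up, and whether it behaves sensibly with many executors pushing concurrently is exactly the open question from the log.

```python
# Minimal sketch of pushing per-partition batch-job metrics to the Prometheus
# Pushgateway, grouped by a stable partition id instead of a hostname, so a
# rerun overwrites the same series regardless of which executor ran it.
from prometheus_client import CollectorRegistry, Counter, push_to_gateway, delete_from_gateway

GATEWAY = "prometheus-pushgateway.example.org:9091"  # assumed address, not the real one
JOB = "cirrus_shard_dump"                            # hypothetical job name


def push_partition_metrics(partition_id, docs_fetched, retries):
    registry = CollectorRegistry()
    docs = Counter("dump_docs_fetched_total", "Docs fetched by this partition",
                   registry=registry)
    retry = Counter("dump_retries_total", "Backoff retries by this partition",
                    registry=registry)
    docs.inc(docs_fetched)
    retry.inc(retries)
    # grouping_key stands in for the machine-specific instance label, so the
    # same partition always overwrites its own series
    push_to_gateway(GATEWAY, job=JOB, registry=registry,
                    grouping_key={"partition": str(partition_id)})


def cleanup(num_partitions):
    # once the final values have been scraped, drop the groups so the gateway
    # doesn't republish stale values forever (the problem noted at 15:39:26)
    for pid in range(num_partitions):
        delete_from_gateway(GATEWAY, job=JOB,
                            grouping_key={"partition": str(pid)})
```

For the partition id itself, dropping down to the RDD layer as mentioned at 15:44:56 (e.g. mapPartitionsWithIndex) is one way to get a stable 0..X index to use as the grouping key.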
[20:29:26] it seems to be back down to normal-ish now. the high load graph is back down to mostly empty
[20:33:20] i suppose i'll start my thing back up and see what it does, if it comes back it might be me :)
[20:46:47] it does seem to be inducing some load, not to the level seen earlier but perhaps more than i'm comfortable with. re-pointing my thing at codfw
[21:05:36] i've also run '_cache/clear?fielddata=true&fields=_id' against the three eqiad clusters to clear extra memory they were using related to my thing, it was up to ~350MB per instance at max which isn't insane, but is also unnecessary
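The fielddata cache clear from 21:05:36, written out as a small script rather than three hand-typed calls. The cluster endpoints below are placeholders, not the real chi/psi/omega hosts and ports; only the API path and query parameters come from the log.

```python
# Clear only the _id fielddata that the dump job caused to be loaded,
# leaving the rest of each cluster's fielddata cache alone.
import requests

EQIAD_CLUSTERS = {
    "chi": "https://eqiad-chi.example:9200",      # placeholder endpoints
    "omega": "https://eqiad-omega.example:9200",
    "psi": "https://eqiad-psi.example:9200",
}

for name, base_url in EQIAD_CLUSTERS.items():
    resp = requests.post(f"{base_url}/_cache/clear",
                         params={"fielddata": "true", "fields": "_id"},
                         timeout=30)
    resp.raise_for_status()
    print(name, resp.json()["_shards"])
```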