[00:13:19] random other thought, the http client has an IOReactorConfig that we don't configure, maybe we could consider adjusting its backlogSize and ioThreadCount
[00:13:37] (or maybe i missed where we configure that)
[00:23:38] perhaps collecting thread dumps will help pinpoint if it's stuck in the http client somewhere? Ran out of time today, but might look into those bits tomorrow
[07:56:11] Another thought: the async operator keeps the order of events. So any slow response in between would prevent already-processed events from leaving the outgoing buffer of this operator. If we split by (fast/slow) wiki and have dedicated async operators, that might also help.
[09:17:06] Uh, and we have a backoff for retries. So events might linger in the async operator's queue without leasing a connection. Maybe we should decouple queue size and connection pool size
[09:18:02] the async operator for rerenders should not keep event ordering
[09:18:39] we should track what we try in T353460 I think
[09:18:39] T353460: The consumer job of the SUP does not achieve its expected throughput - https://phabricator.wikimedia.org/T353460
[09:23:34] Sure, I'll do that
[09:25:24] dcausse: as of now, both operators use ordered outputs
[09:26:31] pfischer: huh?! then that might explain the problem we see, the main goal for separating the rev-based and rerender fetchers was to allow unordered output on rerenders
[09:29:53] Okay, let me fix that. Still, to leverage the connection pool, we should somewhat decouple queue size and pool size. Maybe with a ratio? Sadly the async operator does not give us metrics on the retries.
[09:34:12] oh you mean the async op retry logic? this means we have 3 levels of retries: envoy, the http client and the async op?
[09:34:40] Yes.
[09:35:48] hard to tell if it's useful or hit often without metrics indeed :/
[09:36:18] perhaps it's only retrying unsalvageable events?
[09:36:34] I could add that by adding a counter to the UpdateEvent
[09:36:55] So we get an idea of the retry rate
[09:37:20] sure
[09:42:26] pfischer: if you're about to add another field to UpdateEvent perhaps see my comment on https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/92 to add another dedicated one for the async operator time-spent metric (perhaps doing this in a single patch to limit the hassle of the serialization test checks)
[09:44:40] yes, I was lazy and didn't want to break state 🙊
[09:44:56] AFK, back in 15'
[09:46:12] :)
[09:46:36] *sigh* Erik mentioned this in https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/77/diffs#b7e645457cc5c32b2ac6bbeae668d794528ca109_256_292
[09:47:12] and I said "oh sure! thanks, will fix!" but never fixed it somehow :(
[10:13:52] That happens, we learned to better understand flink along the way, so that's worth it.
[10:35:34] yes... but glad you noticed it! :)
[11:08:47] BTW: it's only two levels of retries, we enforce both http clients to never retry
[11:14:06] oh ok, makes sense
[11:34:41] lunch
[12:46:03] dcausse: here are the order fix + retry counter: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/93
[13:43:17] pfischer: lgtm, just one small comment
[14:01:31] Thanks, fixed.
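[editor's note] For reference, a minimal sketch of what the order fix and the queue-size/pool-size decoupling discussed above could look like, assuming Flink's AsyncDataStream API. The class and variable names and the 2:1 capacity ratio are illustrative assumptions, not the actual contents of MR 93.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;

public final class RerenderFetcherWiring {

    /**
     * Re-render fetches don't need event ordering, so unorderedWait() lets fast
     * responses leave the operator even while a slow wiki is still in flight.
     * The operator capacity is deliberately larger than the http connection pool
     * so events sitting in retry backoff don't hold the whole pool hostage.
     */
    public static DataStream<UpdateEvent> wire(
            DataStream<UpdateEvent> rerenders,
            AsyncFunction<UpdateEvent, UpdateEvent> fetchFn, // the rerender fetch function
            int connectionPoolSize) {
        int capacity = connectionPoolSize * 2; // assumption: a 2:1 queue-to-pool ratio
        return AsyncDataStream.unorderedWait(
                rerenders,            // previously wired with orderedWait()
                fetchFn,
                30, TimeUnit.SECONDS, // per-element timeout, placeholder value
                capacity);
    }
}
```

The rev-based fetcher would keep orderedWait(), since ordering still matters there.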
[14:07:12] BTW: I changed the gitlab settings for merge requests on this project: it now requires resolving open threads before allowing a merge
[14:08:30] oh good idea, thanks for doing this
[14:55:28] seems like throughput is much higher
[15:00:00] looking at https://grafana-rw.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&refresh=30s&var-dc=eqiad%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main&var-container_name=All&var-site=&from=now-2d&to=now at least it seems to have unblocked something
[15:12:52] Yes, that's promising. I added more detailed request (duration) shares and retries (both broken down by $fetch_operator_name). Looks like the majority of non-rerender updates are caused by Wikidata, and yet it is the slowest responder
[15:13:39] In other words we only benefit to some extent from discarding the order for re-renders
[15:16:15] wikidata accounts for ~50% of the rev-based updates across all the wikis
[15:17:24] we should investigate why it's slow, it should not trigger the MW parser so in theory should be faster to generate...
[15:18:34] another option to perhaps keep in mind would be to skip "non-latest" revisions, might be useful to speed up backfills
[15:23:38] surprising that wikidata is the top "retrier" for rerenders, it's even close to the top 3 of wikis with most rerenders
[15:27:18] *not* even close
[15:32:13] O/
[15:33:50] o/
[15:36:11] \o
[15:36:38] looks like success :) 600-800 events/sec processed. More would be interesting, but this is what we said we wanted the other day
[15:37:11] +1
[15:38:00] we still need to re-add envoy into the mix, possibly giving it much more resources
[15:38:22] Shall I go with the default for now?
[15:38:45] yea, should be reasonable. We could probably back out some of the changes too (skipping envoy, and devnull skipping doc merge)
[15:39:20] I'll start with envoy, that's config only.
[15:40:44] i'm just looking at how to revert from gitlab's ui, not finding the button :)
[15:40:46] Nice, sounds like y'all are making progress
[15:41:08] inflatador: could the apple folks help you yesterday?
[15:43:10] pfischer apparently they require an appointment, so I had to take it in this morning. They said it would take about a week
[15:59:37] inflatador: that's annoying, I hope they can fix it by then. Is this covered by some corporate apple care plan?
[16:04:30] dcausse: could we check if any I/O actions correlate with the spikes in wikidata response times? I just looked at https://grafana.wikimedia.org/d/dIsNiroVk/wikibase-rest-api-on-wikidata?orgId=1&refresh=30s&from=now-1h&to=now and it looks like around 15:52 UTC there was an increase in requests which coincides with the response times going up
[16:06:13] hm... the wikidata rest api is probably mostly served from the external api cluster
[16:06:53] so if there was any contention that might be at the persistence level (mariadb or some other store, memcached?)
[16:08:00] hm... looking closer at the numbers of the wikidata rest api they're very low (I don't think it reached v1)
[16:08:09] Y, it's under warranty. This is the first time in ~8 years of using Macs I've had a problem I couldn't fix myself. Which is good, since it's basically impossible to fix their hardware ;(
[16:08:55] thinking about it, the wikidata rest api might not have enough traffic to draw any conclusions/correlation
[16:13:00] hm... with envoy the retry rate seen from flink exploded
[16:13:05] :S
[16:13:53] well.. k8s reports more than 1s of cpu throttling for its containers
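[editor's note] The "retry rate seen from flink" above comes from the counter added in MR 93. Below is a minimal sketch of how such a counter could be wired into a Flink RichAsyncFunction, under the assumption that retries are driven from the async operator; UpdateEvent.incrementRetryCount(), the metric name, and the fetch() placeholder are hypothetical, not the SUP's actual API.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class CountingFetchFunction extends RichAsyncFunction<UpdateEvent, UpdateEvent> {

    private transient Counter operatorRetries;

    @Override
    public void open(Configuration parameters) {
        // registered per operator, so grafana can split the retry rate by fetch operator name
        operatorRetries = getRuntimeContext().getMetricGroup().counter("fetch_retries");
    }

    @Override
    public void asyncInvoke(UpdateEvent event, ResultFuture<UpdateEvent> resultFuture) {
        fetch(event).whenComplete((fetched, error) -> {
            if (error != null) {
                // hypothetical: bump the counter carried on the event plus the operator metric,
                // then re-schedule the fetch with backoff (not shown here)
                event.incrementRetryCount();
                operatorRetries.inc();
            } else {
                resultFuture.complete(Collections.singleton(fetched));
            }
        });
    }

    // placeholder for the real http call
    private CompletableFuture<UpdateEvent> fetch(UpdateEvent event) {
        return CompletableFuture.completedFuture(event);
    }
}
```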
[16:16:56] might need some guidance to know what's a reasonable throughput/concurrency we can expect with envoy and 1 cpu
[16:18:36] it's curious that flink could do the same requests without throttling, also that the taskmanager cpu usage doesn't seem to have gone down at all, but indeed envoy is throttling heavily
[16:19:30] true...
[16:19:51] hmm, so y'all were getting more or less acceptable performance until you re-enabled envoy?
[16:19:52] i don't really know, but it certainly seems like something is up with envoy there.
[16:20:54] inflatador: interestingly, throughput actually looks reasonable, still seeing 600-800
[16:21:17] (adding envoy is the restart at ~15:45 UTC iiuc)
[16:21:26] I pinged Janis about envoy
[16:22:11] inflatador: for this info the flink-app dashboard is best:
[16:22:13] https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?forceLogin&from=now-1h&orgId=1&to=now&var-app=flink-app-consumer-devnull&var-datasource=eqiad%20prometheus%2Fk8s-staging&var-flink_job_name=cirrus_streaming_updater_consumer_devnull_staging&var-helm_release=consumer-devnull&var-k8sds=eqiad%20prometheus%2Fk8s-staging&var-namespace=cirrus-streaming-updater&var-operator_name=All&var-opsds=eqiad%20prometheus%2Fops&var-service=cirrus-streaming-updater&var-site=eqiad
[16:22:19] could be staging related, there's still "some" little cpu throttling on the flink side even though it's not even close to the configured request/limit
[16:23:05] interesting... I wonder how we could check that
[16:26:24] we could technically deploy to prod since it's a net-new service, wouldn't hurt anything
[16:26:49] Janis is looking into the envoy issue, too, seems like he already knows about the envoy throttling
[16:26:59] pfischer: thanks
[16:27:43] perhaps time to move out of staging indeed and write to cloudelastic from wikikube@eqiad
[16:29:02] According to Janis it's a configuration issue with envoy, which is too greedy in opening threads or something like that. I'll create a ticket
[16:30:10] https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits#envoy
[16:34:49] thanks!
[16:34:56] going offline, have a nice weekend
[16:35:03] You too!
[16:38:05] pfischer nice, that page is a great resource overall for learning about k8s' weird ideas around throttling
[16:57:12] inflatador: https://phabricator.wikimedia.org/T353460#9437962 - is this something you can change, i.e. set that value?
[16:58:17] Or does that simply go into our helm values?
[16:58:43] Ah, looks like it
[17:00:49] pfischer ACK, looks like you found it
[17:02:57] ebernhardson: thanks for rolling back, I'll redeploy with the new version and reduced concurrency once CI has passed.
[17:03:14] pfischer: excellent
[17:04:56] We finally hit the limit of our connection pool around 16:20 UTC. Shall I expand that too?
[17:05:37] hmm, yea seems reasonable
[17:16:21] hmm, was looking into the fake-job idea. /rpc/RunSingleJob.php is not available from mw-api-int-ro. We could use jobrunner.discovery.wmnet, but i think that's the old cluster. puppet services suggest mw-jobrunner.svc.discovery.wmnet should be the new k8s cluster, but that name doesn't resolve
[17:18:00] Hm, who could help with k8s name resolution?
[17:19:19] hugh added the mw-jobrunner to puppet services, i suppose i can poke him
[17:20:16] Alright, I got to go. I re-deployed, looking forward to the results
[17:20:21] thanks! great progress
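[editor's note] A sketch tying together the two http-client knobs from this log: the IOReactorConfig idea from 00:13 and the connection-pool expansion from 17:04, assuming the 4.x Apache async client is in use. The concrete numbers and the single-route assumption are placeholders, not recommendations.

```java
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;
import org.apache.http.impl.nio.reactor.IOReactorConfig;

public final class FetchClientFactory {

    public static CloseableHttpAsyncClient build(int maxConnections) {
        IOReactorConfig reactor = IOReactorConfig.custom()
                .setIoThreadCount(Runtime.getRuntime().availableProcessors()) // explicit instead of the library default
                .setBacklogSize(256)                                          // accept-queue backlog, placeholder value
                .build();
        // per the 11:08 remark, retries stay disabled at the client level;
        // only the async operator retries (not configured here).
        return HttpAsyncClients.custom()
                .setDefaultIOReactorConfig(reactor)
                .setMaxConnTotal(maxConnections)     // the pool ceiling that was hit around 16:20 UTC
                .setMaxConnPerRoute(maxConnections)  // assumption: traffic funnels through a single route (the envoy sidecar)
                .build();
    }
}
```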
[17:20:25] Have a nice weekend
[17:42:59] seems better, envoy throttling is down to a few hundred ms instead of 1s+. Throughput might be a little faster. The application-level retry rate doesn't seem improved though, rerender retries are up :S
[17:44:37] upstream request rate is way up, envoy retry rates are also 3x
[17:50:35] eek
[18:28:55] * ebernhardson meh's at my dev env now OOMing while initializing. fun :P
[18:55:06] * ebernhardson is mildly disappointed en-x-piglatin.l10n.php doesn't seem to have pig latin. It has basically one message in pig latin and inherits en for the rest.
[18:57:29] another underrepresented language... paging Trey ;P
[19:49:31] somehow this doesn't seem right: tried to allocate 94252030325520 bytes
[19:50:20] impressive, how did it even generate 94 terabytes of data
[19:51:07] indeed, i'm also impressed :) testing something with the job runner locally and that's coming out of something creating a job instance
[19:57:07] appointment, back in ~2h
[21:18:41] * ebernhardson has a terrible first draft of the job... now to make it not crap :)
[22:25:49] back