[00:13:19] random other thought, the http client has an IOReactorConfig that we don't configure, maybe we could consider adjusting its backlogSize and ioThreadCount
[00:13:37] (or maybe i missed where we configure that)
[00:23:38] perhaps collecting thread dumps will help pinpoint if it's stuck in the http client somewhere? Ran out of time today, but might look into those bits tomorrow
[07:56:11] Another thought: the async operator keeps the order of events. So any slow response in between would prevent already-processed events from leaving the outgoing buffer of this operator. If we split by (fast/slow) wiki and have dedicated async operators, that might also help.
[09:17:06] Uh, and we have a backoff for retries. So events might linger in the async operator's queue without leasing a connection. Maybe we should decouple queue size and connection pool size
[09:18:02] the async operator for rerenders should not keep event ordering
[09:18:39] we should track what we try in T353460 I think
[09:18:39] T353460: The consumer job of the SUP does not achieve its expected throughput - https://phabricator.wikimedia.org/T353460
[09:23:34] Sure, I'll do that
[09:25:24] dcausse: as of now, both operators use ordered outputs
[09:26:31] pfischer: huh?! then that might explain the problem we see, the main goal for separating the rev-based and rerender fetchers was to allow unordered output on rerenders
[09:29:53] Okay, let me fix that. Still, to leverage the connection pool, we should somewhat decouple queue size and pool size. Maybe with a ratio? Sadly the async operator does not give us metrics on the retries.
[09:34:12] oh you mean the async op retry logic? this means we have 3 levels of retries: envoy, the http client and the async op?
[09:34:40] Yes.
[09:35:48] hard to tell if it's useful or hit often without metrics indeed :/
[09:36:18] perhaps it's only retrying unsalvageable events?
[09:36:34] I could add that by adding a counter to the UpdateEvent
[09:36:55] So we get an idea of the retry rate
[09:37:20] sure
[09:42:26] pfischer: if you're about to add another field to UpdateEvent perhaps see my comment on https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/92 to add another dedicated one for the async operator time-spent metric (perhaps doing this in a single patch to limit the hassle of the serialization test checks)
[09:44:40] yes, I was lazy and didn't want to break state 🙊
[09:44:56] AFK, back in 15'
[09:46:12] :)
[09:46:36] *sigh* Erik mentioned this in https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/77/diffs#b7e645457cc5c32b2ac6bbeae668d794528ca109_256_292
[09:47:12] and I said "oh sure! thanks, will fix!" but never fixed it somehow :(
[10:13:52] That happens, we learned to better understand flink along the way, so that's worth it.
[10:35:34] yes... but glad you noticed it! :)
[11:08:47] BTW: it's only two levels of retries, we enforce both http clients to never retry
[11:14:06] oh ok, makes sense
[11:34:41] lunch
[12:46:03] dcausse: here are the order fix + retry counter: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/93
[13:43:17] pfischer: lgtm, just one small comment
[14:01:31] Thanks, fixed.
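[editor's note] For reference, a minimal sketch of what the order fix and the queue-size/pool-size decoupling discussed above could look like, assuming Flink's AsyncDataStream API. The class and variable names and the 2:1 capacity ratio are illustrative assumptions, not the actual contents of MR 93.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;

public final class RerenderFetcherWiring {

    /**
     * Re-render fetches don't need event ordering, so unorderedWait() lets fast
     * responses leave the operator even while a slow wiki is still in flight.
     * The operator capacity is deliberately larger than the http connection pool
     * so events sitting in retry backoff don't hold the whole pool hostage.
     */
    public static DataStream<UpdateEvent> wire(
            DataStream<UpdateEvent> rerenders,
            AsyncFunction<UpdateEvent, UpdateEvent> fetchFn, // the rerender fetch function
            int connectionPoolSize) {
        int capacity = connectionPoolSize * 2; // assumption: a 2:1 queue-to-pool ratio
        return AsyncDataStream.unorderedWait(
                rerenders,            // previously wired with orderedWait()
                fetchFn,
                30, TimeUnit.SECONDS, // per-element timeout, placeholder value
                capacity);
    }
}
```

The rev-based fetcher would keep orderedWait(), since ordering still matters there.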
[14:07:12] BTW: I changed the gitlab settings for merge requests on this project: it now requires resolving open threads before allowing a merge
[14:08:30] oh good idea, thanks for doing this
[14:55:28] seems like throughput is much higher
[15:00:00] looking at https://grafana-rw.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&refresh=30s&var-dc=eqiad%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-api-int&var-release=main&var-container_name=All&var-site=&from=now-2d&to=now at least it seems to have unblocked something
[15:12:52] Yes, that's promising. I added more detailed request (duration) shares and retries (both broken down by $fetch_operator_name). Looks like the majority of non-rerender updates are caused by Wikidata, and yet it is the slowest responder
[15:13:39] In other words we only benefit to some extent from discarding the order for re-renders
[15:16:15] wikidata accounts for ~50% of the rev-based updates across all the wikis
[15:17:24] we should investigate why it's slow, it should not trigger the MW parser so in theory should be faster to generate...
[15:18:34] another option to perhaps keep in mind would be to skip "non-latest" revisions, might be useful to speed up backfills
[15:23:38] surprising that wikidata is the top "retrier" for rerenders, it's even close to the top 3 of wikis with most rerenders
[15:27:18] *not* even close
[15:32:13] O/
[15:33:50] o/
[15:36:11] \o
[15:36:38] looks like success :) 600-800 events/sec processed. More would be interesting, but this is what we said we wanted the other day
[15:37:11] +1
[15:38:00] we still need to re-add envoy into the mix, possibly giving it much more resources
[15:38:22] Shall I go with the default for now?
[15:38:45] yea, should be reasonable. We could probably back out some of the changes too (skipping envoy, and devnull skipping doc merge)
[15:39:20] I'll start with envoy, that's config only.
[15:40:44] i'm just looking at how to revert from gitlab's ui, not finding the button :)
[15:40:46] Nice, sounds like y'all are making progress
[15:41:08] inflatador: could the apple folks help you yesterday?
[15:43:10] pfischer apparently they require an appointment, so I had to take it in this morning. They said it would take about a week
[15:59:37] inflatador: that's annoying, I hope they can fix it by then. Is this covered by some corporate apple care plan?
[16:04:30] dcausse: could we check if any I/O actions correlate with the spikes in wikidata response times? I just looked at https://grafana.wikimedia.org/d/dIsNiroVk/wikibase-rest-api-on-wikidata?orgId=1&refresh=30s&from=now-1h&to=now and it looks like around 15:52 UTC there was an increase in requests which coincides with the response times going up
[16:06:13] hm... the wikidata rest api is probably mostly served from the external api cluster
[16:06:53] so if there was any contention that might be at the persistence level (mariadb or some other store, memcached?)
[16:08:00] hm... looking closer at the numbers of the wikidata rest api they're very low (I don't think it reached v1)
[16:08:09] Y, it's under warranty. This is the first time in ~8 years of using Macs I've had a problem I couldn't fix myself. Which is good, since it's basically impossible to fix their hardware ;(
[16:08:55] thinking about it, the wikidata rest api might not have enough traffic to draw any conclusions/correlation
[16:13:00] hm... with envoy the retry rate seen from flink exploded
[16:13:05] :S
[16:13:53] well.. k8s reports more than 1s of cpu throttling for its containers
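[editor's note] The "retry rate seen from flink" above comes from the counter added in MR 93. Below is a minimal sketch of how such a counter could be wired into a Flink RichAsyncFunction, under the assumption that retries are driven from the async operator; UpdateEvent.incrementRetryCount(), the metric name, and the fetch() placeholder are hypothetical, not the SUP's actual API.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class CountingFetchFunction extends RichAsyncFunction<UpdateEvent, UpdateEvent> {

    private transient Counter operatorRetries;

    @Override
    public void open(Configuration parameters) {
        // registered per operator, so grafana can split the retry rate by fetch operator name
        operatorRetries = getRuntimeContext().getMetricGroup().counter("fetch_retries");
    }

    @Override
    public void asyncInvoke(UpdateEvent event, ResultFuture<UpdateEvent> resultFuture) {
        fetch(event).whenComplete((fetched, error) -> {
            if (error != null) {
                // hypothetical: bump the counter carried on the event plus the operator metric,
                // then re-schedule the fetch with backoff (not shown here)
                event.incrementRetryCount();
                operatorRetries.inc();
            } else {
                resultFuture.complete(Collections.singleton(fetched));
            }
        });
    }

    // placeholder for the real http call
    private CompletableFuture<UpdateEvent> fetch(UpdateEvent event) {
        return CompletableFuture.completedFuture(event);
    }
}
```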
[16:16:56] might need some guidance to know what's a reasonable throughput/concurrency we can expect with envoy and 1 cpu
[16:18:36] it's curious that flink could do the same requests without throttling, also that the taskmanager cpu usage doesn't seem to have gone down at all, but indeed envoy is throttling heavily
[16:19:30] true...
[16:19:51] hmm, so y'all were getting more or less acceptable performance until you re-enabled envoy?
[16:19:52] i don't really know, but it certainly seems like something is up with envoy there.
[16:20:54] inflatador: interestingly, throughput actually looks reasonable, still seeing 600-800
[16:21:17] (adding envoy is the restart at ~15:45 UTC iiuc)
[16:21:26] I pinged Janis about envoy
[16:22:11] inflatador: for this info the flink-app dashboard is best:
[16:22:13] https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?forceLogin&from=now-1h&orgId=1&to=now&var-app=flink-app-consumer-devnull&var-datasource=eqiad%20prometheus%2Fk8s-staging&var-flink_job_name=cirrus_streaming_updater_consumer_devnull_staging&var-helm_release=consumer-devnull&var-k8sds=eqiad%20prometheus%2Fk8s-staging&var-namespace=cirrus-streaming-updater&var-operator_name=All&var-opsds=eqiad%20prometheus%2Fops&var-service=cirrus-streaming-updater&var-site=eqiad
[16:22:19] could be staging related, there's still "some" little cpu throttling on the flink side even though it's not even close to the configured request/limit
[16:23:05] interesting... I wonder how we could check that
[16:26:24] we could technically deploy to prod since it's a net-new service, wouldn't hurt anything
[16:26:49] Janis is looking into the envoy issue, too, seems like he already knows about the envoy throttling
[16:26:59] pfischer: thanks
[16:27:43] perhaps time to move out of staging indeed and write to cloudelastic from wikikube@eqiad
[16:29:02] According to Janis it's a configuration issue with envoy, which is too greedy in opening threads or something like that. I'll create a ticket
[16:30:10] https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits#envoy
[16:34:49] thanks!
[16:34:56] going offline, have a nice weekend
[16:35:03] You too!
[16:38:05] pfischer nice, that page is a great resource overall for learning about k8s' weird ideas around throttling
[16:57:12] inflatador: https://phabricator.wikimedia.org/T353460#9437962 - is this something you can change, i.e. set that value?
[16:58:17] Or does that simply go into our helm values?
[16:58:43] Ah, looks like it
[17:00:49] pfischer ACK, looks like you found it
[17:02:57] ebernhardson: thanks for rolling back, I'll redeploy with the new version and reduced concurrency once CI has passed.
[17:03:14] pfischer: excellent
[17:04:56] We finally hit the limit of our connection pool around 16:20 UTC. Shall I expand that too?
[17:05:37] hmm, yea seems reasonable
[17:16:21] hmm, was looking into the fake-job idea. /rpc/RunSingleJob.php is not available from mw-api-int-ro. We could use jobrunner.discovery.wmnet, but i think that's the old cluster. puppet services suggest mw-jobrunner.svc.discovery.wmnet should be the new k8s cluster, but that name doesn't resolve
[17:18:00] Hm, who could help with k8s name resolution?
[17:19:19] hugh added the mw-jobrunner to puppet services, i suppose i can poke him
[17:20:16] Alright, I got to go. I re-deployed, looking forward to the results
[17:20:21] thanks! great progress
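[editor's note] A sketch tying together the two http-client knobs from this log: the IOReactorConfig idea from 00:13 and the connection-pool expansion from 17:04, assuming the 4.x Apache async client is in use. The concrete numbers and the single-route assumption are placeholders, not recommendations.

```java
import org.apache.http.impl.nio.client.CloseableHttpAsyncClient;
import org.apache.http.impl.nio.client.HttpAsyncClients;
import org.apache.http.impl.nio.reactor.IOReactorConfig;

public final class FetchClientFactory {

    public static CloseableHttpAsyncClient build(int maxConnections) {
        IOReactorConfig reactor = IOReactorConfig.custom()
                .setIoThreadCount(Runtime.getRuntime().availableProcessors()) // explicit instead of the library default
                .setBacklogSize(256)                                          // accept-queue backlog, placeholder value
                .build();
        // per the 11:08 remark, retries stay disabled at the client level;
        // only the async operator retries (not configured here).
        return HttpAsyncClients.custom()
                .setDefaultIOReactorConfig(reactor)
                .setMaxConnTotal(maxConnections)     // the pool ceiling that was hit around 16:20 UTC
                .setMaxConnPerRoute(maxConnections)  // assumption: traffic funnels through a single route (the envoy sidecar)
                .build();
    }
}
```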
[17:20:25] Have a nice weekend
[17:42:59] seems better, envoy throttling is down to a few hundred ms instead of 1s+. Throughput might be a little faster. The application-level retry rate doesn't seem improved though, rerender retries are up :S
[17:44:37] upstream request rate is way up, envoy retry rates are also 3x
[17:50:35] eek
[18:28:55] * ebernhardson meh's at my dev env now OOMing while initializing. fun :P
[18:55:06] * ebernhardson is mildly disappointed en-x-piglatin.l10n.php doesn't seem to have pig latin. It has basically one message in pig latin and inherits en for the rest.
[18:57:29] another underrepresented language... paging Trey ;P
[19:49:31] somehow this doesn't seem right: tried to allocate 94252030325520 bytes
[19:50:20] impressive, how did it even generate 94 terabytes of data
[19:51:07] indeed, i'm also impressed :) testing something with the job runner locally and that's coming out of something creating a job instance
[19:57:07] appointment, back in ~2h
[21:18:41] * ebernhardson has a terrible first draft of the job... now to make it not crap :)
[22:25:49] back