[00:08:32] o/ ebernhardson: While renaming all those operators I noticed another curiosity in our job graph structure and came up with another PR: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/90
[06:49:38] Are we cancelling retro tomorrow in favor of quiet week? Probably makes sense with g.ehel gone anyway
[08:18:23] o/
[08:21:54] o/ Happy New Year, David!
[08:22:01] thanks! you too :)
[11:37:46] lunch
[14:25:37] I'm having laptop issues atm - be in as soon as I can
[14:48:34] dcausse: How can I check if a MW config change has been rolled out? In particular, I would like to know if page_rerender is now enabled for the expected wikis
[14:49:24] pfischer: hm, probably with the cirrus config dump api or by listening for messages in the kafka topic
[14:49:48] e.g. https://en.wikipedia.org/w/api.php?action=cirrus-config-dump
[14:50:41] Ah, just found that. 👀
[14:52:05] hm, CirrusSearchUseEventBusBridge is not allowed to be exported, so you might not be able to see it from this api :/
[14:53:38] so might not be testable using test servers :/
[14:54:51] Yeah, couldn't find anything. Hm. Should we make that setting API-visible somehow?
[14:56:07] In the meantime, here's the related SUP deployment filter update https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/987783
[14:56:19] pfischer: sure, it's in cirrus at \CirrusSearch\Api\ConfigDump::$PUBLICLY_SHAREABLE_CONFIG_VARS
[14:57:03] 👁️
[15:10:48] apparently envoy is heavily cpu throttled (~1s) on the devnull consumer, might perhaps be something to investigate
[15:32:03] \o
[15:33:15] oh! hadn't thought to check the envoy instance for cpu throttling. Curious that it manages to use up its quota, I kinda expected that to be pretty lightweight
[15:35:14] o/
[15:35:35] yes, me as well, a bit surprised to see it struggling :/
[15:35:50] o/ cycling home, back in 30'
[15:46:38] telemetry reports some "retry overflows", which I'm not sure of the meaning of; according to https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/circuit_breaking there's an overall retry budget
[15:47:24] there is something curious happening with retries, the graphs look like they are connection errors between envoy and the application cluster
[15:47:39] at a rate of 10 (per second? minute? unclear)
[15:58:40] To retro or not to retro, that is the question...
[15:59:21] Whether 'tis nobler for the team to review the slings and arrows of outrageous fortune...
[15:59:32] lol, no clue
[16:00:39] :)
[16:01:32] we could try to bypass envoy (targeting https://mw-api-int-ro.discovery.wmnet:4446 directly) to confirm, and also see how the apache async http client behaves
[18:36:09] maven being annoying :P It's not seeing the updated elasticsearch connector with the RequestIndexer.flush method. But clearly CI can build, so it's available
[18:55:25] got my SSH keys and my pws... now I just need my Firefox profile ;)
[19:01:07] it's progress, nice! I don't actually think I have backups of my ssh keys...
[19:01:33] Me neither
[19:07:34] Mine are password-protected with a password that's not in my keychain, so I feel comfortable enough
[19:08:44] thankfully ssh keys aren't too crazy, can submit a patch to operations/puppet to update them, then the protocol is to video chat with someone who can merge and verify you are you
[19:09:15] * ebernhardson wonders how long until deep fakes make that questionable
[19:14:59] the backups themselves are encrypted too... I should probably get an SSD backup drive. 5400 RPM mechanical drives take a while to restore, as I've learned today ;)
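A minimal sketch of the cirrus-config-dump check discussed above (14:49–14:56), assuming Java 11+ and its built-in java.net.http client; the class name, the default arguments, and the substring check are illustrative only. A var that is not listed in \CirrusSearch\Api\ConfigDump::$PUBLICLY_SHAREABLE_CONFIG_VARS (such as CirrusSearchUseEventBusBridge at the time of this conversation) simply won't appear in the response.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Quick ad hoc probe: does a wiki expose a given CirrusSearch config var via
// action=cirrus-config-dump? Vars missing from $PUBLICLY_SHAREABLE_CONFIG_VARS
// are silently omitted, so "not exposed" may just mean "not publicly shareable".
public final class CirrusConfigDumpCheck {
    public static void main(String[] args) throws Exception {
        String wiki = args.length > 0 ? args[0] : "en.wikipedia.org";           // illustrative default
        String var = args.length > 1 ? args[1] : "CirrusSearchUseEventBusBridge";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://" + wiki + "/w/api.php?action=cirrus-config-dump&format=json"))
            .header("User-Agent", "cirrus-config-check (ad hoc debugging)")
            .GET()
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // Crude substring check; good enough to tell "present in the dump" from "absent".
        boolean exposed = response.body().contains("\"" + var + "\"");
        System.out.printf("%s on %s: %s%n", var, wiki, exposed ? "exposed" : "not exposed (or not public)");
    }
}
```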
[19:59:41] deployed change to skip merging the fields after doc fetch in the devnull consumer. Not clear yet if it made much difference, beyond the fetch bytes in and bytes out being very close now instead of expanding significantly
[20:01:54] That sounds like a positive for troubleshooting at the very least
[20:02:13] Off to appointment/dropping off laptop at Apple Store. Hoping that won't take too long
[20:02:16] yea, ruling things out is useful too :) Will let this run awhile, collect more stats
[20:08:51] Brian and I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/987189/1 and its predecessor patch. Currently haven't seen `elastic2087` join the cluster, might be some wonkiness related to the puppet7 migration
[20:35:04] Oh, duh. We didn't manually run puppet on the rest of the fleet, so it wasn't able to connect to the other nodes until the automatic puppet runs concluded
[20:48:17] maybe it is doing something: 24h ago we had some backpressure, but since updating it hasn't backpressured except for the initial startup
[21:05:10] ahh there we go, backpressured 100% :S
[21:30:34] connection pool still only ~30% full, cpu usage at ~5%. Was perhaps suspicious of checkpointing too often, but docs say state checkpointing is async with the built-in state backends, and our sink doesn't do anything on checkpoint either. Checkpoint duration does climb to ~1min now though, making me think there is some interaction
[21:31:48] perhaps ordering is also causing something, if the buffers aren't big enough
[21:42:47] since they are tied together, deployed with 2x fetch-retry-queue-capacity (name feels a little awkward, almost looks like the capacity of the retry queue)
[22:04:23] another thing I'd like to try is increasing the checkpointing interval, perhaps from 10s to 1m or even a couple minutes. reprocessing should be fine, and maybe we are having awkward interactions between long requests and frequent checkpoints? Not seeing anything clearly in flink docs to indicate this, more a random suspicion
[22:04:40] but seems I should wait to get data on the 2x capacity first
[22:07:13] hmm, maybe we can get better backpressure info by destroying devnull and starting it back one hour. right now it seems like we're waiting for something to push input event counts over ~60/s to see how badly it backlogs
[22:14:10] back
[22:34:10] * ebernhardson realizes he started the devnull consumer at 1y1h instead of 1h ago :P
[23:02:53] probably want to wait for moritz-m approval, but I have a patch up for adding the elastic components to the bookworm apt repo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/987859
[23:11:35] finished 1h backfill with 2x async capacity, minimal difference. throughput of perhaps 75 events/s. Rerunning with a 2m checkpointing interval and back to default async capacity
[23:14:21] ebernhardson: thanks! I added a few more panels to the SUP dashboard. Looks like wikidata is fetched from rather often *and* it's rather slow
[23:19:58] ouch, yea those wikidata renders are a bit long
[23:23:38] rate of event processing doesn't seem to be noticeably changed by the checkpointing interval either. Will let it run through the hour, but I would hope to see something useful in the first 15m
[23:42:47] Quick run, back in 30
[23:56:44] ebernhardson: This remains curious… I'll add a counter for retries inside the async operator and measure the time an event takes to move through that async operator. Hopefully we can track down what brings the rate down.
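Re: the 23:56 plan to add a retry counter and time events through the async operator — a minimal sketch of how that instrumentation could look in a Flink RichAsyncFunction, not the actual SUP operator. It assumes Flink's metric API plus flink-runtime's DescriptiveStatisticsHistogram (any Histogram implementation would do); the String in/out types, metric names, retry budget, and fetchAsync stub are all placeholders.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Histogram;
import org.apache.flink.runtime.metrics.DescriptiveStatisticsHistogram;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Sketch: count retries and record total wall-clock time an event spends in the
// async operator. String in/out keeps it self-contained; the real operator works
// on update events and fetched documents.
public class InstrumentedFetchFunction extends RichAsyncFunction<String, String> {
    private static final int MAX_ATTEMPTS = 3; // illustrative retry budget

    private transient Counter retries;
    private transient Histogram fetchLatencyMs;

    @Override
    public void open(Configuration parameters) {
        retries = getRuntimeContext().getMetricGroup().counter("fetch_retries");
        // 1000-sample window is arbitrary for this sketch.
        fetchLatencyMs = getRuntimeContext().getMetricGroup()
            .histogram("fetch_latency_ms", new DescriptiveStatisticsHistogram(1000));
    }

    @Override
    public void asyncInvoke(String event, ResultFuture<String> resultFuture) {
        attempt(event, 1, System.nanoTime(), resultFuture);
    }

    private void attempt(String event, int attemptNo, long startNanos, ResultFuture<String> resultFuture) {
        fetchAsync(event).whenComplete((result, error) -> {
            if (error != null && attemptNo < MAX_ATTEMPTS) {
                retries.inc(); // one tick per extra attempt (no backoff in this sketch)
                attempt(event, attemptNo + 1, startNanos, resultFuture);
            } else if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                // Total time the event spent inside the async operator, in milliseconds.
                fetchLatencyMs.update((System.nanoTime() - startNanos) / 1_000_000);
                resultFuture.complete(Collections.singleton(result));
            }
        });
    }

    // Stand-in for the real async HTTP fetch (MW API behind envoy, or direct).
    private CompletionStage<String> fetchAsync(String event) {
        return CompletableFuture.completedFuture("doc-for:" + event);
    }
}
```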
[23:58:59] pfischer: yea, something odd, but I'm still not sure what. Perhaps notable: the records-in rate of the source was reasonably flat when I ran 2x async capacity, versus the current test with longer checkpointing intervals. Maybe it helps a little to have more capacity, but not enough to get where we want to be
[23:59:45] pool usage is hard to say, it's quite spikey
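For reference, the two knobs being flipped in the experiments above (the async operator's capacity and the checkpoint interval) map to roughly this wiring in a Flink job — a sketch under assumptions, not the SUP's actual configuration; the source, timeout, capacity value, and the reuse of the InstrumentedFetchFunction sketch above are illustrative.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AsyncTuningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint interval: the experiment above moved this from 10s to 2m to see
        // whether frequent checkpoints interact badly with long in-flight fetches.
        env.enableCheckpointing(120_000L);

        DataStream<String> events = env.fromElements("event-1", "event-2"); // placeholder source

        // capacity is the number of in-flight requests the async operator buffers before
        // it backpressures (the "2x fetch-retry-queue-capacity" experiment above).
        SingleOutputStreamOperator<String> fetched = AsyncDataStream.orderedWait(
            events,
            new InstrumentedFetchFunction(),   // the sketch above, or any AsyncFunction
            30, TimeUnit.SECONDS,              // per-request timeout
            200                                // capacity: max concurrent/buffered requests
        );

        fetched.print();
        env.execute("async-tuning-sketch");
    }
}
```

One note on the ordering suspicion at 21:31: with orderedWait, results are emitted in input order, so a single slow fetch (e.g. a long wikidata render) holds back every later element that has already completed, up to the capacity limit — which might line up with the spikey pool usage and intermittent backpressure observed above.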