[08:41:14] pipeline failed again because of the request size but now it was rejected by nginx I think with "413 Request Entity Too Large"
[08:55:47] hm gitlab does create an MR with a description that has nothing to do with my branch... wondering what I'm doing wrong
[08:59:25] it's me failing to realize that it was not properly rebased....
[09:03:17] pfischer: I'm looking at standup notes "We now capture the http client metrics (https://grafana-rw.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater)", but that graph is empty.
[09:03:27] I'm not really sure where the data is supposed to be
[09:10:01] ryankemper: Could you add the result of your conversation with traffic to T351650 ?
[09:10:02] T351650: Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650
[09:14:42] ryankemper: did you update https://docs.google.com/spreadsheets/d/1Obj5ozGQYl7Zei0MBLELVD8eDGqqsF_t9T3ZbrOsmZg/edit#gid=0 based on the work in T351671 ?
[09:14:43] T351671: Service implementation for wdqs10[17-21] - https://phabricator.wikimedia.org/T351671
[09:16:19] ryankemper: also, did you create a task for dc-ops for decommissioning wdqs10(09|10) ?
[09:43:40] Weekly updates published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2023-12-15
[10:03:37] gehel: re http metric, sorry it's me creating this link, filters are not preset on this dashboard, saving the dashboard now with proper filter values set on k8s-staging
[10:04:51] dcausse: I’m on the size estimate fix, will create a PR shortly
[10:05:37] pfischer: nice, I have https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/86 in the meantime
[10:08:11] Thanks! But does any time/number-of-actions based limit work at all? As soon as one large document (4mb) enters a bulk, that would blow up. As far as I understood Erik, the limit for incoming requests was 5mb
[10:10:16] 5mb? I think that was the default for setBulkFlushMaxSizeMb?
[10:10:37] looking at nginx config but I believe the max it accepts is around 100Mb
[10:11:35] yes I see "client_max_body_size 100m;"
[10:12:20] if it returned 413 Request Entity Too Large then it means elastic managed to craft a request that's > 100m
[10:13:29] You’re right, I checked the IRC logs and it failed with ~128m
[10:14:15] the elastic http client might have another hard limit somewhere, cause I remember Erik saying that this 128m request did not even reach elastic
[10:14:42] Does envoy limit us in any way?
[10:15:25] Erik encountered an OoM, too
[10:15:33] hm I don't think so? at least not below 100m otherwise we might not have reached nginx when it returned "413 Request Entity Too Large"
[10:15:48] on the elastic-client I hope?
[10:15:50] After restarting after the pipeline failed due to the 413
[10:16:02] - yes on the client
[10:16:45] yes perhaps retrying such a large request might have caused too much gc pressure?
[10:44:32] pfischer: please lemme know if you have objections to me shipping https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/86 to get the pipeline running again
[10:44:37] errand+lunch
[10:45:27] dcausse: I am sorry, no, please go ahead (merged it already)
[12:06:20] dcausse: bulk size calculation is ready: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/87
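
For context on the flush limits discussed above, a minimal sketch of how they are set on the stock Elasticsearch7SinkBuilder from flink-connector-elasticsearch, following the pattern from the Flink documentation; the host, index name, threshold values and document shape are placeholders, not the updater's actual configuration:

```java
import org.apache.flink.connector.elasticsearch.sink.Elasticsearch7SinkBuilder;
import org.apache.flink.connector.elasticsearch.sink.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;

import java.util.Map;

public class BulkFlushConfigSketch {

    /** Builds a sink whose bulk flushing is bounded by action count, size and time. */
    public static ElasticsearchSink<String> buildSink() {
        return new Elasticsearch7SinkBuilder<String>()
                .setHosts(new HttpHost("localhost", 9200, "http")) // placeholder endpoint
                .setBulkFlushMaxActions(1000)  // flush after this many actions...
                .setBulkFlushMaxSizeMb(5)      // ...or once the bulk exceeds ~5 MB...
                .setBulkFlushInterval(10_000L) // ...or every 10 seconds, whichever comes first
                .setEmitter((element, context, indexer) -> indexer.add(createIndexRequest(element)))
                .build();
    }

    private static IndexRequest createIndexRequest(String element) {
        // Placeholder index name and document shape for illustration only.
        return new IndexRequest("my-index").source(Map.of("content", element));
    }
}
```

Note that the underlying bulk processor only checks these thresholds after an action has already been added, so a single oversized document can still push one bulk request past the configured size limit.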
[13:56:39] .o/
[13:57:16] o/
[13:59:29] pfischer: nice, started to have a look, IIUC you introduced a "flush" function to RequestIndexer? if yes could you push the code somewhere
[14:21:53] Sure, there’s an upstream PR: https://github.com/apache/flink-connector-elasticsearch/pull/85 (https://issues.apache.org/jira/browse/FLINK-33857)
[14:30:43] dcausse: latest flink-connector-elasticsearch:3.0.1-PATCHED contains both changes (the merged BulkInspector and this WIP one)
[15:03:51] Hm, sonar is not happy with volatile primitives and wants us to use their Atomic wrappers instead. Our coding conventions do not say anything about it. Does anybody have an opinion?
[15:17:04] pfischer: I guess it depends, with volatile I guess it's "easier" to have broken concurrent logic while Atomic classes tend to encourage you to use getAndSet
[15:17:46] here I thought that this class was mainly called from the mailbox thread and haven't noticed the volatile tbh :/
[15:18:04] looking again
[15:22:43] pfischer: if your class is called concurrently then I think the logic might not be correct and I think that checkCapacity and flush would have to be synchronized somehow
[15:25:49] Hm, I followed the implementation of ElasticsearchWriter; they work with volatile. There shouldn’t be concurrent access after all since we have 1 estimator per emitter and only one emitter per operator.
[15:27:44] For the underlying BulkProcessor/BulkRequest it’s different since both limits and checkpoints may trigger a flush. But the RequestIndexer is only called from our codebase.
[15:28:15] the ElasticsearchWriter volatile kind of makes sense since they might be polled by the thread collecting metrics, and the closed bool flag makes sense to be volatile I guess
[15:30:06] I’ll get rid of it for the estimator.
[15:30:53] sure
[15:33:53] ElasticsearchWriter.pendingActions is not volatile so most likely all this is called from the same thread
[15:34:06] Done, hopefully sonar and javadoc are happy now.
[15:38:39] - my bad, I wonder where I saw those volatile counters…
[15:43:37] /clear
[15:52:46] dcausse: build passed, do you want to give it a try, or shall we wait until Monday?
[15:53:12] pfischer: we should try it and let it run for the weekend?
[15:54:59] sonar seems to have cached the connector artifact?
[15:55:16] cannot find symbol method flush()
[15:56:52] I approved the MR, please feel free to merge anyways and ship it
[15:56:57] Yeah, I noticed. I’d have to update the included script to run with `-U` (hoping that this will fetch -PATCHED classified artefacts too)
[15:57:36] Maybe there’s a way to force clearing GitLab CI caches somehow?
[15:57:50] bump the client jar version?
[15:58:08] s/client/connector/
[16:05:32] Hm, that should work. Main pipeline failed too, I’ll fix that tonight and deploy it afterwards. Have to leave for now.
[16:05:42] workout, back in ~40
[16:56:52] ryankemper I didn't think of this until last night, but the massive amount of 500s generated by the LDF endpoint checks will affect our SLOs... I can help make adjustments later or next wk if need be
[21:30:39] back
[22:35:39] gehel: yeah `wdqs10[09,10]` is https://phabricator.wikimedia.org/T353482, and I added it to the EOL spreadsheet
[22:46:11] ryankemper: thanks !
[22:46:33] Will need to look over the operational excellence spreadsheet and update counts accordingly
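
For illustration, a rough sketch of the kind of byte-size estimator discussed above (the checkCapacity/flush logic around MR 87); the class name, threshold handling and flush callback are assumptions rather than the actual merge request code, and the comments spell out why a plain non-volatile counter is enough on the single mailbox thread:

```java
/**
 * Illustrative sketch only: tracks an estimate of the bytes queued for the next
 * bulk request and forces a flush before a new action would push it past a limit.
 */
class BulkSizeEstimator {

    private final long maxBulkSizeBytes;
    // Stand-in for RequestIndexer#flush() from the patched connector (FLINK-33857).
    private final Runnable flush;

    // Plain field, not volatile or AtomicLong: it is only read and written by the
    // single mailbox thread driving the emitter, so neither cross-thread visibility
    // nor atomic read-modify-write is required.
    private long estimatedBulkBytes;

    BulkSizeEstimator(long maxBulkSizeBytes, Runnable flush) {
        this.maxBulkSizeBytes = maxBulkSizeBytes;
        this.flush = flush;
    }

    /** Accounts for one action, flushing first if adding it would exceed the limit. */
    void checkCapacity(long estimatedActionBytes) {
        if (estimatedBulkBytes > 0
                && estimatedBulkBytes + estimatedActionBytes > maxBulkSizeBytes) {
            flush.run();
            estimatedBulkBytes = 0;
        }
        estimatedBulkBytes += estimatedActionBytes;
    }
}
```

An AtomicLong (or a volatile field) would only be justified if another thread, such as a metrics reporter, read the counter concurrently, which is the situation dcausse describes for the connector's own writer metrics.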