[07:13:24] SUP failure this weekend: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2024.14?id=M3Vjt44B8T_a4T-eJkQe
[07:14:09] The request entry sent to the buffer was of size [8964209], when the maxRecordSizeInBytes was set to [8388688].
[08:00:15] hello, I noticed that there are 5 elastic hosts alerting for the ElasticSearch setting check 9600, not sure how urgent that is: https://alerts.wikimedia.org/?q=alertname%3DElasticSearch%20setting%20check%20-%209600
[08:00:50] volans: hey, thanks, looking
[08:01:13] thx :)
[08:03:26] the summary of this check is not super helpful :(
[08:04:43] the "cut" part is search).remote.omega.seeds
[08:04:46] because it has a space
[08:04:55] you can see it in icinga
[08:04:55] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=elastic2075&service=ElasticSearch+setting+check+-+9600
[08:05:16] if that helps
[08:05:24] kk, now I understand, thanks!
[08:05:51] diffing, it's complaining that elastic2047.codfw.wmnet:9500 and elastic2052.codfw.wmnet:9500 are there but should not be
[08:06:03] sorry, not a space: the separator for icinga between status and performance data
[08:06:21] it should be a pipe
[08:07:08] should not be urgent and can wait for Brian to wake up, I think it relates to the decommissioning of old hosts
[08:08:23] ack, then no prob
[08:08:35] the fix in puppet for the "cut" message should be done in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/icinga/manifests/monitor/elasticsearch/cirrus_settings_check.pp#13
[08:09:04] the $.(cluster|search) part is interpreted by icinga as a separator between the check result and the performance data
[08:10:08] ok, makes sense, we should perhaps craft a better description message?
[08:10:27] or escape the pipe so that icinga doesn't interpret it
[08:13:31] inflatador, ryankemper: re-opened T361305 because of the settings check that is complaining about elastic2047.codfw.wmnet:9500 & elastic2052.codfw.wmnet:9500
[08:13:31] T361305: decommission elastic20[37-54].codfw.wmnet - https://phabricator.wikimedia.org/T361305
[08:14:01] silenced the alert for 8 hours
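The pipe workaround raised at 08:10 is simple in principle: Icinga treats the first "|" in a check's output as the boundary between status text and performance data, so the "|" inside the "$.(cluster|search)" jsonpath truncates the summary. There is no escape sequence for "|" in Nagios-style plugin output, so checks typically substitute the character before printing. Below is a minimal sketch of that idea in Python; the function name and summary string are illustrative, not the actual cirrus_settings_check code.

    def safe_plugin_output(message, replacement="(pipe)"):
        """Replace the perfdata separator so Icinga displays the full summary."""
        return message.replace("|", replacement)

    # Illustrative only: a summary containing the jsonpath that was getting cut off.
    summary = "invalid cirrus setting at $.(cluster|search).remote.omega.seeds"
    print("CRITICAL - " + safe_plugin_output(summary))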
[09:17:39] pfischer: out of curiosity, where did you find that p50s of fetch latencies are around 5s?
[09:17:49] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?from=1712394883911&orgId=1&to=1712473511526&var-app=flink-app-consumer-cloudelastic&var-datasource=thanos&var-destination=All&var-kubernetes_namespace=All&var-prometheus=k8s&var-site=eqiad
[09:18:31] envoy telemetry suggests it's more in the 100ms-200ms range
[09:20:11] Interesting, I looked at the SUP dashboard. Maybe something is wrong with SUP's HTTP metrics?
[09:20:58] ok, tried to load this one but grafana is struggling a bit
[09:21:08] Most wikis should be way below 5s most of the time
[09:21:58] Yeah, too many wikis...
[09:22:19] yes... per-wiki metrics are always difficult to graph :/
[09:26:27] ah, I see it now, worst-case wiki for the period I'm looking at is eswikinews with p50s around 5.6s
[09:27:17] percentiles on a per-wiki basis might not be accurate, low-rate wikis might have very volatile percentiles
[09:58:38] lunch
[13:16:11] gehel: cancelled my vacation for today... extremely cloudy conditions, so I don't think I'm going to drive 3 hrs just to maybe see the eclipse
[13:19:43] I'll still be able to get ~90 seconds of totality
[13:20:33] also, 👀 on T361305
[13:20:34] T361305: decommission elastic20[37-54].codfw.wmnet - https://phabricator.wikimedia.org/T361305
[13:28:33] o/
[13:42:13] flink-app-consumer-cloudelastic looks to be missing its taskmanagers again?
[13:46:29] inflatador: yes, it failed during the weekend (https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2024.14?id=M3Vjt44B8T_a4T-eJkQe)
[13:46:52] The request entry sent to the buffer was of size [8964209], when the maxRecordSizeInBytes was set to [8388688].
[13:47:10] I guess that's not a memory issue then
[13:47:36] no, suspecting a problem in how we estimate cirrus doc size
[13:47:37] still, I'd expect an automatic recovery
[13:48:29] it tries to recover a few times (probably not enough), but I think this is an application bug, so it would be in a never-ending crashloop in this case without a fix
[13:49:01] ah OK, understood
[15:00:38] ahh okta... logged me out at exactly 8 am :P
[15:00:52] always
[15:00:54] :)
[15:04:05] gehel: triage / planning meeting
[15:04:43] I got it, ebernhardson
[15:06:07] random idea... could we serialize the bulks into strings early, and then count the size of the strings?
[15:06:13] maybe that's inside the client though
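The "serialize early and count" idea above would look roughly like the sketch below, written in Python purely for illustration (the SUP is a Flink/Java application and its client may frame bulk requests differently). The 8388688-byte limit is the maxRecordSizeInBytes from the weekend failure; the function names are hypothetical. Measuring the serialized form side-steps the under-estimation suspected in the cirrus doc-size heuristic, at the cost of serializing twice unless the client can reuse the string.

    import json

    MAX_RECORD_SIZE_BYTES = 8_388_688  # maxRecordSizeInBytes from the weekend failure

    def bulk_entry_size(action, doc):
        """Serialize one bulk action/document pair as compact NDJSON and
        measure its UTF-8 byte length, trailing newlines included."""
        action_line = json.dumps(action, separators=(",", ":"))
        doc_line = json.dumps(doc, separators=(",", ":"))
        return len(action_line.encode("utf-8")) + len(doc_line.encode("utf-8")) + 2

    def fits_in_buffer(action, doc):
        """Check against the sink's record size limit before handing the entry over."""
        return bulk_entry_size(action, doc) <= MAX_RECORD_SIZE_BYTES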
[15:13:32] ebernhardson: I'm on vacation this week! I'm sure you'll survive triage without me!
[15:59:04] heading out early, back later tonight
[15:59:10] dcausse: SUP is up and running again.
[15:59:15] nice!
[15:59:23] Increased the limit to 9 MB, let's see
[16:05:26] curious, the cirrussearchrequests logger sets meta.dt from the global clock, and it doesn't create the events until it's actually flushing. The event times should be lined up with realtime, even if the actual query happened a while ago
[16:16:12] looks like the CCS settings are present on the omega cluster, but in the wrong place
[16:23:49] ebernhardson: How often does the cirrus request logger flush? Does that happen asynchronously (in a PHP application)?
[16:26:14] lunch, eclipse, etc... back in ~3h
[16:26:53] pfischer: it happens when the process shuts down, in something we call DeferredUpdates
[16:27:31] pfischer: in normal operation what happens is the web request closes, and then deferred updates run during shutdown. Maintenance scripts do the same, but it ends up being odd because it was intended to work with things that run for a few hundred ms
[16:28:55] Do we know how much we overdraw the lifetime of a request? Like a CPU-time per request ratio?
[16:29:16] (just out of curiosity)
[16:29:45] Hmm, I don't know if that's tracked anywhere, I haven't seen it in the perf dashboards
[16:44:50] oh, perhaps I was being unclear, by a few hundred ms I mean it expects to queue up tasks over a few hundred ms, then run on shutdown. A maint script might queue up tasks for 14 hours and then flush
[16:50:10] Ah, alright. Thank you.
[17:30:49] back, WF coffee shop. About an hour before the eclipse and it's still very cloudy
[17:35:51] clear skies, but only expecting 30% here :P
[17:39:15] * ebernhardson wonders how to make an unrenderable page in mediawiki
[17:44:57] hmm, apparently the easiest way is rev-delete and then try to render by rev_id
[17:46:06] OK, looks like CCS settings go into `persistent.cluster.remote` (re https://www.elastic.co/guide/en/elasticsearch/reference/7.10/modules-cross-cluster-search.html); codfw psi also has stuff in persistent.search.remote, which I believe is harmless but is triggering those alerts
[17:48:43] inflatador: that would perhaps be some old BC, in either 6 or 7 they renamed it
[17:48:51] probably safe to remove the old
[18:09:30] ah OK, the old cluster settings are unrelated to the alert. Looks like psi has outdated CCS settings for omega
[18:20:44] hmm, it doesn't like the way I'm adding the cluster settings, even though it accepts them... checking
[18:41:22] The totality came and went ;( . It got super dark at 1:34 PM local time. Only got a glimpse of the eclipse for a few seconds though. Too many clouds
[18:46:22] cluster seeds are fixed for CODFW psi as well.
[19:54:59] * ebernhardson remembered the deploy window this time!
[19:55:33] Nice, so we'll get your new APIs?
[19:58:37] no :P those are on the normal train. This is moving queries back to codfw after they were moved away. Also hopefully a fix for T359215
[19:58:37] T359215: mediawiki_cirrussearch_request data is regularly late - https://phabricator.wikimedia.org/T359215
[21:58:31] ebernhardson: Got it. BTW: Running SUP against cloudelastic with your resources patch and some more tweaks is now pushing 1.6k actions/s towards ES, which is twice as much as before. Drastically increasing the queue for in-order fetches also reduces the back pressure coming from that part of the graph
[22:00:45] lag recovery is down to 2.7 days (from >6). I still wonder why the kafka partition sizes are skewed so much that a single partition/consumer is under disproportionately high load
[22:01:34] lag recovery for everyone else is below 5h or 0s already
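One way to confirm the partition skew mentioned at 22:00 would be to compare per-partition offset spans on the update topic. A rough sketch assuming kafka-python follows; the broker address and topic name are placeholders, and retention/compaction means the offset delta only approximates the number of records per partition.

    from kafka import KafkaConsumer, TopicPartition

    TOPIC = "example-update-topic"  # placeholder, not the real topic name

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")  # placeholder broker
    partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
    begin = consumer.beginning_offsets(partitions)
    end = consumer.end_offsets(partitions)
    for tp in sorted(partitions, key=lambda t: t.partition):
        # A heavily skewed distribution here would explain why a single
        # partition/consumer lags far behind the others.
        print(f"partition {tp.partition}: ~{end[tp] - begin[tp]} records")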