[09:07:33] morelike p95 latencies in codfw jumped from 460ms to 1.5s :/
[09:10:13] qps increased slightly but nowhere near what we serve from eqiad...
[10:16:54] search SREs, FYI I've updated the rolling-operation cookbook to adapt to spicerack's API, I've quickly tested with a dry-run and all seems good (it's also a very simple change). Let me know if I should do more invasive tests, or if you'll just do that the next time you need to use the cookbook. Change is:
[10:16:59] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1084734/2/cookbooks/sre/elasticsearch/rolling-operation.py
[10:17:26] related spicerack change: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/1084733/2/spicerack/elasticsearch_cluster.py
[10:22:57] volans: thanks!
[10:25:14] dcausse: should we be worried and do something?
[10:25:39] The increased latency seems to have started before the increase in QPS, so probably not related?
[10:26:23] gehel: for the morelike I'm starting an incident doc as suggested by Alex, recording a few findings in there and will share it in the sec channel
[10:29:50] thanks!
[10:30:44] gehel: anytime :) lmk if there is any issue with it
[14:16:23] o/
[14:23:57] * inflatador reads the incident doc
[14:26:52] \o
[14:32:46] .o/
[14:33:24] o/
[14:37:40] where should gerrit JVM packages be published in gitlab? A repo per project? A "gerrit published" repo?
[14:56:12] ebernhardson: naively I'd say a single gitlab project under search but haven't thought much about the consequences...
[14:58:32] it's not completely clear to me either, also still trying to figure out where i put secrets in jenkins
[15:03:18] didn't we have a sheet somewhere to re-evaluate the shard count of the search indices at some point?
[15:03:47] hmm, we've done it a time or two. It's probably been a while
[15:03:48] wondering if cebwiki morelike isn't slow simply because it might benefit from a higher number of shards
[15:06:05] shard sizes are not that different from what we have in enwiki but the doc count is very different, 3.8mil docs/shard on average for cebwiki_content vs 600k for enwiki_content
[15:07:14] interesting. How do we control that? CirrusSearch config?
[15:07:35] inflatador: yes, should be CirrusSearchShardCount or something like that
[15:08:32] morelike on cebwiki could be slow because of the language itself as well, unsure how to evaluate that...
[15:09:34] hmm, maybe. It seems unlikely that one language that's not all that visited would be pulling the stats down
[15:09:43] iirc cebwiki has lots of generated articles but not that much traffic
[15:10:14] poked at the graphs a bit, indeed nothing obviously changed other than the codfw morelike latency at ~17:40 :S
[15:12:51] can see in the per-node percentiles dashboard it's ~12 servers giving >1s responses, the rest are >200ms avg
[15:13:11] testing a few morelike queries there, they're all above 1s :(
[15:13:21] 12 matches the number of shards for cebwiki
[15:14:03] hmm, interesting. I guess we could check the mapping. Might be nice to have a tool that takes a list of elastic hosts and says what indices they share
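
A rough sketch of the tool suggested above: given a list of Elasticsearch hosts, print the indices that keep at least one shard on every one of them. The cluster URL, the prefix-style node-name matching, and unauthenticated access to the _cat API are assumptions for illustration, not details taken from the channel.

#!/usr/bin/env python3
"""Sketch: which indices have a shard on every one of the given hosts?

Assumptions (not from the channel): the cluster is reachable at
CLUSTER_URL, node names in _cat/shards start with the hostnames passed
in, and the _cat API is readable without auth.
"""
import sys

import requests

CLUSTER_URL = "http://localhost:9200"  # hypothetical endpoint; point it at the right cluster


def indices_shared_by(hosts: list[str]) -> set[str]:
    # One row per shard copy, with the index it belongs to and the node holding it.
    rows = requests.get(
        f"{CLUSTER_URL}/_cat/shards",
        params={"format": "json", "h": "index,node"},
        timeout=30,
    ).json()
    per_host: dict[str, set[str]] = {h: set() for h in hosts}
    for row in rows:
        for host in hosts:
            # Node names may carry a cluster suffix, so match on prefix;
            # unassigned shards have no node and are skipped.
            if row["node"] and row["node"].startswith(host):
                per_host[host].add(row["index"])
    return set.intersection(*per_host.values()) if per_host else set()


if __name__ == "__main__":
    print(indices_shared_by(sys.argv[1:]))

Taking the cluster URL explicitly also helps avoid the eqiad/codfw mix-up that comes up a few lines later.
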
[15:14:52] indeed
[15:18:32] the answer is, unless i did this wrong, no index shared by all 12
[15:21:38] wrong in the dumbest way possible, i queried eqiad and used codfw hosts :P
[15:21:40] {'enwiki_general_1727918668', 'commonswiki_file_1727868143', 'cebwiki_content_1728036753', 'wikidatawiki_content_1727942964'}
[15:22:13] so indeed, cebwiki has 12 shards and is on the 12 machines reporting high p95 latencies
[15:26:08] As far as I can see, weighted tags do not get ingested into search (see T378983 for some examples). In Growth, we've been waiting for the ingestion for more than 24 hours. Would it be possible to take a look, please?
[15:26:08] T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) - https://phabricator.wikimedia.org/T378983
[15:27:35] urbanecm: thanks, we'll triage this task today and take a look shortly
[15:27:55] thanks
[15:29:26] cc MichaelG_WMF ^^
[15:31:47] https://en.wikipedia.org/wiki/Cebuano_Wikipedia says most articles were created by bots? Not sure if relevant
[15:32:58] inflatador: this might explain why the wiki has a high number of articles compared to its index size (many short pages)
[15:33:55] school run, back in 30
[16:02:33] Trey314159, pfischer: triage in https://meet.google.com/eki-rafx-cxi
[16:09:46] I'm sorry, can't make it today
[16:42:43] * MichaelG_WMF reads up
[16:43:08] MichaelG_WMF: more in the ticket, looks like there is an event validation issue that we need to fix
[16:44:33] ebernhardson: thanks, then I'll focus on the ticket
[19:19:22] lunch/medical appointment. Back in ~90
[19:24:00] dinner
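
Appended note on the weighted-tags question raised at 15:26 (T378983): a quick way to check whether a page's tags actually reached the search index is to dump the indexed document through CirrusSearch's cirrusdoc API prop. A minimal sketch, assuming that prop is enabled on the wiki and that the tags end up in a weighted_tags field of the indexed source; both are assumptions, not confirmed in the channel.

#!/usr/bin/env python3
"""Sketch: fetch the document CirrusSearch has indexed for a page and print
its weighted tags, to confirm whether an update was actually ingested.

Assumptions (not confirmed in the channel): the wiki exposes the `cirrusdoc`
API prop and the tags live in a `weighted_tags` field of the indexed source.
"""
import sys

import requests


def weighted_tags(api_url: str, title: str) -> list[str]:
    resp = requests.get(
        api_url,
        params={
            "action": "query",
            "prop": "cirrusdoc",
            "titles": title,
            "format": "json",
            "formatversion": "2",
        },
        timeout=30,
    ).json()
    page = resp["query"]["pages"][0]
    # cirrusdoc returns one entry per search index the page is stored in.
    docs = page.get("cirrusdoc", [])
    return docs[0]["source"].get("weighted_tags", []) if docs else []


if __name__ == "__main__":
    # e.g.: python check_tags.py https://en.wikipedia.org/w/api.php "Some page title"
    print(weighted_tags(sys.argv[1], sys.argv[2]))
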