[10:21:45] pretty bad insomnia tn (maybe related to possibly being sifk
[10:22:10] ryankemper: ouch, take care!
[10:22:12] sick but can’t tell yet* will be out first few hours of morning
[10:22:21] lunch
[12:59:39] ryankemper: take care!
[13:00:05] Do we have an estimate of how many languages we support in Search?
[13:01:05] There are various levels of support. Probably "well supported", "we've made some effort", and "we have a default non-language-specific analysis chain that might or might not work"
[13:18:16] I think this is a question for Trey; if I had to answer I would look at https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/eb95f78d5dd142b16f32c5b425f017a806b1ba55/includes/Maintenance/AnalysisConfigBuilder.php#1679
[13:20:10] + the list in the $elasticsearchLanguageAnalyzersFromPlugins var
[14:59:50] \o
[15:01:50] o/
[15:02:57] was about to write a small maint script to reship some page_rerenders, mainly to reship the lexemes (T365692), but I think that might be useful in other situations
[15:02:58] T365692: PHP Notice: Undefined index: lexeme_language / lexical_category - https://phabricator.wikimedia.org/T365692
[15:14:21] main drawback is that it can only support already indexed pages for now, which I think is fine for this particular lexeme issue
[15:14:47] hmm, and the problem is we rejected all the updates with unrecognized fields?
[15:15:22] I don't think we reject them, we simply ignore the fields
[15:15:38] ahh
[15:15:54] newly created lexemes are there but without any of the important fields
[15:16:00] which makes them unfindable
[15:16:25] yea, makes sense.
[15:17:00] I should probably do a quick scan on index dumps in hive to see if there are others
[15:18:04] I guess this is a SUP equivalent of some portion of ForceSearchIndex, I suppose?
[15:18:24] unrelated, looks like cloudelastic finished reindexing. under a week
[15:18:36] i guess i have to check that it actually did what it was supposed to do :P
[15:18:49] yes, that would be it, but I might source the data from elastic itself... unsure
[15:18:52] nice!
[15:20:02] hmm, the goal is to reindex all lexemes? I guess the question is how does SUP know all lexemes, and yea elastic is the obvious answer. Except we know it's incomplete when this is being run
[15:20:50] i don't have a better answer :P
[15:20:59] :)
[15:21:16] api.php with continue options could probably scroll somehow, no clue how performant
[15:22:15] perhaps a list of page_id from a file? would be up to you to source that from the source you want? might not be practical but perhaps useful?
[15:22:48] hmm, wouldn't be too hard to source that from sql i suppose
[15:23:48] for api.php, looks like we can fetch lexemes 500 at a time from the allpages generator set to the lexeme ns
[15:25:46] so perhaps a simple python script shipping to eventgate and using the api might do the trick
[15:26:07] might be enough, fake the producer events
[15:26:53] the allpages generator looks like it shouldn't have any perf issues with deep scroll, it should be a point query and scan forward in the index
[15:27:02] ok
[15:27:33] I'll start with that and we'll extend as we see more use-cases
[15:28:13] sounds reasonable
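A rough sketch of the api.php + eventgate idea floated above, for reference. The allpages generator paging with continuation is standard Action API usage; the Lexeme namespace id, the eventgate endpoint, and the event payload are placeholders here, not the real SUP rerender schema, and would need to be checked before use.

```python
#!/usr/bin/env python3
"""Sketch: page through lexemes via the allpages generator and ship fake
producer events. Namespace id, eventgate URL and payload are assumptions."""
import requests

API = "https://www.wikidata.org/w/api.php"                  # assumed target wiki
EVENTGATE = "https://eventgate.example.invalid/v1/events"   # placeholder, not the real endpoint
LEXEME_NS = 146                                             # assumed Lexeme namespace id


def iter_lexeme_pages(session):
    """Yield batches of lexeme pages, 500 at a time, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "generator": "allpages",
        "gapnamespace": LEXEME_NS,
        "gaplimit": 500,
    }
    while True:
        data = session.get(API, params=params, timeout=30).json()
        yield data.get("query", {}).get("pages", [])
        if "continue" not in data:
            break
        params.update(data["continue"])


def ship_batch(session, pages):
    """POST a batch to eventgate; this payload is purely illustrative, the real
    script would emit properly-schemaed rerender/update events."""
    events = [{"page_id": p["pageid"], "page_title": p["title"]} for p in pages]
    session.post(EVENTGATE, json=events, timeout=30)


if __name__ == "__main__":
    s = requests.Session()
    for batch in iter_lexeme_pages(s):
        ship_batch(s, batch)
```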
[15:28:49] Do y'all think reindexing is a decent benchmark for the performance governor stuff? Re: T365814
[15:28:50] T365814: Test whether or not CPU performance governor helps Elasticsearch performance - https://phabricator.wikimedia.org/T365814
[15:29:59] I'd say that the decision trees built by mjolnir would be a better fit
[15:30:13] perhaps testing with varying window length?
[15:31:13] ACK. R450 is the chassis type we need to test... looks like cloudelastic1007-1010 would work
[15:31:24] hmm, yea, if you want to make the cpus work harder you'd want to give them more re-ranking work
[15:31:35] window length would be an easy one to increase
[15:31:53] if we want something representative, what else triggers alerts?
[15:32:12] I guess reindexing did manage to trigger more alerts from cloudelastic last week, so plausible
[15:32:17] we had a few search latency alerts yesterday
[15:37:21] I guess I missed them ;( Last alert I see is for cloudelastic old GC from the 25th
[15:38:47] i guess that's not a big deal, and the perf governor wouldn't change GC behaviour. It's just complaining because it's working hard (could maybe be tuned, but not concerned)
[15:39:13] and we don't have the regular alerts on cloudelastic... i need to do the full-cluster reindexes this week anyways though
[15:39:54] Plan is to review how things are working, turn on SUP writes for the rest of eqiad, turn off cirrus writes for eqiad, then start the reindexes
[15:40:02] i guess codfw can start any time
[15:40:09] Yeah, this is low priority, so if there are better things to do, by all means
[17:03:00] https://github.com/apache/airflow/pull/39900 Adding WMF to the list of Airflow users (assuming they approve)
[17:16:16] P63465
[17:16:35] extra fields from the cirrus dumps: https://phabricator.wikimedia.org/P63465
[17:16:44] getting another percentiles alert ;(
[17:20:39] well, labels_all should not be in the cirrus doc; it's populated by elastic but we set it to null from wikibase cirrussearch :(
[17:24:45] going to add it to the schema... not going to change the \Wikibase\Repo\Search\Fields\WikibaseIndexField interface to support this...
[17:25:46] looks like elastic1056's load is pretty high. Banning...
[17:29:21] https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38&from=now-15m&to=now looks like it helped (?)
[17:52:00] inflatador: https://grafana.wikimedia.org/d/000000486/elasticsearch-per-node-percentiles is probably a reasonable one to look at
[17:52:46] 56 was the highest but not too out of band; it seems from the sorted p95 that all the old machines are just a bit slower
[17:53:33] 53-66 all in the 170s, everything else >145
[17:54:02] err, <145
[17:55:34] dinner
[18:01:30] ebernhardson yeah, maybe there was something else going on... I agree the load wasn't crazy
[18:03:19] FWIW, response time seems like it's staying down. Will reboot/unban 1056 once I get back in ~45m
[18:03:24] sounds reasonable
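For context on the ban/unban of elastic1056 above: the normal route is the team's existing tooling, but at the Elasticsearch API level it amounts to a shard-allocation name exclusion, roughly as sketched here (cluster URL taken from the log; TLS/auth handling omitted and assumed to be handled elsewhere).

```python
#!/usr/bin/env python3
"""Sketch of banning/unbanning a node via allocation filtering, not the
team's actual tooling. Cluster URL is the eqiad endpoint referenced above."""
import requests

CLUSTER = "https://search.svc.eqiad.wmnet:9243"


def set_node_ban(node_pattern):
    """Exclude nodes matching node_pattern from shard allocation;
    pass None to clear the exclusion (unban)."""
    body = {"transient": {"cluster.routing.allocation.exclude._name": node_pattern}}
    r = requests.put(f"{CLUSTER}/_cluster/settings", json=body, timeout=30)
    r.raise_for_status()
    return r.json()


# ban:   set_node_ban("elastic1056*")
# unban: set_node_ban(None)
```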
[18:17:50] * ebernhardson hmms, 247 backfill operations for ~900 wikis (1024 - private). Perhaps parallelizing the backfills only partially made the difference. I suppose the other big difference is this runs backfill and reindex in parallel instead of sequentially, so it's always reindexing
[18:23:39] ebernhardson it's just you and me for pairing today. I moved by to 2 PM your time so I can help my son w/ something
[18:23:48] err... moved "back", that is
[18:23:50] inflatador: kk
[18:45:36] back
[20:16:12] Categories update lag is in a broken state across all of WDQS (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=categories). Is this a known issue?
[20:17:20] hmm, i don't think so
[20:17:42] ryankemper news to me. I know d-causse has been working on a cookbook for categories
[20:19:06] I also don't see any emails for this one
[20:21:05] I forwarded you one. I got one for each individual WDQS host so it was pretty easy to spot in my email :D
[20:22:36] no email? sounds like another task for alerts review ;(
[20:24:29] modules/profile/manifests/query_service/monitor/categories.pp
[20:35:45] cirrus writes are now off for page updates, except private wikis
[20:36:35] one thing this might change is the latency percentiles we observe; i suppose this is removing a bunch of requests that would usually pull them down
[20:38:57] interesting. Do we have a replacement for those metrics on the SUP side?
[20:39:49] we have latency information there too, although this is slightly different. I mean our top-level percentile metrics (that alert) are not bucketed at all, it's simply all requests that cirrus sends. And this removes a few hundred per sec
[20:41:32] * ebernhardson also wonders what saneitizer does with the empty cluster list
[20:45:51] ebernhardson the alert you're referring to is "CirrusSearch full_text eqiad 95th percentile" and the like? Just trying to wrap my head around this
[20:46:00] oh, maybe we did change it at some point
[20:46:40] maybe my memory is just failing; we used to have alerts on the top right graph of https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1
[20:48:25] hmm, that's a good idea. Maybe we could bring that back
[20:48:51] we might have switched to full_text because it's more representative of users. As i mentioned, this had all the indexing requests mixed in :)
[20:48:58] * ebernhardson was looking for why, didn't find it yet
[20:54:52] i did find another alert that's going to go off in 30 minutes though :P
[20:55:57] pre-crime ;P
[20:57:55] hmm, i suppose i shouldn't just delete it, and should instead add a comparable SUP metric...
[21:04:40] ryankemper we're in pairing if you wanna join
[21:05:22] Cooking food but will drop by in 15’
[21:10:59] ACK
[21:11:18] inflatador: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1036752
[21:12:14] `/usr/local/bin/foreachwikiindblist `
[21:20:09] inflatador: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1036754
[21:20:24] inflatador: https://grafana-rw.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?orgId=1&viewPanel=50&forceLogin=true
[21:23:57] ebernhardson: https://integration.wikimedia.org/ci/job/alerts-pipeline-test/1748/console
[22:01:40] back
[23:09:32] curious, two big latency spikes that alerted. The QPS rates for cirrus don't change, but the search threadpool went from ~500 to 1.2k for ~20 minutes
[23:09:46] suggests it was doing something very expensive
[23:21:35] interesting, it's something to do with the commonswiki file index. The per-index time-taken stats we added clearly show commonswiki going from ~40 cores to 800 cores
[23:22:03] still not sure what the queries are though :S
[23:57:37] well, no clue, but if it alerts again we can probably record `https://search.svc.eqiad.wmnet:9243/_tasks?detailed=true` every minute and figure out from there what was running that was odd
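A minimal sketch of that last idea: poll `_tasks?detailed=true` once a minute and dump each response to a timestamped file so the next spike can be inspected after the fact. The output directory and run-forever loop are arbitrary choices, not any existing tooling.

```python
#!/usr/bin/env python3
"""Sketch: snapshot the Elasticsearch task list every minute for later analysis."""
import json
import time
from datetime import datetime, timezone
from pathlib import Path

import requests

TASKS_URL = "https://search.svc.eqiad.wmnet:9243/_tasks?detailed=true"
OUT_DIR = Path("/tmp/es-task-snapshots")   # arbitrary location


def snapshot_once():
    """Fetch the detailed task list and write it to a timestamped JSON file."""
    resp = requests.get(TASKS_URL, timeout=30)
    resp.raise_for_status()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    (OUT_DIR / f"tasks-{stamp}.json").write_text(json.dumps(resp.json()))


if __name__ == "__main__":
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    while True:
        snapshot_once()
        time.sleep(60)   # once a minute, as suggested above
```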