[09:17:55] pfischer: discovery-parent-pom has released successfully. Still waiting on the central sync: https://repo1.maven.org/maven2/org/wikimedia/discovery/discovery-parent-pom/
[10:00:07] lunch
[10:23:18] lunch 2
[12:44:21] o/
[13:20:43] Trey314159 I cancelled our mtg today, theoretically I will be with ServiceOps
[13:37:25] makes sense, inflatador
[15:08:17] \o
[15:08:30] o/
[15:16:44] * ebernhardson will someday remember to drop files from the rdf repository instead of the deploy repository
[15:18:15] :)
[16:54:59] how could i convince elasticsearch to do a really slow query? I'm trying to find some way to verify if elasticsearch will cancel the search task if one of the nodes throws EsRejectedExecutionException during query rewrite. Or maybe it's not that important? Poked a bit at the phased-query execution and there is some bit when it's merging shard results that will throw, but a bit hard to
[16:55:02] follow
[16:55:30] for a really slow query i would do something silly like start a 2 node cluster, run some query that sucks up cpu for 30s, and convince one of the two nodes to reject early and see if the other stops
[16:56:31] i'm suspecting it doesn't though, other things in elastic will return partial results and report 1 shard failed
[16:57:00] ebernhardson: not sure what you want to test?
[16:57:14] test if the slow query still runs in the background?
[16:58:03] when looking at timeouts there was some cancelation code IIRC
[16:58:38] dcausse: yes, i was curious whether, when the degraded query router throws the rejected execution exception (which results in elasticsearch returning the 429 http code), that will also short circuit the other shards or if they will continue burning cpu
[16:58:58] in the end it might not matter, they will all short circuit if the load is too high, and i probably can't do anything about it if they dont. Perhaps i'm just curious :P
[16:59:07] but in the best case if elastic wants to stop the query it'll be done on a best-effort basis, it won't call things like Thread.interrupt
[16:59:31] it's done when switching segments
[16:59:50] so a slow query has to be slow the "right way" :)
[17:00:27] i noticed while poking they now do more frequent cancel checks: "default true, Enables low-level, frequent search cancellation checks. Enabling low-level checks will make long running searches react to the cancellation request faster."
[17:00:35] i didn't look how frequent that is though :P
[17:02:15] thats search.low_level_cancellation
[17:02:27] it's the CancellableBulkScorer iirc
[17:03:01] ah did not know this setting
[17:04:36] the initial patch says it checks for each collected document
[17:04:53] oh wait, thats the old way it replaced
[17:05:26] "This commit changes the approach to wrap the bulk scorer rather than the collector and exponentially increase the interval between two consecutive checks in order to reduce the overhead of those checks"
[17:07:06] and yea, thats the CancellableBulkScorer
[17:08:05] ok so it's what we use at the moment
[17:08:49] * ebernhardson wonders what he broke in idea that the elasticsearch repo only shows top level files and no directories
[17:11:15] i suppose in the end it doesn't really matter, it's certain i can't do anything about it if it doesn't cancel
[17:14:04] yes... could you get all the data you need during the first rewrite (on the coordinator node)?
[17:15:48] hmm, i suppose it depends on what we want to monitor and reject by. Rejecting on node load average seems nice since it's flexible, but for our use case where a single index dominates all cpu usage i suppose it could somehow maintain the enwiki_content query rate estimate on each node and reject during coordination
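For illustration, a minimal sketch of the "load average vs # cores" signal being discussed, polled from the standard _nodes APIs; the cluster URL, the threshold, and the idea of doing this from an external script are assumptions for illustration, not how the degraded query router actually works.

```python
# Hedged sketch: compare each node's 1-minute load average against its
# allocated processor count. Threshold and URL are made up for illustration.
import requests

CLUSTER = "http://localhost:9200"  # placeholder, not the production endpoint
THRESHOLD = 1.5                    # hypothetical: flag above 1.5x core count

stats = requests.get(f"{CLUSTER}/_nodes/stats/os").json()["nodes"]
info = requests.get(f"{CLUSTER}/_nodes/os").json()["nodes"]

for node_id, node_stats in stats.items():
    load_1m = node_stats["os"]["cpu"]["load_average"]["1m"]
    cores = info[node_id]["os"]["allocated_processors"]
    if load_1m > THRESHOLD * cores:
        print(f"{node_stats['name']}: load {load_1m:.1f} on {cores} cores, would reject")
```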
[17:16:28] i'm not entirely sure how that would happen, some sort of service would need access to the cluster service and make regular queries similar to what mjolnir was already doing
[17:17:00] something about basing it off a hardcoded query rate seems more awkward than using system load average vs # cores though
[17:18:48] unrelated, enwiki_content reindexing just failed with: Reindex task was not successful: Failed: [{"index":"enwiki_content_1664293791","type":"_doc","id":"AVQXnGmF62ewIKYZMTMQ","cause":{"type":"mapper_parsing_exception","reason":"failed to parse field [_source] of type [_source] in document with id 'AVQXnGmF62ewIKYZMTMQ'. Preview of field's value:
[17:18:50] 'id'","caused_by":{"type":"mapper_parsing_exception","reason":"Field [_source] is a metadata field and cannot be added inside a document. Use the index API request parameters."}},"status":400},{"index":"enwiki_content_1664293791","type":"_doc","id":"AVQXnGH_62ewIKYZMTMP","cause":{"type":"mapper_parsing_exception","reason":"failed to parse field [_source] of type [_source] in document
[17:18:52] with id 'AVQXnGH_62ewIKYZMTMP'. Preview of field's value: 'id'","caused_by":{"type":"mapper_parsing_exception","reason":"Field [_source] is a metadata field and cannot be added inside a document. Use the index API request parameters."}},"status":400}]
[17:18:54] awkward paste :S
[17:19:33] why would the reindex api send the _source field :S suggests some doc got that in 6.x, and now 7.x notices and complains?
[17:19:39] weird
[17:20:23] should be findable in the dump I guess?
[17:20:24] i suppose i can iterate all docs in the index and see if somehow we submitted a _source field in a specific document, this was after working through 5.7M docs
[17:20:29] yea
[17:23:10] other random related ideas...that reminds me the slowest part of importing dumps to yarn is dealing with 30-60GB gzip'd files...should ponder if they could be reasonably chunked during dump
[17:24:01] or maybe bzip2 them so pbzip2 can be used to at least throw a bunch of cores at it
[17:24:16] you gunzip before importing?
[17:24:27] have to, otherwise you get a single yarn executor
[17:24:32] if it was bz2 it could be imported as is
[17:25:02] with a text file spark can chunk on a pattern, with gz it has to use a single executor
[17:25:14] yes with bz2 it can too
[17:25:30] ahh, i hadn't seen that for bz2, ok thats a way forward
[17:25:37] rdf dumps are bz2 and we can process them concurrently
[17:26:15] might need some special handling to not split between the command line and the data line
[17:26:57] there's some code in the rdf-spark-tools that joseph wrote to tell hdfs to split the way you want
[17:27:24] yea i have a similar thing in my cirrus2hive scripts, it uses: RECORD_START = '{"index":'
[17:27:26] maybe not ideal :P
[17:27:39] that should work :)
[17:28:09] i suppose actually it's a \n before RECORD_START as well, no clue why i put the \n in another bit of code that uses RECORD_START
[17:28:53] for funsies, the input format then throws that away and i have to prepend it again during processing. silly but works :)
[17:29:59] alrighty, i guess i have some things to work through
[17:30:11] thanks!
[17:32:26] yes we re-add the separator too, it's weird... :)
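A rough sketch of the record-delimiter trick just described, in PySpark: tell the TextInputFormat to split on RECORD_START so a bz2 dump can be read by many executors, then re-add the separator that the input format strips. The path is a placeholder and this is not the actual cirrus2hive or rdf-spark-tools code.

```python
# Hedged sketch, assuming a bulk-format cirrus dump ({"index":...}\n{doc}\n ...).
from pyspark.sql import SparkSession

RECORD_START = '{"index":'

spark = SparkSession.builder.appName("cirrus-dump-import").getOrCreate()
sc = spark.sparkContext

# A custom record delimiter lets Hadoop split the (splittable) bz2 file into
# one record per index-action + document pair instead of one record per line.
rdd = sc.newAPIHadoopFile(
    "hdfs:///path/to/enwiki-content-dump.json.bz2",  # hypothetical path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\n" + RECORD_START},
)

def restore(chunk):
    # The input format throws the delimiter away, so every record after the
    # first needs RECORD_START prepended again before it parses as JSON.
    return chunk if chunk.startswith(RECORD_START) else RECORD_START + chunk

records = rdd.map(lambda kv: kv[1]).filter(lambda c: c.strip()).map(restore)
```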
[17:49:10] * ebernhardson wonders if it's even worthwhile to read the dumps...or if yarn should just search_after on the elasticsearch clusters directly
[17:49:43] i guess that would be back to single-threaded though :)
[17:55:00] apparently we could search_after along with passing preference=_shards:n and run one task per shard
[18:17:58] wrote a silly python async script to search_after shards in parallel and look for things with _source...something funky has happened. This is an indexed document: https://search.svc.eqiad.wmnet:9243/enwiki_content/_doc/AVQXnGH_62ewIKYZMTMP
[18:18:48] it's a search query that got indexed?
[18:20:12] ryankemper: I might be a few minutes late for the pairing session (Oscar's birthday today)
[18:21:05] dunno if it's worthwhile to dig into, i only see 2 so far (but it's still searching)....might just delete them and call it a day unless it happens again in the future
[18:23:24] gehel ryankemper ebernhardson Service Ops team is mostly gone, if you feel like working on their prometheus config would be appropriate for pairing LMK. Otherwise I'll tap out
[18:23:33] https://phabricator.wikimedia.org/T318705 ticket I'm working on ATM
[18:24:15] glad to see i'm not the only one wondering if something produces too many metrics, makes it feel less like a waste of time to prune to the right set :)
[18:27:11] I'm unsure if this is a straight copy/paste into the k8s-specific prometheus config or if there's some different namespace that I should be targeting
[18:27:35] 3' late to pairing
[18:27:55] it'll go somewhere around here, methinks: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/prometheus/k8s.pp#L222
[18:34:32] I'm there
[18:35:50] gehel Just got a phone call, gonna tap out for now
[18:36:01] inflatador: ack
[19:01:42] ouch https://search.svc.eqiad.wmnet:9243/enwiki_content/_doc/AVQXnGH_62ewIKYZMTMP is worrisome
[19:02:45] dcausse: I think that Erik was just talking about that
[19:03:53] I hope it's a mistake we made manually and not cirrus transforming search requests into indexed docs
[19:06:49] it's a morelike on 4848272 (Donald Trump) hopefully that's us testing something
[19:11:52] seems like a *very* old query, the source does not extract namespace_text which was added in 2016 (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/320756)
[20:14:54] for now i've deleted the docs and will hope it was something old. The same docs were found on cloudelastic which means they should be pretty old
[20:15:16] same id's and all
[20:17:56] i'm not sure what the alert from prometheus about mjolnir update failures was, the failed graph in grafana (same data source as the alert) is completely flat
[20:19:06] oh, no not flat ... it's only 0 on the 7-day graph. The graphs are misleading :( on the last hour there was a brief blip of failures amounting to ~1k updates
[20:19:20] i suppose will re-ship that hour to be sure
[20:23:46] hmm, the total value on that dashboard is completely useless. It changes depending on the time range of the graph. Narrowed to a 5 min graph it claims 54.7k ops failed, at 1hr 2.74k, at 7 days 0
[21:48:55] started up the reindexes for enwiki, all three clusters. so assuming it doesn't fail this time (i deleted the bad docs) it should have the new shard/replica counts soon-ish
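Roughly what the "silly python async script" from 18:17:58 amounts to: one task per shard via preference=_shards:n, paging with search_after, and flagging docs that carry a literal _source field. aiohttp, the shard count, and the _id sort are assumptions, not the actual script.

```python
# Hedged sketch: parallel per-shard scan for documents containing a _source field.
import asyncio
import aiohttp

CLUSTER = "https://search.svc.eqiad.wmnet:9243"
INDEX = "enwiki_content"
SHARDS = 16  # hypothetical; read the real count from the index settings

async def scan_shard(session, shard):
    search_after = None
    while True:
        body = {"size": 1000, "query": {"match_all": {}}, "sort": [{"_id": "asc"}]}
        if search_after is not None:
            body["search_after"] = search_after
        async with session.post(
            f"{CLUSTER}/{INDEX}/_search",
            params={"preference": f"_shards:{shard}"},
            json=body,
        ) as resp:
            hits = (await resp.json())["hits"]["hits"]
        if not hits:
            return
        for hit in hits:
            if "_source" in hit["_source"]:  # the bad docs embed a _source field
                print(shard, hit["_id"])
        search_after = hits[-1]["sort"]

async def main():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(scan_shard(session, s) for s in range(SHARDS)))

asyncio.run(main())
```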
[21:56:22] {◕ ◡ ◕}
[22:11:37] long as we are reindexing things...i've also started up a reindex in deployment-prep to finish T316711
[22:11:37] T316711: Reduce shard count on all wikis in beta cluster to 1 - https://phabricator.wikimedia.org/T316711
[22:13:19] * ebernhardson then realizes i dont have auto-tmux on deployment-prep :P
[22:44:06] meh, elastica's ResponseException doesn't give the http status code
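For reference, the bare API shape of the shard-count changes being reindexed above: create a new index with the target shard count and _reindex into it. Index names, settings, and the endpoint are placeholders (mappings omitted), and the real reindexes are driven by CirrusSearch's maintenance scripts rather than raw calls like these.

```python
# Hedged sketch of a reindex into a 1-shard index, per the beta-cluster task.
import requests

CLUSTER = "http://localhost:9200"                          # placeholder
OLD, NEW = "somewiki_content_old", "somewiki_content_new"  # hypothetical names

requests.put(f"{CLUSTER}/{NEW}", json={
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
})
task = requests.post(
    f"{CLUSTER}/_reindex",
    params={"wait_for_completion": "false"},
    json={"source": {"index": OLD}, "dest": {"index": NEW}},
).json()
print(task["task"])  # poll /_tasks/<id> to follow progress
```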