[04:19:14] ebernhardson: I deleted `be_x_oldwiki_titlesuggest_1659407912` since it was making the codfw es cluster red. The current alias is pointing to `be_x_oldwiki_titlesuggest_1658396688`, so I think `be_x_oldwiki_titlesuggest_1659407912` was created by the reindex and got interrupted by the reimage of `elastic2059`
[10:23:58] lunch
[10:35:37] lunch
[11:40:27] errand
[12:43:39] hm... not sure I understand the shape of this graph: https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=cpjobqueue-cirrusSearchElasticaWrite&from=now-90d&to=now
[12:43:50] kafka consumer lag for cloudelastic
[12:46:20] feels like it resets to the earliest offsets every 15 days or so
[12:57:11] greetings
[12:59:03] o/
[13:05:49] ^^ weird sawtooth pattern happening there
[13:06:50] inflatador: do you know if cloudelastic is back to normal?
[13:07:41] dcausse: as far as I know, we finished the restore 2 weeks ago or so
[13:07:46] https://phabricator.wikimedia.org/T309648
[13:07:57] ok
[13:08:17] something weird then, I might be misinterpreting the graph tho
[13:09:23] that is definitely weird, although I have no idea how to troubleshoot that one
[13:17:11] I would blame the jobqueue, it's doing 10 concurrent jobs
[13:17:20] https://grafana-rw.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite
[13:42:32] is that something we can scale up, or is it even our responsibility?
[13:44:29] inflatador: I think we tried, but it's not as simple as tuning a knob apparently
[13:44:56] related: T300914
[13:44:56] T300914: cpjobqueue not achieving configured concurrency - https://phabricator.wikimedia.org/T300914
[13:48:18] ah yes, that looks familiar
[13:50:04] I added a comment to the ticket, should I reach out to Service Ops or anything?
[13:54:37] inflatador: if you could, yes; also to confirm that it's changeprop that resets the offsets to "earliest"
[13:55:25] this means "data loss" for cloudelastic
[13:58:27] dcausse: I pinged in #wikimedia-serviceops
[13:58:35] inflatador: thanks!
[14:13:16] also, the last CODFW host is installing bullseye now!
[14:13:46] \o/
[14:15:27] congrats!
[14:19:14] \o
[14:19:28] that cloudelastic graph is concerning :( something looks entirely wrong there
[14:25:39] o/
[14:26:33] yes, cloudelastic is likely out of date
[14:26:49] without easy remediation
[14:29:25] job times do not seem crazy compared to cirrusLinksUpdate
[14:30:02] it's the concurrency of cirrusElasticaWrite that seems too low
[14:30:14] do we track cluster-wide indexing.index_total anywhere?
I mean I suppose we know the answer, but was curious how much slower indexing is
[14:30:32] it's hardly visible sadly
[14:32:09] https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&var-cluster=cloudelastic&var-exported_cluster=cloudelastic-chi&viewPanel=12&editPanel=12
[14:33:16] 800-1k in the main clusters, 100 in cloudelastic :S
[14:34:44] I wonder if elasticsearch_indices_indexing_index_total is per replica
[14:34:57] dcausse: there should be two values, one for primaries and one for total
[14:34:57] but still
[14:35:02] ah
[14:36:42] not sure how to even know which direction to point the finger... cloudelastic or the queue writing to it
[14:37:35] I would blame changeprop concurrency
[14:37:59] yea, cloudelastic doesn't seem to have other metrics that make it look particularly struggling
[14:39:32] ebernhardson: looks like joe responded in #wikimedia-serviceops, you might check in there
[14:46:57] ebernhardson: do you remember when you "re-inlined" the writes for the main cluster?
[14:48:23] dcausse: hmm, looking
[14:48:35] looks like it was quite a long time ago: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/765577
[14:48:43] Feb 24
[14:49:21] yea, the commit in cirrus is Feb 15th, would have gone out sometime in Feb
[14:49:33] well, probably. sometimes git dates are way off for other reasons, but often close
[14:49:39] sure
[14:49:59] so in April cloudelastic was alone on cirrusElasticaWrite
[14:51:40] can we ignore ordering for cloudelastic writes? Right now that job is partitioned by a key, and that means a single partition in cpjobqueue has to do all the work
[14:52:08] if we instead randomly partition it (which would lose isolation, except cloudelastic is the only one so it's separately isolated) then multiple cpjobqueue threads can index at the same time
[14:52:27] thread is the wrong word there... I mean consumers
[14:52:46] (nodejs is already parallelizing internally within a single process, but no threads, it's all async)
[14:52:56] There's PDU maintenance happening right now in CODFW, will probably be shutting down servers soon: https://phabricator.wikimedia.org/T309957
[14:53:56] ebernhardson: it's only bad for deletes I think
[14:54:11] part of the problem with cpjobqueue, as it was previously explained, is that it has no concept of load balancing. All processes subscribe to all possible topics and the kafka consumers battle it out over who gets to subscribe to what
[14:54:32] so you could have all the high-volume topic partitions on a single consumer, and it wouldn't know or have any way to deal with that
[14:54:56] how do they throttle a particular consumer then?
[14:55:33] I could be mistaken, but I think it's only through the concurrency parameter, which declares how many jobs are allowed to be in-flight for a single topic-partition
[14:55:48] ok
[14:56:03] and it's already high for this particular job?
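For the cluster-wide indexing.index_total question above (14:30–14:35), here is a minimal sketch of reading the counters directly from the Elasticsearch stats API instead of through Prometheus: `_all.primaries` vs `_all.total` gives the primaries-only vs primaries-plus-replicas split mentioned at 14:34, and sampling twice turns the cumulative counters into a rough docs/sec rate. The cluster URLs are placeholders, not the real endpoints.

```python
#!/usr/bin/env python3
"""Rough docs/sec indexing rate from _stats/indexing (primaries vs total)."""
import json
import time
import urllib.request

# Placeholder endpoints; substitute the real cluster URLs.
CLUSTERS = {
    'cloudelastic': 'https://cloudelastic.example.org:9243',
    'eqiad': 'https://search.eqiad.example.org:9243',
}
INTERVAL = 60  # seconds between the two samples


def index_totals(base_url):
    """Return (primaries, total) cumulative index_total for the whole cluster."""
    with urllib.request.urlopen(base_url + '/_stats/indexing') as resp:
        stats = json.load(resp)['_all']
    return (stats['primaries']['indexing']['index_total'],
            stats['total']['indexing']['index_total'])


def main():
    before = {name: index_totals(url) for name, url in CLUSTERS.items()}
    time.sleep(INTERVAL)
    for name, url in CLUSTERS.items():
        p0, t0 = before[name]
        p1, t1 = index_totals(url)
        print(f'{name}: {(p1 - p0) / INTERVAL:.1f} docs/s on primaries, '
              f'{(t1 - t0) / INTERVAL:.1f} docs/s including replicas')


if __name__ == '__main__':
    main()
```

This is presumably the same underlying per-node indexing.index_total stat that feeds the elasticsearch_indices_indexing_index_total metric discussed at 14:34, just aggregated by the cluster stats endpoint.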
[14:56:04] I have to shut down a bunch of CODFW servers, so things will probably start screaming
[14:56:18] dcausse: checking, but I think so
[14:57:23] ah, then we're back to the original problem defined in T309648
[14:57:23] T309648: Restore lost index in cloudelastic - https://phabricator.wikimedia.org/T309648
[14:57:26] dcausse: concurrency is set to 150
[14:57:31] nope, wrong ticket
[14:57:56] T300914
[14:57:57] T300914: cpjobqueue not achieving configured concurrency - https://phabricator.wikimedia.org/T300914
[14:58:00] so in theory it should allow 150 requests to be in-flight at a time
[14:58:38] so it's just luck: if this consumer gets assigned to the right pod, you might be able to do it
[14:59:18] yea, that was my understanding of the result of the previous ticket. If the process happens to be idle enough due to the random selection of topic-partitions it might get to 150, but otherwise it gets whatever it gets
[15:00:12] I think I'd be willing to try changing the partitioner to random partitioning, but maybe should think about it more
[15:00:32] at a minimum, it would tell us if spreading the work to multiple pods increases the rate, or if cloudelastic is throttling
[15:00:33] but then you need to partition
[15:00:46] the topic is already partitioned, but right now it's partitioned by cirrussearch_cluster
[15:00:55] and that happens inside cpjobqueue as well
[15:01:46] so we give it the partition key? the cluster
[15:01:49] I suppose we don't actually get a guarantee the work goes to separate pods, but it seems probable
[15:02:36] dcausse: yes, the deployment-charts repo helmfile.d/services/changeprop-jobqueue/values.yaml says to use cirrussearch_cluster as the partitioner, and charts/changeprop/templates/_jobqueue.yaml defines cirrussearch_cluster as mapping the params.cluster value into a map of {cloudelastic: 0, codfw: 1, eqiad: 2}
[15:02:49] I'm not entirely sure how yet, but I suspect we can have that distribute to the 3 partitions randomly instead
[15:03:41] so that would mean that cirrus retries on the prod cluster will get mixed up with cloudelastic?
[15:04:09] dcausse: oh! hmm, yea it is used for the regular cluster still too. forgot about that :S that's no good then
[15:05:16] I don't have strong objections to trying this, but it would be better on an empty backlog I think
[15:05:49] if it's going to block the regular cluster writes behind the cloudelastic lag, that's probably no good. Maybe we could have multiple cloudelastic partitions in the same topic or some such
[15:06:22] cluster + mod(page_id) or the like
[15:06:29] yea, something like that
[15:07:57] I suppose what I would like out of it is to get something more definitive pointing to either cpjobqueue or cloudelastic as the place that's too slow. But with cloudelastic not filling its write thread pool, it seems likely to be cpjobqueue
[15:08:26] which suggests hacking something to make cpjobqueue engage more pods on the same topic
[15:08:49] would be nice if we could have free pods with manual assignment just to double check
[15:11:11] hugh suggested that would require almost re-architecting cpjobqueue for that.
Really it seems the design of cpjobqueue is for lots and lots of small queues that have low rates; high rates just aren't supported well
[15:11:24] (which, to be fair, we have: lots and lots of small queues with low rates)
[15:11:28] just not in search :P
[15:13:13] for ideas, I think we can add extra partitions to the topic, maybe expand from 3 to 6, and do something hacky like have a second parameter that mimics params.cluster but does cloudelastic-0, cloudelastic-1, cloudelastic-2, along with adding those values to the cirrussearch_cluster partitioning map in cpjobqueue. Bit hacky, but not sure yet on other options
[15:14:05] doesn't do anything for jobs already in the queue though
[15:14:26] * ebernhardson realizes he has three copies of change-propagation locally, guess I keep re-cloning it to new places
[15:14:55] ryankemper, gehel: we are all of a sudden under pressure to shut off the following elastic hosts ASAP due to the PDU maintenance linked above. They are all in the same row, can we do this all at once or should we wait for the cluster to go green? https://phabricator.wikimedia.org/P32168
[15:15:35] ebernhardson: +1 for the hack
[15:15:47] inflatador: if the cluster isn't green, there is a chance of losing data
[15:16:05] a new field like "cpjobqueue_partition" or the like
[15:16:36] gehel: we should always have data in 3 rows; as long as we have a primary and a replica for everything alive it should be "ok" but risky?
[15:16:37] you should check that all non-green indices have at least 1 replica of each shard that isn't in that row. If that's the case, the risk is low.
[15:16:49] thinking the same thing it seems :)
[15:18:29] 51 unassigned shards, that seems like a high enough risk that we should wait a bit
[15:19:38] dcausse: ok, I'll work that out today, shouldn't be too hard
[15:20:31] dcausse: any idea what an appropriate number of new partitions is? 3 is arbitrary
[15:20:39] or total cloudelastic partitions, I mean
[15:21:15] I'd go with 3 for cloudelastic, I feel like it needs at least twice more, so 3 seems more than enough?
[15:21:21] ok, seems reasonable
[15:21:33] gehel, ebernhardson: cool, will check it out; if there is already a one-liner for checking shards against row LMK, otherwise I'll get to work ;)
[15:22:18] inflatador: sadly there isn't :( I poked the /_cat/shards?format=json api a little with jq but it's not that easy. Would probably need some custom python which will take an hour or more to write, by which time it won't matter :P
[15:23:03] I was thinking maybe a call to /_cat/indices filtering on health? https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-indices.html
[15:23:18] inflatador: you would need to use /_cat/shards, as the check has to be per-shard
[15:23:20] but then it still needs the row stuff
[15:23:31] boo
[15:23:56] So should we tell DC Ops we're just going to have to wait until the cluster goes green before we shut any more off?
[15:24:06] we can probably ignore apifeatureusage, so `curl -s localhost:9200/_cat/shards | grep UNASSIGNED | grep -v apifeatureusage` shows the relevant unassigned shards (there are a few initializing as well that I'm ignoring)
[15:25:06] Even with initializing shards, it's easy to see that there isn't the same shard unassigned twice
[15:25:51] I think that we have at least 3 copies of each shard, so there is a very good chance that we're good.
[15:26:18] gehel: hmm, yea that makes sense to me.
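Not the one-liner asked for at 15:21, but a rough sketch of the "custom python" mentioned at 15:22: it joins `/_cat/shards?format=json` against `/_cat/nodeattrs?format=json` and flags any shard whose started copies all sit in the row about to be powered down. The base URL is a placeholder and the allocation-awareness attribute name `row` is an assumption; adjust both for the real cluster.

```python
#!/usr/bin/env python3
"""Flag shards that would lose their last started copy if one row goes down."""
import json
import sys
import urllib.request

BASE = 'http://localhost:9200'   # placeholder; point at the cluster being checked
ROW_GOING_DOWN = sys.argv[1] if len(sys.argv) > 1 else 'B2'
ROW_ATTR = 'row'                 # assumed name of the allocation-awareness attribute


def cat(endpoint):
    """Fetch a _cat API as JSON."""
    with urllib.request.urlopen(f'{BASE}/_cat/{endpoint}?format=json') as resp:
        return json.load(resp)


def main():
    # node name -> row, taken from the node attributes used for allocation awareness
    node_row = {a['node']: a['value'] for a in cat('nodeattrs') if a['attr'] == ROW_ATTR}

    # (index, shard) -> number of STARTED copies outside the row being shut down;
    # initializing/relocating/unassigned copies are ignored, which errs on the safe side
    survivors = {}
    for shard in cat('shards'):
        if shard['state'] != 'STARTED':
            continue
        key = (shard['index'], shard['shard'])
        outside = node_row.get(shard['node']) != ROW_GOING_DOWN
        survivors[key] = survivors.get(key, 0) + (1 if outside else 0)

    at_risk = sorted(key for key, count in survivors.items() if count == 0)
    if at_risk:
        print(f'NOT SAFE: {len(at_risk)} shard(s) have no started copy outside row {ROW_GOING_DOWN}:')
        for index, shard in at_risk:
            print(f'  {index} shard {shard}')
    else:
        print(f'Every started shard has at least one started copy outside row {ROW_GOING_DOWN}')


if __name__ == '__main__':
    main()
```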
As long as no shard has two unassigned copies, we know we have 2 sets of the data and can lose one
[15:26:27] Still, I think it makes the most sense to wait for the cluster to go green before we start breaking things again
[15:26:49] there is still a chance that the 2 active replicas of a shard are all on the same row
[15:28:18] gehel: maybe we should ban the nodes we're about to shut down?
[15:28:30] That would prevent replicas from being copied to them now
[15:29:00] not sure that it would speed things up
[15:29:53] the issue is to be sure that we are in a safe state, not really about making sure we don't lose shards from those nodes
[15:30:22] Would that keep the same large shard from getting copied over and over again to hosts that are about to be shut off?
[15:30:27] 6 shards left unassigned or initializing
[15:31:16] but we don't care about them being copied over and then lost. We only care about having enough copies to not lose data.
[15:31:43] gehel: re "there is still a chance that the 2 active replicas of a shard are all on the same row", is that accurate? I should re-read some docs but I thought allocation awareness guarantees two copies of the same shard can't be in the same row
[15:32:12] no, it guarantees that not all copies of the data are in the same row.
[15:32:17] ahh
[15:32:27] I think it tries to spread them as much as possible, but I'm not sure
[15:33:02] only 1 commonswiki file recovering. We should really be good.
[15:46:42] FYI, looks like it's only hosts in A7 today, much smaller than I thought
[16:06:58] also there's another PDU maintenance tomorrow: https://phabricator.wikimedia.org/T310070 (just subscribed us)
[16:07:22] apparently these were announced on a mailing list that doesn't go to embedded SREs ;(
[16:28:05] \o
[16:28:10] will catch up on the above after a quick mtg with aisha
[16:32:40] gehel, ebernhardson: looks like apifeatureusage shards are the only ones that aren't relocating? Can anyone spot-check that for me and let me know if we can move ahead?
[16:36:31] inflatador: you mean initializing?
[16:37:27] dcausse: sorry, yes
[16:37:51] inflatador: quickly looking, I don't see anything stuck
[16:38:50] dcausse: thanks for looking, I'll keep watching
[16:45:40] ebernhardson: gehel: oh, I actually had the same misconception erik apparently had about allocation awareness
[16:45:52] here's a `_cluster/allocation/explain` message for the awareness decider btw that should make stuff more clear
[16:45:55] https://www.irccloud.com/pastebin/Xa1JGV38/
[16:47:32] still waking up, but isn't that saying it wants 1 per row and no more? i.e. it expects shard count per row to be <= 1
[16:49:11] FWIW, it does look like things are moving now
[16:50:52] hmm, we can't really use `pageId % foo` to have consistent partitioning; could sorta mostly do it if we drop the batching abilities from ElasticaWrite (I'm not sure how often that's actually used, probably rare)... hmm
[16:52:38] ryankemper: the docs aren't really clear, there is a note that "However if rack_two were to fail, taking down both of its nodes, Elasticsearch will still allocate the lost shard copies to nodes in rack_one"
[16:52:50] ryankemper: which suggests it's not really a guarantee, but a best-effort kind of thing
[16:53:00] ebernhardson: then a random number?
[16:53:04] ryankemper: I don't understand yet under which circumstances it does which.
docs from https://www.elastic.co/guide/en/elasticsearch/reference/6.8/allocation-awareness.html [16:53:25] dcausse: yea a random number is simple and can probably go that way, just doesn't have the guarantee that deletes stay behind writes [16:53:53] but we have read side protection against deletes and should be rare enough users wont notice if they get 19 results instead of 20 [16:53:56] true, I'm sure 99.9% of the jobs are for a single page [16:54:45] i suspect MassIndexJob and ForceSearchIndex are using batching though, and making those do 1 title at a time is probably painful [16:55:09] (but we don't use those ourselves, unless something terrible happens :) [16:58:07] problem is also that ElasticaWrite is "multi-purpose", inspecting its argument to know what it's doing might be painful [16:59:19] yea, we would have to resolve it everywhere that uses ElasticaWrite::build, although it's already pretty isolated to call sites in Updater and one call site in OtherIndex, everything goes through Updater::pushElasticaWriteJobs but it seems like setting params is best done from ElasticaWrite::build [16:59:43] although that also has a problem that while Updater::pushElasticaWriteJobs has access to configs and such, ElasticaWrite::build has to reach out into the global state [16:59:50] or pass more things along that don't seem to belong there :) [16:59:59] i dunno, its supposed to be a hack, maybe don't worry about it so much [17:01:16] i suppose it doesn't have to reach out for config, but i was going to stuff the count into a config variable that has an array like ['eqiad' => 1, 'codfw' => 1, 'cloudelastic' => 3] [17:08:33] hmm, i suppose a not terrible idea is replacing the `string $cluster` argument to ElasticaWrite::build with a `ClusterSettings $cluster` [17:10:57] sounds good, (if build is not called dynamically from the deserialized args) [17:12:32] doesn't look to be, when mw goes to run the job it invokes the constructor directly [17:15:41] lunch, ryan-kemper is watching the codfw hosts [17:15:51] Gave p.apaul the go-ahead to take down the 3 b2 elastic2* hosts. We're starting from green status so with the awareness we shouldn't be able to go below yellow [17:16:00] dinner [17:53:20] We're down to two masters in codfw for es clusters `9243` and `9443` while the b2 maintenance goes on [17:53:33] Shouldn't be too long till they're back up but just mentioning for general awareness' sake [18:15:04] * ebernhardson notes that somehow the Updater class in cirrus doesn't have any tests [18:15:39] probably a giant pain to start writing them :( will poke a bit thogh [18:26:40] back [18:27:17] will be about 2' late to sre pairing [18:28:53] closer to 5' actually [19:31:35] lunch [20:14:29] back [20:24:39] OK, I requested access to the sre Google Group for myself, ryan-kemper and g-ehel [20:39:28] hmm, yea it seems there is no hope writing simple tests for Updater...the space between the entry point (Update page X) and the output (jobs populated with a bunch of stuff) is too much and it reaches out into the services container regularly [21:12:00] eqiad restart is 35/36th of the way there [21:28:45] \o/ [21:29:21] annnd..finished [21:30:02] added myself and ryan-kemper on your T314078-related PRs! 
[21:30:05] T314078: Fix slow super_detect_noop code and monitor for future Elastic hangs - https://phabricator.wikimedia.org/T314078
[21:45:11] ebernhardson: you should be clear to start the eqiad mjolnir
[21:46:09] inflatador: ok, will do
[21:47:34] fyi, in case any of these are relevant to tickets you created/own: https://phabricator.wikimedia.org/T314431#8126147
[21:51:06] inflatador: the daemon is running now in eqiad; not sure how long until we are certain that the issue is indeed fixed and it has re-processed the files that previously got things stuck. I'll keep an eye on it for an hour or so
[21:51:40] currently it's processing through the prioritized updates (the hourlies) before it gets back to the bulk data
[21:52:03] Yay!
[22:10:26] based on what it's done in 20 minutes, it's going to need more than an hour or two to catch up on the imports since it was paused :P
[23:27:34] hmm, cindy still intermittently fails, some data didn't make it into elasticsearch :( not entirely sure why. Not sure what it would take to narrow those down
[23:30:24] * ebernhardson is also surprised to see that cindy, when started on a patch as soon as it's uploaded instead of whenever the 10 minute sleep before polling for patches is done, is faster than the normal CI tests :P
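To close out the partitioning thread from 15:13, 16:50 and 17:01, here is a toy sketch of the "cluster plus N sub-partitions" idea, written in Python purely for illustration; the real change would be PHP config in CirrusSearch plus the cirrussearch_cluster map in the cpjobqueue chart. Only the ['eqiad' => 1, 'codfw' => 1, 'cloudelastic' => 3] counts come from the discussion above; the function and names are hypothetical.

```python
"""Toy model of the 'cluster -> N partition keys' idea discussed above.

The per-cluster counts mirror the proposed config array
['eqiad' => 1, 'codfw' => 1, 'cloudelastic' => 3]; everything else is made up
for illustration and is not the actual CirrusSearch or cpjobqueue code.
"""
import random

PARTITION_COUNTS = {'eqiad': 1, 'codfw': 1, 'cloudelastic': 3}


def partition_key(cluster, page_id=None):
    """Pick the key an ElasticaWrite-style job would be partitioned on."""
    count = PARTITION_COUNTS.get(cluster, 1)
    if count == 1:
        return cluster  # unchanged behaviour for eqiad/codfw
    if page_id is not None:
        # consistent: a given page always lands on the same sub-partition, so a
        # delete can never overtake the write it is meant to follow
        return f'{cluster}-{page_id % count}'
    # random: simplest and spreads load, but loses per-page ordering (only an
    # issue for deletes, which the read side already guards against)
    return f'{cluster}-{random.randrange(count)}'


if __name__ == '__main__':
    print(partition_key('eqiad', page_id=12345))         # -> eqiad
    print(partition_key('cloudelastic', page_id=12345))  # -> stable cloudelastic-N
    print(partition_key('cloudelastic'))                 # -> random cloudelastic-N
```

The page_id variant keeps per-page ordering at the cost of the batching concerns raised at 16:50, while the random variant keeps batching and relies on the read-side delete protection mentioned at 16:53; either way the new cloudelastic-0/1/2 keys would also need entries in the cpjobqueue partitioning map, as noted at 15:13.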