[06:54:45] yes 0 slots available is what we expect, we allocate just what we need so none are free once the jobs are all running
[10:24:00] lunch
[12:52:34] greetings
[13:44:52] dropping off my son, back in ~15
[14:01:08] back
[14:01:21] o/
[14:02:35] meh... Run id manual__2022-07-11T00:00:00+00:00 already exists for dag id image_suggestions_manual
[14:12:06] dcausse: generally in that case you can clear the first task and downstream and have it re-run. But also you have to manually delete the folder out of hdfs (spark saveAsTextFile doesn't allow overwrites)
[14:12:23] ah I was wondering about that
[14:12:29] glad you're here just in time :)
[14:12:43] i did that yesterday though, after talking to cormac. It's needed again?
[14:12:55] it's failing because of _type: page :(
[14:13:25] data in hdfs seems fine but it's getting rejected by elastic
[14:13:33] hmm
[14:13:51] haven't double checked mjolnir logs but applying the update manually it fails
[14:13:59] ok, i see. because it's all (almost all) _doc now. thought i had previously checked that but i guess not
[14:14:17] the wikis they are interested in are _doc now
[14:14:29] enwiki is _doc
[14:14:32] yea codfw is done and eqiad is up to zhwiki, almost everything is _doc :)
[14:16:13] ok, so i see you patched the script and deployed with scap, so probably needs a `sudo -u analytics-search kerberos-run-command analytics-search hdfs dfs -rm -r -f /wmf/data/discovery/transfer_to_es/date=20220711/freq=manual/image_suggestions`
[14:17:21] ok running
[14:18:07] and then i usually clear from the ui, but could be done from `airflow clear ...` command in cli.
[14:18:46] ah, clearing re-runs it directly
[14:19:25] I was looking into passing a run_id via -r
[14:19:52] also curious why mjolnir didn't yell louder :( nothing showed up in the Result of Document Update graph under failed document updates
[14:21:51] it's failing with {"took":2,"errors":true,"items":[{"update":{"_index":"enwiki_content_1658328569","_type":"page","_id":"41081188","status":404,"error":{"type":"document_missing_exception","reason":"[page][41081188]: document missing","index_uuid":"wVbacI3OT3GcquyZDUJfcQ","shard":"8","index":"enwiki_content_1658328569"}}}]}
[14:22:03] perhaps document_missing_exception are ignored?
[14:22:34] oh, hmm. Yea that gets counted as 'document missing' instead of failed
[14:22:47] i kinda disagree with elastic there, but i see how they get that :P
[14:26:38] same :)
[14:27:32] i guess the other side of that problem is we expect some document_missing to happen (although not that many). but it didn't seem suspicious to have some number of them
[14:29:41] yes hard to tell what's the right threshold, tho here the graph reports 3k/sec
[14:31:34] looking back in time doc missing seems to have increased since jul 24
[14:32:19] it should have been increasing i suppose over the last 2 weeks as the clusters were reindexed
[14:32:52] curious to know why we get created docs; there are huge spikes every now and then
[14:33:02] glent
[14:33:26] ah right
[14:33:32] at least, if it's on its own unique spike separate from the normal inputs, and happens once a week, that should be glent. They go at a very high rate because they are tiny docs
[14:33:49] yes that's it most probably
[14:34:15] maybe we could include some sort of marker in the mjolnir metrics about what it was importing
[14:34:50] ah indeed would be nice
[14:59:16] I'll be 1' late for retro...
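A note on the 'document missing' vs 'failed' bucketing discussed above: the distinction comes straight out of the per-item error type in the bulk response, which is why a flood of document_missing_exception never shows up under failed document updates. A minimal sketch of that classification, assuming Jackson is on the classpath and using an abbreviated copy of the 14:21:51 response; this is an illustration, not the actual mjolnir/metrics code:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public final class BulkResponseTriage {
    public static void main(String[] args) throws Exception {
        // Abbreviated copy of the enwiki_content response pasted above (one item).
        String body = "{\"took\":2,\"errors\":true,\"items\":[{\"update\":{"
            + "\"_index\":\"enwiki_content_1658328569\",\"_id\":\"41081188\",\"status\":404,"
            + "\"error\":{\"type\":\"document_missing_exception\","
            + "\"reason\":\"[page][41081188]: document missing\"}}}]}";

        int ok = 0, missing = 0, failed = 0;
        for (JsonNode item : new ObjectMapper().readTree(body).path("items")) {
            // Each item is wrapped in its action name ("update", "index", ...).
            JsonNode action = item.elements().next();
            JsonNode error = action.path("error");
            if (error.isMissingNode()) {
                ok++;
            } else if ("document_missing_exception".equals(error.path("type").asText())) {
                missing++; // bucketed as "document missing", not as a failed update
            } else {
                failed++;  // this is the bucket the failure-rate graph/alert watches
            }
        }
        System.out.printf("ok=%d missing=%d failed=%d%n", ok, missing, failed);
    }
}
```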
[15:03:16] ejoseph, ryankemper: retro time: https://meet.google.com/eki-rafx-cxi
[16:00:01] hmm, so now the mjolnir update failure rate alert triggers. But the graphs show a total of 7 failures...
[16:04:19] hmm, errors from cirrus indexing as well. Looks like elastic is rejecting some bulk updates (queues full somewhere?)
[16:05:05] boooo
[16:05:37] yea 1056 is having trouble somehow, most write queues are 0 and it has 5k
[16:07:00] memory pressure on 1056 looks typical, it's not GC stalling the jvm
[16:08:50] ebernhardson LMK if you think we should reboot, that would apply the outstanding NIC firmware as well
[16:10:59] inflatador: hmm, i guess it wouldn't hurt. Not seeing any obvious reasons in our graphs for 1056 to be having trouble. memory is normal, disk i/o is normal. it doesn't really stand out on the per-node latency percentiles compared to the others
[16:12:15] i guess elastic1056 has the highest p99 and p95 latency right now, but only by a few ms compared to other servers that look healthy
[16:13:43] ebernhardson gotcha, will reboot shortly
[16:26:07] hmm, that only moved the growing backlog and queue rejections to 1072 :(
[16:26:13] OK, reboot is done, booo
[16:28:54] 1072 isn't hosting the index currently being reindexed, it also isn't hosting an apifeatureusage shard (sometimes apifeatureusage sends tons of data, but doesn't seem like it is now)
[16:29:00] i dunno :(
[16:29:35] it's odd, we've rarely had problems on the write side of things
[16:34:17] ebernhardson Is the "MjolnirUpdateFailureRateExceedesThreshold" alert related to this problem?
[16:34:48] \o
[16:35:02] hmm so if rebooting just moved the problem to 1072 then we can presumably rule out any sort of simple hw issue
[16:36:52] inflatador: possibly, but not sure. When i checked the mjolnir graphs there was almost nothing. What i noticed was this one: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&viewPanel=9&from=now-1h&to=now
[16:37:46] those errors all end up in the cirrussearch logging channel: https://logstash.wikimedia.org/goto/1a800f006db15ead4d9df027a0152bc4
[16:38:10] about 15k per 10 minutes since ~50 minutes ago
[16:38:51] those end up looking like: update: /wikidatawiki_content_1658521377/_doc/67012023 caused rejected execution of processing of [2138580248][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[wikidatawiki_content_1658521377][11]] containing [....
[16:39:26] although the volume increases in part because cirrus re-queues those writes to try again later, so we have more writes than normal trying to process (normal writes + retries)
[16:44:29] Do we need to get any other teams involved, or should we wait a little longer? Looks like it's been ~1 h since rejections started
[16:45:04] sadly i don't think anyone else has information about what this would be
[16:45:37] while poking logs noticed chi->omega cross cluster was broken (again, bug in elastic keeps re-breaking that). Fixed but probably completely unrelated to the rejections
[16:45:58] ah OK, I was thinking that it could potentially be caused by DB changes or something, just spitballing though
[16:47:00] Elastica writes directly to ES anyway, right?
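Side note on chasing the rejected-execution errors above: the `_cat/thread_pool/write` endpoint (linked again further down) shows per-node queue depth and rejection counts, which is how a single bad node like 1056 stands out while everyone else sits at 0. A rough sketch of scripting that check with the JDK HTTP client and Jackson; the URL matches the one used in this log, but the little tool itself is hypothetical:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class WriteQueueCheck {
    public static void main(String[] args) throws Exception {
        // format=json plus an explicit column list makes the _cat output scriptable.
        String url = "https://search.svc.eqiad.wmnet:9243/_cat/thread_pool/write"
            + "?format=json&h=node_name,queue,rejected";
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(url)).GET().build(),
            HttpResponse.BodyHandlers.ofString());

        // Flag any node whose write queue is building up or that has rejected work.
        for (JsonNode row : new ObjectMapper().readTree(resp.body())) {
            int queue = row.path("queue").asInt();
            int rejected = row.path("rejected").asInt();
            if (queue > 0 || rejected > 0) {
                System.out.printf("%s queue=%d rejected=%d%n",
                    row.path("node_name").asText(), queue, rejected);
            }
        }
    }
}
```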
[16:47:44] yea, Elastica is the client library CirrusSearch uses
[16:47:51] talks directly to elastic
[16:51:15] i suspect something is deadlocking writes, this doesn't seem to be progressing in index_total or index_time_in_millis: watch curl https://search.svc.eqiad.wmnet:9243/_nodes/elastic1072-production-search-eqiad/stats/indices/indexing?pretty
[16:52:00] running 1072 and 1071 side-by-side, numbers in 1071 go up and 1072 stays stuck
[16:52:47] maybe jstack has fun info...more poking :P
[16:52:49] pure spitballing again, but can there be circular dependencies happening? Like an update to one page triggers an update to another page that triggers an update to the first page?
[16:54:21] inflatador: don't think so, the problem looks more like the node isn't making and progress on indexing so the queue keeps building
[16:54:26] s/making and/making any/
[16:56:48] looks like 1056 and 1072 are trading different shards of commonswiki_file_1658783034 at the moment
[16:57:01] maybe we shut one or both of them off for awhile and see what happens?
[16:57:47] I'm not super bullish on that approach, I'd think it'd be more likely that it's an issue of a node not being able to handle its shards
[16:58:41] To that end I tried looking at what shards moved from 1056 to 1072, not really a smoking gun here unless `commonswiki_file_1658783034` is a particularly hot shard:
[16:58:43] https://www.irccloud.com/pastebin/8vRoffgz/
[16:59:19] i'm really not sure, i'm a bit curious if elastic1052 was in the same state where its `index_total` metric wasn't increasing. That metric being stuck seems like the smoking gun to me...but i'm not sure where the gun is :P
[17:00:16] We could prioritize bringing in the eqiad elastic expansion hosts that have been waiting, only thing is adding in new hosts when the cluster is struggling isn't always the best way to do it
[17:00:36] But who knows maybe it's just a simple net capacity issue
[17:00:49] when the cluster is already struggling*
[17:01:27] ohh, now that's suspicious. jstack shows all write threads on 1072 stuck inside our superdetectnoop script: sudo nsenter -t 68132 -m sudo -u elasticsearch jstack 68132 | grep -A 7 '\[write]'
[17:01:47] whereas a healthy server has a few different traces for the write queues
[17:02:12] * inflatador adds jstack cmd to notes
[17:02:17] hmm, those should primarily be coming from the mjolnir daemons. Maybe stop the daemon and restart 1072?
[17:02:22] inflatador: https://wikitech.wikimedia.org/wiki/Search#Using_jstack_or_jmap_or_other_similar_tools_to_view_logs
[17:02:31] it was a pain to figure that out, required some mo.ritz magic :)
[17:02:38] (and that's why it's documented)
[17:03:00] i wonder if the daemon is still writing ... checking
[17:03:08] everything in this universe either requires some mo.ritz magic or some vo.lans magic IMO :P
[17:03:14] yeah I'd be in favor of stopping mjolnir and seeing what happens
[17:04:54] i wonder if it's just a terrible assumption. It's in a containsAll() call where i wrote in the comments: // Expecting n <= 10, a fancier comparison doesn't seem justified
[17:06:57] it's comparing two lists, probably has a worst case O(n^2) complexity
[17:10:19] ryankemper: go ahead and stop the daemons and restart the stuck host, i'm trying to figure out if there are other places that send these multilist updates maybe from cirrus or somewhere else
[17:11:14] ebernhardson: ack. can you refresh me on where the mlr daemons live these days? I think I remember them being on VMs somewhere?
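To make the jstack finding concrete: List.containsAll() rescans the whole receiving list for every element of the argument, so a comparison written for n <= 10 turns into roughly n*m equality checks on a large update, and the write thread just sits there churning. A standalone sketch of the difference (not the actual superdetectnoop code); expect the List.containsAll() half to run for a very long time at this size:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class ContainsAllDemo {
    public static void main(String[] args) {
        int n = 100_000; // roughly the scale seen in the problem update
        List<String> stored = new ArrayList<>();
        List<String> update = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            stored.add("value-" + i);
            update.add("value-" + (n - 1 - i)); // same elements, different order
        }

        // Set-based check: build a hash set once, then O(1) expected lookups per element.
        long t0 = System.nanoTime();
        Set<String> storedSet = new HashSet<>(stored);
        boolean sameViaSet = storedSet.size() == update.size() && storedSet.containsAll(update);
        System.out.printf("HashSet comparison: %b in %d ms%n",
                sameViaSet, (System.nanoTime() - t0) / 1_000_000);

        // List-based check: one linear scan per element, ~n*m comparisons in total.
        // Fine for n <= 10; pathological at this size (this is the slow path).
        long t1 = System.nanoTime();
        boolean sameViaList = stored.containsAll(update);
        System.out.printf("List.containsAll(): %b in %d ms%n",
                sameViaList, (System.nanoTime() - t1) / 1_000_000);
    }
}
```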
[17:11:16] * ryankemper checks docs
[17:11:23] ryankemper: search-loader[12]001
[17:11:31] ty
[17:13:09] it seems reasonably likely that this is related to the mjolnir daemons and the update we shipped from image recommendations earlier though, the timing seems reasonably close. But then i'm not sure how the second host picked up the same errors, it should have been done shipping long before we restarted 1052 and i don't think elastic will retry internally once it's been queued on an
[17:13:11] individual host
[17:14:03] oh! it's totally the mjolnir daemons ... they haven't managed to process the queue from earlier like normal
[17:14:14] (well, probably :P)
[17:14:15] Not sure if it matters, but I restarted 1056, not 1052
[17:14:50] oh, i totally mean 1056 and not 1052. getting simple numbers mixed up :(
[17:14:56] ebernhardson: we've got the mjolnir `bulk` daemons and `msearch` daemons, is it just the bulk one that needs to be halted or all of them?
[17:15:01] ryankemper: the bulk ones
[17:15:08] ack, proceeding
[17:15:52] oh also this timing is very coincidental:
[17:15:54] https://www.irccloud.com/pastebin/KvgIkORi/
[17:15:56] i guess it's only maybe the bulk daemons triggering the issue ... the actual image recommendations fully shipped, but its recent attempts to ship data are stuck
[17:16:16] unit running since 1h22min ago, implying it got restarted or something
[17:16:42] seems reasonable, that daemon is designed as a "blow up when things go wrong and restart" kind of service, instead of implementing a bunch of failure case handling. Assuming that since we didn't ack anything from kafka it will re-read the last message and try again
[17:16:53] Whereas the codfw one has been running for 5 days
[17:18:35] i'm not sure how the problem would only surface on eqiad :S
[17:18:56] ebernhardson: btw do I need to halt the codfw one too or just the eqiad one
[17:19:04] ryankemper: let's stop codfw to be safe, but only eqiad is having issues
[17:19:11] kk
[17:20:59] search-loader eh, that's a new one on me
[17:24:11] Alright done restarting `elastic1072`, let's see if things improved
[17:24:35] I just restarted the services rather than rebooting the full host
[17:24:51] seems sane
[17:25:35] i suppose the hope is that nothing starts building up a queue again here: https://search.svc.eqiad.wmnet:9243/_cat/thread_pool/write?v
[17:27:08] inflatador: search-loader is the spot where we load data from the hadoop clusters into elasticsearch. Mostly it receives messages over kafka about new files available in swift and then pipes them into the elasticsearch bulk indexing api
[17:27:44] well, the bulk daemon at least. There is a second msearch daemon there that is much more insane...but we can get into that another time :P
[17:28:05] ebernhardson yeah, reading about it now on the Search page
[17:28:26] the cirrus failures graph seems to have gone back to normal since 17:23 or so
[17:28:31] should probably play "how much does Brian know 'bout the Search pipeline" on a regular basis
[17:29:51] hmm, so properly fixing this is probably going to require a new version of our search-extra plugin to not deadlock inside that comparison. Probably will need to figure out what in the dataset exactly triggers this issue
[17:29:56] Well it's definitely feeling like we found the culprit
[17:30:15] will need to run the cirrus catchup routine so it re-sends the last few hours of mediawiki updates
[17:30:49] i don't think we've changed anything related, this should still work: https://wikitech.wikimedia.org/wiki/Search#Recovering_from_an_Elasticsearch_outage/interruption_in_updates
[17:33:24] Alright so let's do a timerange like `2022-07-28T15:09:00Z` -> `2022-07-28T17:25:00Z`
[17:33:39] ebernhardson is it necessary to limit the command to the indices found in logstash? Or just use dates?
[17:34:44] inflatador: probably all wikis, we were rejecting writes from cirrus->elastic due to queues full so pretty much any wiki could be impacted
[17:34:51] inflatador: I would assume that every operation failed, therefore we want to do it on all wikis
[17:35:21] but also regardless this is a small enough window that we don't need to worry about using too blunt of an instrument
[17:35:31] how's this look?
[17:35:34] https://www.irccloud.com/pastebin/giktUksX/
[17:35:39] on the upside, the actual changes to the search indices are relatively small, most readers won't notice at all. We might annoy a number of editors though if they were using search as a task-queue (somewhat common)
[17:36:44] is that run from the mwdeploy server?
[17:36:53] actually I'll change the ranges to `outage $wiki '2022-07-28T15:07:00Z' '2022-07-28T17:25:00Z'` just to account for a 60s delay in the grafana metrics plus an extra minute because why not
[17:37:02] ryankemper: seems reasonable, lines up with the cirrus-failures graph
[17:37:24] well actually my 15:09 already had a 120s buffer but yeah why not make it 240
[17:37:30] i often use round numbers like :00 or :30 or whatever
[17:37:31] inflatador: generally mwscript is run from mwmaint
[17:37:42] Yeah I like the idea of making it round
[17:37:44] ok
[17:38:14] Okay I'll get this started on mwmaint in a tmux, one sec
[17:44:29] lol did some weird magic to the tmux session where I'm not really quite sure what button I pressed
[17:44:49] but the actual operation is still chugging along fine, just can't really view it properly within the session
[17:45:22] need to figure out in the future how to move faster from unknown to something found, we had an idea something was wrong at 16:00 utc, but didn't run jstack until 50 minutes later. Maybe that's not terrible, i dunno
[17:46:13] the jstack idea came from random googling, elastic suggests that write queues growing could be a deadlock
[17:46:29] How long did it take us to realize that indexing was totally stopped?
[17:46:38] I think that was one of the key pieces that made things start making sense
[17:47:08] first errors 15:12-15:23, but then it quieted until 15:44 when it started up and continued to fail. So about 15 minutes (while we were in retro)
[17:47:19] not terrible
[17:47:20] I was looking at the queues and seeing stuff build up but didn't realize that it wasn't getting any work done until you figured that out
[17:47:43] same, i was thinking the queues were growing but also progressing until i thought to look at the node stats
[17:48:29] also not all writes stopped, plausibly only writes where the stuck host was a primary
[17:49:40] Rebooting the initial host was good because it pretty quickly made it clear that it wasn't a host-level (hw etc) problem
[17:49:49] an hour and a half of errors really isn't the end of the world, it's probably a reasonable response
[17:49:59] Come to think of it I still don't quite understand why only one host was visibly affected at a time
[17:50:14] ebernhardson: is the bulk daemon just choosing an elastic host to talk to and so us rebooting the initially failing host made it move to another?
[17:51:21] Would it help to alert on hosts that aren't processing their individual queues, or are the current alerts we have for mjolnir OK?
[17:52:11] ryankemper: it's a weird one...it uses the python elasticsearch client which starts by talking to the lvs endpoint, but then sources a nodelist and talks to hosts directly. But the actual failures shouldn't matter which host it talks to, the writes go to the bulk indexing api which reads the header for each write, something like {"index": "foo", "id": "12345"} and then it forwards it to
[17:52:13] whichever primary (and maybe replicas, not sure) owns that id for that index
[17:52:56] unfortunately i couldn't figure out from the _tasks api which writes these were, it has both the 6 writes that were stuck and a bunch of tasks for queue'd writes as well
[17:54:12] inflatador: quite possibly, although i wonder about false positives. We could probably at least set up some monitoring of this value (i suspect it's not currently written to prometheus, but would have to check) and see if it ever flat-lines in normal usage. I suspect even if it does flatline for a few minutes, it should never be flat for 30+ minutes
[17:55:26] https://usercontent.irccloud-cdn.com/file/tPeKNfCk/Screen%20Shot%202022-07-28%20at%2010.53.27%20AM.png
[17:55:32] as an aside does anyone know what tmux dark magic i did here
[17:56:00] it looks like a split pane but from `tmux list-panes -s` I see `0.0: [229x50] [history 1870/2000, 679715 bytes] %0 (active)` which I think means it's not a pane
[17:56:22] But yeah only half the screen is being used and it won't auto-update (like I have to manually scroll down to see the more recent output)
[17:56:43] ryankemper: hmm, i would randomly press esc or q but not sure what does that specifically. I feel like i have seen it at some point
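On the 'monitor this value and see if it ever flat-lines' idea above: the same node indexing stats used in the earlier watch curl can be polled and compared between samples. A hedged sketch using only the JDK HTTP client plus Jackson; the node name, sample interval, and sample count are made-up values for illustration, not an existing check:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class StuckIndexingCheck {
    public static void main(String[] args) throws Exception {
        // Same URL shape as the `watch curl` check above; node name is illustrative.
        String url = "https://search.svc.eqiad.wmnet:9243/_nodes/"
            + "elastic1072-production-search-eqiad/stats/indices/indexing";
        HttpClient client = HttpClient.newHttpClient();
        ObjectMapper mapper = new ObjectMapper();

        long previous = -1;
        for (int i = 0; i < 10; i++) {             // ten samples, 30s apart
            HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            JsonNode nodes = mapper.readTree(resp.body()).path("nodes");
            long total = 0;
            for (JsonNode node : nodes) {          // one entry per matched node id
                total += node.path("indices").path("indexing").path("index_total").asLong();
            }
            if (previous >= 0 && total == previous) {
                System.out.println("index_total flat since last sample: writes may be stuck");
            }
            previous = total;
            Thread.sleep(30_000);
        }
    }
}
```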
[17:56:46] ebernhardson we can check it out if you think it's worthwhile, or we could maybe just add more verbiage to the current mjolnir alert if this won't happen outside the mjolnir context
[17:57:15] ebernhardson: ah hah, double tapping escape did it
[17:57:15] ryankemper Not offhand but I have done it before, I think prefix + d will get out of it
[17:57:26] prefix + d is just to detach
[17:57:35] but when I'd reattach it'd still be the same weird view
[17:57:50] but yeah the double escape did it
[17:58:08] almost done with the operation btw, we're at `wikidatawiki`
[17:58:54] inflatador: probably better at a higher level than mjolnir, this particular deadlock happened to be something that in normal usage only mjolnir daemons will trigger (they are the only ones using the MultiListHandler that deadlocked), but i imagine we would rather alert on writes-are-deadlocked rather than a specific kind of deadlock we've previously seen
[18:06:11] hmm, so we do have that metric in prometheus but it doesn't look useful: increase(elasticsearch_indices_indexing_index_total[5m]) < 100
[18:08:46] maybe some query could be figured out that would isolate them though
[18:10:09] oh, actually it sorta does. this returns only 1056 and 1072 over the time period: increase(elasticsearch_indices_indexing_index_total{exported_cluster="production-search-eqiad", index=""}[30m]) < 500
[18:10:39] (it was the index="" that fixed it, there are both some per-index stats and overall-node stats in the same metric)
[18:11:56] but it only works for the big cluster, the small clusters have periods of no-activity
[18:14:11] i suppose the alternate lesson here, a comment like 'n is always <= 10, nothing fancier required' should include a condition that checks for n > 100 or some such and blows up instead of deadlocking
[18:17:20] very good point on the checking for n>100
[18:17:31] that's something I should do more of when I write code too
[18:17:41] the benefit of hindsight and all that
[18:17:52] that metric seems good, a bit unfortunate it only works on the main clusters but that shouldn't be too much of a problem
[18:18:10] or maybe we can do some magic to also compare against the total rate of work the cluster normally does or something, but that's beyond me at the moment
[18:19:05] in unrelated news, I'm on my third pair of glasses after my first broke and my second gave me a raging headache...hopefully #3 plus some NSAIDs will help
[18:21:05] double fun! :(
[18:23:54] rough
[18:23:56] extremely half-baked phab ticket before we forget, feel free to add details or we can work on them at the SRE session https://phabricator.wikimedia.org/T314078
[18:37:25] ebernhardson: would you have a moment to jump in https://meet.google.com/eki-rafx-cxi ?
[18:37:35] yea, sec
[18:40:40] is this a deadlock or just multilist being too slow?
[18:41:51] dcausse I believe gehel pointed out it's technically not a deadlock, we're in ^^ if you need more details
[18:45:22] increase(elasticsearch_indices_indexing_index_total{exported_cluster="production-search-eqiad", index=""}[30m]) < 500
[20:13:37] huh, 10k elements wasn't enough to make a test case take forever. but 100k did the trick. I'm surprised if there were > 10k elements though
[20:14:08] replacement runs in 106ms instead of >30s (i didn't bother waiting for 100k elements to finish)
[20:14:41] will certainly have to find the live data that has the problem as well
[20:15:18] main thing i'm not finding though is how to make junit / assertj run a particular bit of code and verify it doesn't run "forever"
[20:21:36] oh yeah, that sounds like it would be a problem in any language, but I'm no dev
[20:21:51] * ebernhardson wonders if more things should be renamed now that this is internally a Map<String, Set<String>>. The name 'MultiList' seems a bit wrong :P
[20:22:39] in terms of implementation, i would guess that it runs two threads, with one thread watching the timeout and the other thread doing the work, with the ability for the watching thread to kill the working thread.
[20:23:01] ahh, it wanted @Test(timeout = )
[20:53:57] hi, reminder to fill in stand up notes if you haven't already
[20:53:59] ok patch is up, test case verifies it completes in a reasonable time frame. Probably want a java-ish person to review it before we merge and ship a new plugin version though
[20:58:42] Nice, would gehel happen to be around for the code review?
[20:58:54] I have to clean up after my pet but should be ready to jump back on in ~5-10
[21:09:24] I'll be there in 5'
[21:14:35] ebernhardson: did you see https://guava.dev/releases/snapshot-jre/api/docs/com/google/common/collect/Multiset.html ?
[21:15:01] gehel: nope, reading
[21:15:14] and maybe https://guava.dev/releases/snapshot-jre/api/docs/com/google/common/collect/Multisets.html
[21:15:22] I just skimmed, not sure if it applies
[21:15:42] gehel: so you're saying the naming here no longer makes sense, because this is called multiset but it doesn't have multiset semantics
[21:16:31] no, I'm wondering if the Guava Multiset would have the required semantic for this use case
[21:16:50] in the linked multiset, we have a pair, (value, count), for each element. But in my multiset it's a Map<String, Set<String>>
[21:17:47] gehel: no i don't think so, it's counting the repeated occurrences of a value, we are fine with multiple occurrences being collapsed into a single value, the multi we care about here is that there are multiple sets encoded into a single list
[21:18:19] my bad, it's actually a MultiMap https://guava.dev/releases/snapshot-jre/api/docs/com/google/common/collect/Multimap.html
[21:21:22] gehel: reading, but i think it's also a different kind of semantic. The main thing we do is replace the inner Set in its entirety, then check the K of the updates and replace the matching stored Set with the same K with the new set
[21:22:00] yeah, I'm reading the MultiListHandler...
[21:22:21] like we have {a: [1,2,3], b: [1,2,3], c: [1,2,3]} and the update comes in for {b: [3,4,5]}, so we replace b in the input
[21:23:18] the other main thing it provides is a serialization from MultiSet to List and back
[21:23:23] which would be a removeAll(b), putAll(b, [3,4,5]) with a multimap
[21:23:27] maybe serialization is the wrong description
[21:23:30] but still need to do the diff
[21:23:42] and still need the parsing outside
[21:27:03] i suspect using the MultiMap would save little if any code from the implementation, we wouldn't be using the views which the docs seem to describe as a primary motivator of MultiMap
[21:31:31] I'm checking the Set.equals() implementation
[21:31:40] Not sure that it is O(log(n))
[21:33:03] hmm, you might be right. I was still thinking about sorting. HashSet.contains should be O(1)? giving O(n) for containsAll()
[21:33:29] i suppose there is also no guarantee here that Collectors.toSet() used a HashSet
[21:33:49] it's O(n log n)
[21:34:13] if I'm not confused by dynamic dispatch
[21:37:36] looks good enough to me
[21:37:57] a few minor nits, but feel free to ignore them, or address and merge
[21:38:45] i'm never sure when to static import things :P
[21:41:14] ping mr. inflatador when you're ready to build the pkg
[21:46:46] running the test suite locally before shipping to gerrit, but will be ready in a sec. I suppose next up i need to find the large update and see how big it actually is
[21:52:26] and the largest update has (drumroll): 438,444 elements
[21:52:42] just a little more than the 10 i wrote it for :P
[21:54:05] oof
[21:54:29] but i guess if i want to know what that update is i need something more obvious than this short jq thing though...
[21:54:53] d-causse mentioned we might want to be a little more strict about what we accept at some point, no idea what that entails though
[21:55:13] yea, i kinda doubt there is any intention to send 400k elements, it's probably also a bug in other parts
[21:55:57] but as far as elastic is concerned, i don't think it will care much if there are a few long arrays as long as it's not all of them. It deals with large volumes of tokens reasonably well
[21:57:54] * ebernhardson stops trying to write ever more complex bash pipelines that pipe from zcat into awk into jq and writes a little python instead :P
[21:58:05] inflatador: https://gerrit.wikimedia.org/r/c/search/extra/+/818223 should be good for a merge and plugin build
[21:58:35] :eyes
[22:00:07] ebernhardson does this repo require manual merge?
[22:02:40] nm, merged it regardless
[22:03:09] inflatador: it should get merged by jenkins, manual merges outside of operations land are usually only for repos without any CI at all
[22:09:36] so at the risk of sounding obvious, search/extra is a different repo than elasticsearch/plugins . Do I need to change anything in the plugins repo besides the build numbers? plugin urls maybe?
[22:10:30] inflatador: we have to build and release the plugin :) i'm not exactly sure how to do that either, but let's look around, there is probably some info
[22:10:57] i suspect we never set up a jenkins job to do this for us like with rdf, but maybe there is
[22:12:00] ebernhardson I know how to build it (more or less) using https://phabricator.wikimedia.org/P19522
[22:12:34] inflatador: that's for the plugins .deb, but first we have to build and release the search-extra plugin so it can be updated in there
[22:12:53] aahhhhhh, now that I don't think I know how to do
[22:13:27] looks like the last release of search-extra was done by emmanuel, whereas the rdf plugin is built by `maven-release-user@wikimedia.org` so almost certainly we have to build it ourselves. I'll do that since i imagine you don't have archiva (jar repository) access all setup yet
[22:14:06] it mostly amounts to `./mvnw release:prepare && ./mvnw release:somethingelse`. I'll have to check what exactly the rdf plugin ci is doing and copy that
[22:15:26] OK, I see there's a link to maven and a checksum we need: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/plugin_urls.lst#8
[22:15:55] release:prepare followed by release:perform apparently. Running now
[22:17:57] we should probably make CI build this one for us as well, not sure if there is a particular reason we didn't set it up, or if it's simply that this repo is much older than the rdf so we never got around to it
[22:18:17] it'll take a few minutes, doing a release re-runs all the tests and such
[22:19:43] ACK
[22:26:56] release completed, now making a patch for the operations/software/elasticsearch/plugins repository
[22:27:49] inflatador: also ge.hel was right, we should use a better name than deadlock. I got the initial idea of where to look from some elasticsearch stuff that was talking about deadlocks, but this wasn't really a deadlock. It's more like 'code that is supposed to take 100ms that instead runs for 30+ minutes'. I dunno if that has a proper name :P
[22:27:58] I can fix that
[22:28:31] also ebernhardson don't forget to increment the version number in changelog and rules! Been bitten by that one too many times ;)
[22:29:07] lol, i will almost certainly forget :) Now i'm wondering though, the existing link in here is to repo1.maven.org, but i think release:perform only uploaded to archiva. Maybe i'm missing a step somewhere...
[22:29:47] can point it to archiva for now i guess, i think i did setup so i have access to upload things to maven.org but i don't remember how...
[22:31:16] oh, sigh...of course i built the 7.10 version of the plugin, because we already upgraded past 6.8. hmm
[22:31:53] i suppose create a 6.8 branch off the old point, cherry pick this patch onto it
[22:32:33] ebernhardson I thought we already snapped back to 6.8 because we had to fix another plugin
[22:32:57] hmm, maybe there's already a branch then. looking
[22:34:42] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/85b5d124e752d8e6675fcde6e96d875c9d5319cb
[22:35:35] inflatador: oh, good at least we don't have to do it there too :) I had only gotten as far as the search-extra plugin part (i released search-extra-7.10.2-wmf2). I've created a new 6.8 branch in gerrit from the latest search-extra release and cherry-picked the new patch onto it
[22:36:43] you can just ignore me...keep getting confused between search-extra and plugins repos
[22:38:00] it's easy enough, there are a bunch of things with almost the same name, and i'm usually not specific enough
[22:43:42] alright merged in 6.8, doing a release this time for 6.8.23
[22:52:12] cool, ping me when you're ready with the elasticsearch/plugins repo PR. Or I can do it myself if I can get the new link/checksum for the extras plugin
[22:55:01] hey look, that new bit that checks if the plugin already exists on apt.wikimedia.org works :) Even after you reminded me to update the debian/rules with the new version ... almost ready
[22:57:46] hmm, no it doesn't work :S I'm sure i checked before that this Makefile rule passed when it doesn't exist, but `debian/rules verify_unreleased` is failing even though curl is returning 404 as expected
[22:58:47] apparently `curl --head ` exits with 0 even on 404
[22:59:06] (╯°□°)╯︵ ┻━┻
[23:00:05] i guess we need `curl --fail --head ` . what would we do without stack overflow :P
[23:05:52] oh, we never actually merged that and i didn't notice i was doing my patch on top of an unreleased patch ... ok i think i finally have a working patch. I must not do this often enough...so many minor mistakes
[23:06:19] inflatador: https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/818241
[23:09:15] ebernhardson do we care that "debian-glue-non-voting" failed?
[23:09:21] hmm
[23:10:42] inflatador: i don't think so, the failure message seems to be about something uninteresting, and the last merged patch gets a 2s fail from the same thing
[23:11:01] last merged patch (ci logs no longer available, so can't verify it's exactly the same): https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/804003
[23:12:10] ebernhardson ACK, merged. I'll start building the plugin now. ryankemper did we decide whether or not to start the rolling-operation immediately or wait until tomorrow? I'm OK either way
[23:12:49] i don't think this is urgent enough for y'all to lose an evening over it. We can do it tomorrow
[23:13:13] although the plugin that has the bad behaviour is still deployed, as long as mjolnir stays paused i don't think anything else can trigger it
[23:14:04] ebernhardson cool, happy to wait if the situation is not urgent
[23:25:39] OK, I updated the ticket, but the new package is deployed on apt1001. We'll install/restart elastic tomorrow
[23:25:48] Thanks everyone for your help!
[23:25:52] inflatador: ebernhardson: agreed, let's wait till tomorrow. puppet's disabled on the search-loader instances so there shouldn't be any risk of them turning back on
[23:26:24] ryankemper ACK, see you tomorrow!
[23:26:33] \o
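As a closing footnote on the '@Test(timeout = )' detail and the patch whose 'test case verifies it completes in a reasonable time frame': JUnit 4 runs a timed test on a separate thread and fails it once the timeout is exceeded, which is how a regression back to the slow comparison would show up as a failed build rather than a hung one. The sketch below is illustrative only; the class name, method name, and data shape are assumptions, not the actual search-extra test:

```java
import static org.junit.Assert.assertFalse;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import org.junit.Test;

public class LargeUpdateTimeoutTest {
    // JUnit 4 fails the test if the body takes longer than 10 seconds, so a regression
    // back to an O(n*m) comparison surfaces as a test failure instead of a hung build.
    @Test(timeout = 10_000)
    public void largeUpdateCompletesQuickly() {
        List<String> stored = new ArrayList<>();
        List<String> update = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {  // the scale that made the slow path take >30s
            stored.add("prefix/" + i);
            update.add("prefix/" + (i + 1)); // overlapping but not identical values
        }
        // Stand-in for the fixed comparison: set-based instead of List.containsAll().
        assertFalse(new HashSet<>(stored).equals(new HashSet<>(update)));
    }
}
```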