[00:05:03] Since the cluster allocation explain seems to indicate that for each host there's either already (1) a copy of that shard in that same row, or (2) already 2 total `enwiki_content_1658309446` shards on the host, I'm looking closer at which rows the shards are scheduled to
[00:05:18] https://www.irccloud.com/pastebin/lqBKkGvc/
[00:06:22] So shard #15 is scheduled to 2050, 2027, 2044, which correspond to rows D, A, B respectively, so there would need to be space on a row C host
[00:13:23] Seems like we might need to just loosen `index.routing.allocation.total_shards_per_node` from 2 to 3
[00:18:35] Side note: I need to figure out the jq command to parse the response of the following command, selecting only entries whose `.node_attributes.row == "C"`
[00:20:52] Ah, I figured it out. Here's the command whose output I was having trouble with, for posterity:
[00:20:59] https://www.irccloud.com/pastebin/1NycL5nD/
[00:22:44] Here's the command I was looking for:
[00:22:47] https://www.irccloud.com/pastebin/KhR7ImX2/
[00:23:37] https://www.irccloud.com/pastebin/7kFQfSSE/all_row_c_hosts_already_have_two_shards.log
[00:23:58] So yeah, that confirms what I thought: it's getting stuck on the `index.routing.allocation.total_shards_per_node=2` requirement
[00:46:14] Ah, elastic2059 was out of the cluster (reimage cookbook waiting for user input) and it's in row C, which is why the cluster was having so much trouble
[00:46:47] I'm gonna bring in some more elastic hosts so we can handle that better
[06:53:46] Will bring in those hosts tomorrow. Initial patch for the first step of that: https://gerrit.wikimedia.org/r/c/operations/puppet/+/815823
[09:47:18] Lunch
[10:15:00] lunch
[12:44:08] dcausse: I was playing with event-utilities code, trying to use streams and collectors instead of a while loop. I now have one test failing, but I don't understand why. https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/815260
[12:44:21] looking
[12:44:25] If you have a few minutes to see if I'm missing something obvious
[12:44:41] no emergency at all! This is more part of my 10% time than anything else.
[12:45:04] I see: The file /src/pom.xml is not sorted :)
[12:45:13] is there something else?
[12:46:15] oops, let me fix that. Yes, there is a failing test as well
[12:46:33] ah sorry, just saw the comment message
[12:47:29] * gehel just pushed the sorted pom
[12:53:21] weird... I don't see the problem just looking in Gerrit, I'll pull it into IntelliJ and see
[12:54:18] now using parameterized tests makes it trivial to re-run only this failing test in IntelliJ.
[12:54:32] * gehel is quite happy that dcausse has not spotted the issue in 30 seconds.
[12:55:18] I suspected something about a strict ordering being expected, and now that ordering changed, but I don't see how that would happen.
[13:00:25] it was a map previously, so if that's the case the production code is extremely fragile
[13:05:06] My expectation was that the matcher was too strict, but those matchers are a bit convoluted
[13:10:28] ryankemper / inflatador: I'll skip the SRE pairing session. There is a Search Deep Dive session that conflicts with it.
[13:11:48] OK
[13:11:52] also, greetings
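(For posterity, since the pastebins above aren't reproduced here: the commands from 00:13–00:23 have roughly the following shape. This is an illustrative sketch, assuming the jq filtering was run against `_cluster/allocation/explain` output; `$ES` is a placeholder for the cluster endpoint, not the actual command used.)

```bash
# Rough shape of the row-C filtering from 00:18-00:23 (not the actual pastebin):
# explain why a replica of shard 15 is unassigned, then keep only the per-node
# decisions for hosts whose `row` attribute is C.
curl -s -XGET "$ES/_cluster/allocation/explain" \
  -H 'Content-Type: application/json' \
  -d '{"index": "enwiki_content_1658309446", "shard": 15, "primary": false}' |
  jq '.node_allocation_decisions[]
      | select(.node_attributes.row == "C")
      | {node_name, node_decision, deciders}'

# And the loosening proposed at 00:13: raise the per-node shard cap for this
# index from 2 to 3 (sketch only; not necessarily what was ultimately run).
curl -s -XPUT "$ES/enwiki_content_1658309446/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.routing.allocation.total_shards_per_node": 3}'
```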
[13:33:27] dcausse, ebernhardson: for the image recommendation retro, we're missing the work done by the Search team around weighted tags in the timeline. Could you have a look? (link to the Miro board in the meeting invite)
[13:41:10] it goes back to 2020 :)
[13:41:37] hard to remember when all the weighted_tag stuff was done :P
[13:43:46] Added ryan-kemper's API call to the docs, feel free to edit if necessary: https://wikitech.wikimedia.org/wiki/Search#Anti-affinity_(shards_limit)_prevents_shards_from_assignment
[13:52:42] added two notes, one for when we added the weighted_tag pipeline around May 2021 and one for when Erik configured our jobs to pull data for imagerec around May 2022
[14:32:23] dcausse: thanks!
[15:20:37] not sure what to do with cindy... somehow typing into the input boxes doesn't work correctly anymore, it likes to miss spaces. And you can imagine how that might cause problems for the rest of the test :P
[15:23:59] from random googling, the webdriverio team closed a similar ticket as wontfix and said they send the correct API requests to chromedriver; the problem is in chromedriver. The confusing bit is that our chromedriver package is chrome 73, so I doubt it's been updated in years
[15:48:47] Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/815823 to start bringing the new hosts in
[15:52:41] Separately, noticing these shards stuck on codfw 9400:
[15:52:44] https://www.irccloud.com/pastebin/52kVu0qM/
[15:53:34] ryankemper: hmm, unclear where those -1's are coming from, but yesterday they simply stayed stuck there. Brian issued a cluster reroute with the 'cancel' command for those and it retried
[15:54:10] not sure it's worth digging too deep into it since we are about to switch versions of elastic
[15:55:23] (retried different indices/shards, I don't think it was zh_min_nanwikiquote or adywiki having issues yesterday)
[15:57:22] Ran these commands: https://www.irccloud.com/pastebin/i2qDhK5a/
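(The 'cancel' reroute mentioned at 15:53 has roughly the shape below — a sketch with placeholder index/shard/node values, not the contents of the pastebin above; `$ES` again stands in for the cluster endpoint.)

```bash
# Sketch of cancelling a stuck replica recovery so the cluster re-attempts the
# allocation; SOME_INDEX / SOME_NODE are placeholders.
curl -s -XPOST "$ES/_cluster/reroute" \
  -H 'Content-Type: application/json' \
  -d '{
        "commands": [
          { "cancel": { "index": "SOME_INDEX", "shard": 0, "node": "SOME_NODE", "allow_primary": false } }
        ]
      }'
```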
[15:59:49] dang, we've also got people in releng complaining about the beta ES cluster, will look into that when I finish working out
[16:07:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/816008 Step two, this will bring the new hosts in
[16:19:24] for the puppet deploy window (or sometime a bit later if busy, no particular rush but I'd like to get it all moving): first https://gerrit.wikimedia.org/r/c/operations/puppet/+/815781, then somehow monitor whether indexing actually changes over (tcpdump 9200 on some hosts and grep for apifeatureusage + _doc? not totally sure :P). After it's verified writing to _doc, ship
[16:19:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/815782/2
[16:21:29] the requests should be coming in reasonably quickly, I imagine a tcpdump for 2 minutes dropped into Wireshark should be able to verify it transitions. I suppose I'll test whether I can see the current apifeatureusage indexing requests
[16:24:22] oh, actually apifeatureusage logstash is talking to elastic over port 9200, so it's unencrypted and we could monitor traffic out of the apifeatureusage[12]001 hosts. Except I can't log in to those :P
[16:29:31] Cool, finishing up the other elastic stuff and then I can do the puppet deploy stuff with ya
[16:35:06] I suppose the only worry is that if it doesn't transition to writing to _doc, then the updated template will cause all writes to tomorrow's index to fail. Should be the only risk I can think of
[16:42:44] New codfw hosts aren't visible in pybal https://config-master.wikimedia.org/pybal/codfw/search, looks like there's a manual step to add them
[16:45:16] Ah of course, I forgot there's a hieradata entry for these
[16:49:44] (will circle back to update docs / ticket w/ the full process later)
[16:54:04] back
[16:56:03] Patch here for conftool entries: https://gerrit.wikimedia.org/r/c/operations/puppet/+/816017
[16:56:22] inflatador: I'm gonna join the puppet deploy window in a bit, but if you wanna get started on that, feel free
[16:58:12] ryankemper: looking at your patches now
[17:01:08] As an aside, 2066 is still having mgmt pw issues. Seems like the `ipmi-config` change isn't quite taking, not sure why
[17:01:23] Will circle back on that; clusters are green for now, so going to switch gears to the puppet deploy window stuff that ebernhardson had
[17:04:45] hopefully mine will be easy :)
[17:05:19] ebernhardson: ready to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/815781 whenever, could prob use help with the tcpdump cmd though
[17:05:41] ebernhardson: looking at your patch above
[17:06:19] ryankemper: my favorite tcpdump resource: https://danielmiessler.com/study/tcpdump/
[17:06:53] ryankemper: probably, from apifeatureusage1001, `tcpdump src port 9200 -w output.pcap`
[17:07:11] err, not src port, just port
[17:07:24] +1 to that tcpdump page
[17:08:50] then scp it locally and load it into Wireshark. In theory anything you do in Wireshark could be done with tshark on the CLI, but this is one of the rare things where I tend to find a GUI easier :)
[17:09:15] ebernhardson: inflatador: I'm in https://meet.google.com/iqe-wcuz-mpn jfyi
[17:09:47] oops, I guess I had the wrong link... omw
[18:52:30] hi Search team! I asked for advice with T301096 a while ago, but with really poor timing (the team was at an offsite that week, and I was on vacation the next several weeks); I'd like to raise that again now.
[18:52:31] T301096: Add a link: prioritize suggestions of underlinked articles - https://phabricator.wikimedia.org/T301096
[18:52:39] The issue is summarized in https://phabricator.wikimedia.org/T301096#8095690 but the tl;dr is: would it be okay to add a new `outgoing_links_count` field (with the number of outgoing links) to the search index?
[18:53:49] That would allow accessing the link count from scripts (a rescore function, specifically, to find articles which need more links). Not an elegant solution, but I can only see worse alternatives.
[19:09:13] hi team, I know the standup notes are kinda wonky right now with our OKRs kinda scrambled, but please provide updates soon so I can put in my team update on Asana. Thanks!
[19:43:53] back
[19:43:59] mpham: working on it now
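(Circling back to the capture idea from 16:19–17:08: since the apifeatureusage traffic on port 9200 is plain HTTP, a quick check of the pcap could look roughly like the hypothetical sketch below; the grep pattern assumes the bulk action metadata lines are visible in the payload.)

```bash
# Hypothetical check on the capture from 17:06, without opening Wireshark:
# bulk action metadata lines carry the target index and type, so counting the
# _type values on apifeatureusage lines shows whether writes moved to _doc.
strings output.pcap |
  grep apifeatureusage |
  grep -o '"_type":"[^"]*"' |
  sort | uniq -c
```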
[19:54:27] tgr: looking
[19:57:20] tgr: nothing particularly wrong with adding an outgoing_link_count field, but you'll have to wait 16 weeks after it's merged for the indices to be fully populated
[19:57:56] we could move that quicker (days) for a few small-ish wikis, but the background process that does it auto-magically across everything takes a while
[19:59:27] tgr: regarding getting the link counts without it, there probably isn't a good way. The data is still stored in elastic as a hidden source field, but that field contains a JSON blob that would have to be decoded for each result being scored, which would be much too expensive
[20:01:13] once things are indexed the arrays don't exist anymore; instead there is a large jump in the position (so ["abc def", "ghi"] is indexed as abc at pos 1, def at pos 2, ghi at pos 102)
[20:03:32] (I'll write that to the ticket with some clearer details)
[20:04:12] for the actual implementation, I think we can define a subfield of outgoing_links that treats each array element as a token and then counts the number of tokens
[20:06:34] oh, I suppose if we do it with an elasticsearch field mapping and not by sending data from mediawiki, then it happens as soon as everything is reindexed. Annoyingly I'm running a full-cluster reindex right now, but we can do another as necessary
[21:52:22] I see some more "-1.0%" recoveries on chi on codfw, checking now
[21:53:04] I'm also not entirely clear what that value describes, it doesn't quite line up with the columns in my request, ref https://phabricator.wikimedia.org/P31556#132764
[21:55:37] :S
[21:57:20] my google-fu is weak, all I can really find (looked before too) is people happy that they can kick it and get it to try again, but nothing about what is actually happening to cause them
[22:00:22] 4.8 hours ago is ~16:10Z, that's about the moment ryan merged the patch to add elastic20[64-72], but not entirely sure how that would cause problems
[22:00:39] ebernhardson: that's OK... upon further review, elastic2045 (which didn't reimage properly and doesn't currently exist in puppetdb) seems to be the common thread
[22:01:07] we have lots of capacity ATM so I think I'm just going to try and reimage it again.
[22:05:15] sounds reasonable
[22:28:13] ah OK, so 2045 somehow eluded its BIOS and NIC FW updates; just applied them now and it's in the process of reimaging
[22:42:35] ebernhardson: thanks for the explanation! is JSON-decoding the source field something I can do in a rescore function? For testing via manual Cirrus queries that would be useful, so we can check whether this gives reasonable results before creating a bunch of work for you.
[22:50:12] tgr: you could try params['_source']['outgoing_link'], but I don't know that I've specifically tried that. It's mentioned in a few different SO posts
[22:50:29] from a painless script_score in a function_score query
[22:51:39] I see mentions that it existed in some version of elastic but was eventually removed, not sure which version though
[22:52:06] thx, will test it
[22:53:37] tgr: sadly it probably won't work: "After v6.4, thanks to an unintentional side effect of refactoring, accessing _source in the script query context is not possible."
[22:54:31] but worth trying, script queries and painless script_score are slightly different
[23:01:49] tgr: actually there's a reasonable chance it will work, documented here under `The document _source`: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/modules-scripting-fields.html
[23:03:33] I think I could not get that to work (not that that means much)
[23:04:26] but will give it another go.
[23:09:04] tgr: see mwmaint1002.eqiad.wmnet:~ebernhardson/test.q, invoked as curl https://search.svc.eqiad.wmnet:9443/testwiki_content/_search?pretty -H 'Content-Type: application/json' -d @test.q
[23:09:35] tgr: specifically the bit inside the rescore; the rest is just to target the main page and make it easy to see that the score matches the outgoing_link count
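(For readers without access to mwmaint1002, the query in test.q presumably looks something like the sketch below; this is an illustrative guess, not its actual contents, and the match on the title is just a placeholder way to target a single page.)

```bash
# Illustrative guess at the shape of test.q: a rescore whose score is the length
# of the outgoing_link array pulled from _source via a painless script_score.
curl -s "https://search.svc.eqiad.wmnet:9443/testwiki_content/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
        "query": { "match": { "title": "Main Page" } },
        "rescore": {
          "window_size": 1,
          "query": {
            "rescore_query": {
              "function_score": {
                "script_score": {
                  "script": {
                    "lang": "painless",
                    "source": "params._source.outgoing_link == null ? 0 : params._source.outgoing_link.size()"
                  }
                }
              }
            }
          }
        }
      }'
```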
[23:13:20] OK, elastic2045 reimaged and back up. See ya tomorrow!
[23:13:29] inflatador: thanks!
[23:15:36] hm, I guess I tried it in expression, not painless
[23:19:40] tbh I'm slightly surprised painless allows that; their reason for inventing the painless language was that groovy gave people too many footguns where they could do terrible things to performance. I suspect this is there for the update scripts and it ends up also available at query time
[23:21:03] I suppose maybe it's also not as terrible with tiny 1k docs, but we have huge sources that repeat the page content in multiple forms
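(If the subfield idea from 20:04 pans out, the mapping change might look roughly like the sketch below. The type name, the parent field's `keyword` type, and the analyzer are all assumptions — CirrusSearch actually builds its mappings in PHP — and whether `token_count` yields a single total across all array elements rather than one count per element would need verifying.)

```bash
# Sketch of the token_count subfield idea from 20:04 ($ES, the type name "page",
# and the parent field's existing type are assumptions). The goal is to index a
# plain integer link count so no _source decoding is needed at query time.
# Caveat: behaviour on multi-valued fields (one count per element vs. a total)
# would need to be tested before relying on it.
curl -s -XPUT "$ES/testwiki_content/_mapping/page" \
  -H 'Content-Type: application/json' \
  -d '{
        "properties": {
          "outgoing_link": {
            "type": "keyword",
            "fields": {
              "token_count": {
                "type": "token_count",
                "analyzer": "keyword"
              }
            }
          }
        }
      }'
```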