[08:17:13] o/
[08:20:02] o/
[10:04:39] lunch
[13:19:03] o/
[13:49:17] looking at this quorum issue... looking at host threads I see that the master that's trying to get elected is busy serializing the cluster state in mem (PublicationTransportHandler.serializeFullClusterState)
[13:49:52] checked two other master-eligible nodes and they seem busy applying regexes doing DiscoveryNodeFilters
[13:51:51] interesting
[13:52:02] routing allocation exclusions has many more nodes in eqiad than codfw
[13:52:27] it's empty in eqiad
[13:52:37] err in codfw
[13:52:41] in eqiad we have
[13:52:43] elastic1054-production-search-eqiad,elastic1054-production-search-omega-eqiad,elastic1054-production-search-psi-eqiad,elastic1055-production-search-eqiad,elastic1055-production-search-omega-eqiad,elastic1055-production-search-psi-eqiad,elastic1056-production-search-eqiad,elastic1056-production-search-omega-eqiad,elastic1056-production-search-psi-eqiad,elastic1058-production-search-eqiad,elastic1058-production-search-omega-eqiad,elastic1058-production-search-psi-eqiad,elastic1059-production-search-eqiad,elastic1059-production-search-omega-eqiad,elastic1059-production-search-psi-eqiad,elastic1063-production-search-eqiad,elastic1063-production-search-omega-eqiad,elastic1063-production-search-psi-eqiad,elastic1067-production-search-eqiad,elastic1067-production-search-omega-eqiad,elastic1067-production-search-psi-eqiad,cirrussearch1060-production-search-eqiad,cirrussearch1060-production-search-omega-eqiad,cirrussearch1060-production-search-psi-eqiad,cirrussearch1061-production-search-eqiad,cirrussearch1061-production-search-omega-eqiad,cirrussearch1061-production-search-psi-eqiad,cirrussearch1062-production-search-eqiad,cirrussearch1062-production-search-omega-eqiad,cirrussearch1062-production-search-psi-eqiad,cirrussearch1064-production-search-eqiad,cirrussearch1064-production-search-omega-eqiad,cirrussearch1064-production-search-psi-eqiad,cirrussearch1065-production-search-eqiad,cirrussearch1065-production-search-omega-eqiad,cirrussearch1065-production-search-psi-eqiad,cirrussearch1066-production-search-eqiad,cirrussearch1066-production-search-omega-eqiad,cirrussearch1066-production-search-psi-eqiad,
[13:52:53] oops
[13:53:26] I can clear that out, that's a holdover from the migrations
[13:53:59] in mtg now but can devote more time within the next hr or so
[13:54:44] \o
[13:54:51] o/
[14:03:00] pfischer: FYI, we decided to cover Ch 2 of the stats book next week
[14:13:14] meh: Vim: Caught deadly signal SEGV on cirrussearch1094
[14:13:35] works now, weird...
[14:26:49] that is weird, SEGV should be rare
[14:27:35] yes...
[14:27:49] happened only once tho
[14:28:01] calling view on a gz file
[14:31:47] hmm, hope it's not some hardware weirdness
[14:32:16] yes... could also be a "rare" bug in vim
[14:34:09] OK, out of meeting. Just removed all the banned hosts from cluster config
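(For reference, clearing a stale exclusion list like the one above is a cluster settings update. A minimal sketch in Python, assuming the exclusions live under cluster.routing.allocation.exclude._name in the persistent settings and that the cluster is reachable on localhost; the actual endpoint and settings level may differ.)

```python
# Hypothetical sketch: inspect and clear a stale allocation exclusion list.
import requests

ES = "http://localhost:9200"  # assumption: endpoint of the affected cluster

# Show the current exclusions, if any.
settings = requests.get(f"{ES}/_cluster/settings", params={"flat_settings": "true"}).json()
print(settings.get("persistent", {}).get("cluster.routing.allocation.exclude._name"))

# Setting the value to null removes it from the cluster settings entirely.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"cluster.routing.allocation.exclude._name": None}},
)
resp.raise_for_status()
```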
[14:34:44] thx, I'd be a bit concerned if we don't see the issue again tbh
[14:35:52] yeah, I doubt it's the cause but I def need to get better about config drift
[14:38:18] it's constructing one lucene automaton per banned node, nothing cached, should be pretty fast but if you multiply that by number of indices * number of nodes * number of election attempts perhaps this adds up
[14:39:15] could be, we can try a roll restart later today and see if it makes a difference
[14:39:30] or just restart a master, I guess that's enough
[14:40:17] sure, yes one master and two other non-master-eligible nodes should correspond to what we're doing in the cookbook
[14:41:24] ebernhardson: FYI, the page that triggered the Sudachi tokenizer bug has been deleted by the admins on that wiki. Apparently the user was also mocking someone. Anyway, it's no longer there as a live test case.
[14:54:03] my guess from translating the page was they were mocking the entire japanese culture/history
[14:54:20] the name basically referred to the japanese empire, and the page was a bunch of laughing emoji
[14:57:00] i also updated the docs on Help:CirrusSearch re:regex, hopefully it's sufficient
[15:00:35] thanks!
[15:04:13] separately, i wonder if we missed an opportunity to directly support \n, \t, etc. in regex as well
[15:04:46] partially because i noticed Help:Cirrussearch says to use [^ -􏿿] if you need \n
[15:05:35] really cirrus could do that as well i guess, but it helps to have a little machinery to know where you are in the regex. I'm pretty sure the search engine supports them, it's just that we have to send a real \n, not the escape sequence
[15:11:36] was not aware of \n workaround
[15:11:40] *this
[15:13:11] they had a \s workaround as well, [^!-􏿿]
[15:13:30] captures newline, space, tab, maybe a couple other control chars
[15:18:09] what's the char you're pasting in that regex? I get tofu when I try to copy/paste it
[15:18:52] i dunno, i just copy/paste :P It's documented as U+10FFFF
[15:20:43] interesting. U+10FFFF is the last code point (according to https://unicodebook.readthedocs.io/unicode.html , anyway)
[15:29:04] ryankemper finishing off the wdqs data xfers, thanks for getting those started
[15:29:54] inflatador_: ty!
[15:35:16] ryankemper small CR for bringing a CODFW host back to production if you have time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180888
[15:37:54] +1
[15:54:24] ryankemper ah, I was gonna self-merge just so we didn't forget. Not sure if your +1 made it
[16:12:16] lunch, back in ~1hr
[16:35:30] * ebernhardson is tempted to introduce black-fix and isort-fix to discolytics. I don't feel like manually sorting the import lines :P
[16:35:46] it's kinda nice to delegate exact formatting decisions and stop thinking about them
[16:40:18] +1
[17:06:31] dinner
[17:27:09] back
[17:39:14] hmm, so i have appropriately formatted dumps in hdfs now. Files are a bit awkward though, for example we have index_name=enwiktionary_content/part-15211-94efe55a-3241-42ea-957a-72b704f0abe7.c000.txt.gz
[17:39:22] (and 174 more files, just for enwiktionary_content)
[17:40:47] actually it has index_name=enwiktionary_content_1755154971, but will fix that and drop the numbered suffix
[17:42:00] but the final result is 33,891 files. Feels excessive :P
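(For a rough idea of why the file count is so high: the writer emits one part file per task that holds rows for a given index, so file count scales with tasks times indices. One way to cap it is to shuffle on the partition column before writing rather than coalescing, as the coalesce attempts below show. A hedged PySpark sketch; the table name, column names, and output path are made up for illustration and are not the actual discolytics job.)

```python
# Hypothetical sketch of a partitioned dump write with a bounded file count.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("discovery.cirrus_dump")  # assumption: (index_name, line) rows, line is a string

# coalesce() only glues existing partitions together; since most upstream
# partitions belong to different wikis, partitionBy() still fans them back out
# into one file per (task, index) pair. Shuffling on the partition column plus
# a small random bucket concentrates each index into a few partitions instead.
files_per_index = 4  # assumption: tune to keep the gzipped files a sane size
salted = df.withColumn("bucket", (F.rand() * files_per_index).cast("int"))

(
    salted.repartition("index_name", "bucket")
    .drop("bucket")                        # projection only, keeps the shuffle layout
    .write.partitionBy("index_name")
    .option("compression", "gzip")
    .mode("overwrite")
    .text("hdfs:///tmp/cirrus_dumps")      # assumption: illustrative output path
)
```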
[18:16:59] * ebernhardson tries again with a `.coalesce(1000)`...wonder how it will go
[18:19:53] lots of out-of-memory errors :P
[18:21:40] * ebernhardson had kinda hoped it would be equivalent to sequentially accessing the files, just from the same task instead of a new task...but i guess not
[19:30:43] and it turns out... .coalesce(4000) performs the work with fewer tasks, but only reduces the final output count to 28,730. Couldn't convince it to do 1000
[19:32:05] i suppose that makes sense, it's coalescing partitions but when choosing partitions randomly most partitions will be for different wikis (each partition is sourced from a single index)
[20:30:49] ryankemper had a question about the use of `blazegraph_instance wikidata_main` on the cookbook args for https://etherpad.wikimedia.org/p/wdqs-reload-T386098 , is that to get around the categories bug we were talking about?
[20:30:50] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[20:39:16] also, I guess our example wdqs queries don't return results on scholarly, I wonder what a good example query would be for scholarly? Digging around scholia but I haven't found one that returns results yet
[20:43:43] ah, https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_graph_split has a pretty good explanation
[20:47:07] basically the cats query, but with one of the q-items considered scholarly
[20:51:38] inflatador_: oops yeah that should definitely say scholarly for those last hosts
[20:52:59] ryankemper ACK, I hacked around it by manually creating data_loaded files with `scholarly_articles` but we should probably audit/fix
[20:53:15] some of them probably have `wikidata_main` when they shouldn't
[20:54:21] gotcha yeah we can fix it with cumin once the xfers are all done
[21:27:12] OK, we are gonna try and restart eqiad master again to see if we get the same error
[21:51:38] So we restarted the active master 3 times, no problem at all. Looks like d-causse's theory about the cluster state being too big was correct.
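(As a follow-up sanity check on the cluster-state-size theory, a rough Python sketch; it only measures the JSON rendering returned by the API, which is a proxy rather than the exact bytes serializeFullClusterState produces, and the endpoint is an assumption.)

```python
# Hypothetical sketch: get a rough feel for how large the cluster state is
# and which sections dominate it.
import requests

ES = "http://localhost:9200"  # assumption: endpoint of the cluster under test

full = requests.get(f"{ES}/_cluster/state")
full.raise_for_status()
print(f"full cluster state (JSON): {len(full.content) / 1024 / 1024:.1f} MiB")

# Per-section breakdown, e.g. index metadata vs the routing table.
for section in ("metadata", "routing_table", "nodes"):
    part = requests.get(f"{ES}/_cluster/state/{section}")
    print(f"{section}: {len(part.content) / 1024 / 1024:.1f} MiB")
```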