[13:14:06] \o
[14:16:51] hmm, bulk daemon has fallen over :( should be fun :)
[14:34:16] .o/
[14:36:06] oh yeah, sorry I missed that one. Although to be fair, I'm not sure I'd be too much help ;(
[15:00:09] curiously, only codfw even though both DC's update the same files. suggests an opensearch-ish thing
[15:06:27] and the silly answer is...it's trying to talk to elastic2* instances :S i wonder where it's getting those names
[15:13:35] inflatador: codfw cross-cluster hosts in cluster state are wrong. omega and psi have old and new names, chi only has old names
[15:13:59] i think it's working around the unreachable hostnames in omega/psi, but it's not able to find seeds for chi
[15:18:37] ebernhardson ACK, I'll start a patch on that
[15:23:13] * cormacparle waves
[15:23:16] me again
[15:23:43] I don't suppose it's possible to order a template search resultset by transclusion count? I suspect it's not, but thought I'd ask anyway
[15:25:06] cormacparle: hmm, no i don't think we have info about transclusion counts anywhere
[15:25:14] 👍
[15:26:28] if injected them into the elasticsearch docs for the templates, would we be able to order by the count then?
[15:28:55] cormacparle: to use for sorting it would have to be a dedicated numeric field. We could probably add one, but it would need an appropriate indexing pipeline(doesn't fit in weighted tags)
[15:29:14] cool, thanks Erik
[15:30:52] I guess you guys would have to provide that, so there'd be a dependency on your team?
[15:32:04] yea probably.
I'm not quite sure how it would be calculated either, maybe from parsoid html dumps that have html tags that say where something was transcluded from
[15:32:29] i don't think it's the kinda thing mediawiki has in the db that we can just query out
[16:00:08] headed to early lunch, but I created T393100 to fix the cross-cluster stuff
[16:00:08] T393100: Fix cross-cluster seed settings in CODFW - https://phabricator.wikimedia.org/T393100
[16:19:54] huh, turns out java regex considers [^] to be a syntax error, as an unclosed character class
[16:20:02] but i'm certain other engines treat that as the literal ^
[16:20:55] or i'm seeing references that in javascript it is the equivalent of ., since it's the negated empty character class. I guess we ignore that one :P
[16:22:04] > i don't think it's the kinda thing mediawiki has in the db that we can just query out
[16:22:35] isn't that what `templatelinks` is? a transclusion count? or have I been miscontruing that table all this time?
[16:23:21] or rather - not a transclusion count, but something we could calculate one from
[16:23:31] cormacparle: hmm, maybe. I'm not super familiar with templatelinks
[16:24:02] cormacparle: if the value to sort on can be queried from the db it's much easier, we have a place in php to add it and it should only be a few lines of code
[16:24:29] (it still takes 8-12 weeks to populate the index though)
[16:24:45] or maybe 16..i should really be better at remembering what the cycle time we set on that was
[16:24:50] :D
[16:35:35] `SELECT count(*) FROM templatelinks INNER JOIN linktarget ON tl_target_id = lt_id WHERE lt_title = '' AND lt_namespace =0` should do it I think
[16:36:42] <ebernhardson> looks to be reasonably indexed as well, should be plausible
[16:37:01] <ebernhardson> (dba's have complained before about expensive queries from cirrus indexing)
[16:59:46] <inflatador> ebernhardson were you looking at a chi master when you saw the elastic* in the chi seeds?
Reason I ask is that I'm looking at at psi master and I'm seeing the chi/omega seeds with old and new hosts, but the self (psi) seeds are elastic* only
[17:03:25] <inflatador> ah nm, I see what you're saying. It's the search.remote data as opposed to the cluster.remote data
[17:03:31] <ebernhardson> inflatador: yea i was looking at chi
[17:04:16] <inflatador> search.remote is still pointing to elastic*, cluster.remote what I was looking at above
[17:04:35] <ebernhardson> essentially the bulk daemon uses the search cluster we provide it as a bootstrap host, it collects all the hosts from cluster starte in `persistent.cluster.remote` as the full set of clusters
[17:04:41] <ebernhardson> s/starte/state/
[17:05:06] <ebernhardson> it expects to find all three clusters there
[17:06:20] <inflatador> ACK, so `persistent.cluster.remote` is indeed a problem. Do we use `search.remote` for anything?
[17:06:28] <ebernhardson> inflatador: no, thats elastic 5.x
[17:06:45] <ebernhardson> or maybe it was 6.x, but basically it's deprecated and replaced with cluster.remote
[17:07:15] <ebernhardson> i have some vague memory that we had dificulty removing it from the cluster state because the cluster doesn't recognize it as a valid thing we can change
[17:08:18] <inflatador> yeah, that sounds vaguely familiar to me too
[17:08:57] <inflatador> I think this is what needs a tweak https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/opensearch/manifests/cross_cluster_settings.pp#24
[17:10:59] <inflatador> and https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/icinga/manifests/monitor/elasticsearch/cirrus_settings_check.pp for the check
[17:29:26] <inflatador> my patch is failing https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/icinga/spec/defines/cirrus_settings_check_spec.rb , not sure why yet
[17:30:06] <ebernhardson> looking
[17:31:21] <ebernhardson>
inflatador: looks like the change is expected, you would only need to add the new lines around line 51
[17:36:33] <inflatador> Thanks for the advice. Still not 100% sure I set it correctly, but here goes
[17:39:55] <inflatador> OK, it's happy now. Feel free to give it a review if you have time
[17:40:17] <ebernhardson> this wireless phone charger is so weird...it only works if the usb-c cable is plugged in one way, if i turn it over (which should be irrelevant), it constantly reboots instead of charging the phone
[17:40:26] <ebernhardson> sure, checking
[17:46:14] <inflatador> I had a wireless charger on my dearly missed Palm Pre ;) . Haven't tried 'em since then
[17:47:05] <ebernhardson> on my last phone the usb port eventually stopped working, and i could only charge it wirelessly. I guess since then i try and wirelessly charge to avoid wearing out the phones USB port
[19:10:51] <ebernhardson> hmm...cirrus-highlighter has the ability to highlight with the java regex instead of the lucene regex...almonst wonder if that would be simpler than trying to integrate the transformations into the highlighter
[19:11:21] <ebernhardson> the annoyance is that it's a bit fiddly since we add reserved characters, highlight them, and then need to exclude them from the highlighted output
[19:41:20] <inflatador> Just merged the cross-cluster patch and manually ran the script, looks like it's working now
[19:46:32] <ebernhardson> hmm, might have to wait for it to update cluster settings i guess? still see the elastic hostnames in codfw cluster state
[19:46:53] <inflatador> oh yeah, I just ran against psi
[19:52:30] <inflatador> OK, I ran against chi and omega...LMK if it looks OK
[20:02:37] <ebernhardson> yea looks reasonable, and the daemon looks to have stopped restarting
[20:04:12] <inflatador> cool.
I resolved the ticket, feel free to reopen if anything related crops up
[20:20:02] <ryankemper> inflatador: quick post-wdqs-internal-removal alias patch for ya: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140535
[20:24:14] <inflatador> ^^ +2'd/merged
[20:37:58] <inflatador> low priority patch up for review as time permits: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140537
[20:39:17] <Trey314159> ebernhardson: just for funsies, I checked all the tokenizers, and \u0000 and \u0001 do get tokenized by the smart_cn tokenizer (from the command line). Trying to search for them on zhwiki, it looks like some layer between the browser and the analyzer—browser, MediaWiki, Cirrus, etc.—isn't letting the characters through, though.. each eventually gets replaced by � in the query, even if I encode it in the URL. So it *probably*
[20:39:17] <Trey314159> doesn't exist in the index, but it might not be impossible. Hmmmmm.
[20:43:11] <ebernhardson> Trey314159: hmm, thanks for checking! Although i'm using \uE000 and \uE001, in the utf8 reserved spae
[20:43:13] <ebernhardson> space
[20:43:29] <Trey314159> Ahh...
[20:44:13] <ebernhardson> i imagine smart_cn will treat it similarly though
[20:45:18] <ebernhardson> or i guess it's called the "Private Use Area of the Basic Multilingual Plane"
[20:47:21] <Trey314159> Yeah.. Looks like 6 articles on zhwiki have \uE000 ad three have \uE001. They do already screw with the highlighting. There are results on zh wikibooks, wikisource, and wiktionary, too. Fun times!
[20:51:24] <ebernhardson> i should stop being surprised by us having things in content that shouldn't exist
[20:51:39] <ebernhardson> how do you accidentally use reserved utf8 :P
[20:56:35] <Trey314159> The private use area isn't really reserved, is it? I thought you could use it for whatever you want without conflicting with anything "official". I know people have used it for scripts/fonts for their conlangs.
At least one of the pages with those characters is about the relevant unicode block.
[20:57:01] <Trey314159> But, yeah, I'm never surprised anymore that something exists in a wiki.
[21:10:03] <ebernhardson> well, it's reserved in the sense that unicode will never use that space. But yes other people can use it for anything
[21:13:37] <Trey314159> It's not searchable on enwiki, but both enwiki and zhwiki use E000-E07F for Tolkein's Tengwar! (Looks like they have a custom font on that page. Nerds gonna nerd.)
[21:13:41] <Trey314159> And just to make sure things are complicated, there's an unofficial "standard" for con-script assignements in the private use area, so at least those users won't step on each other.
[21:16:31] <ebernhardson> hmm i guess i can use a more random-ish value instead of the first two
[21:19:52] <Trey314159> those might actually be random enough. I doubt they occur in titles very often
[21:26:02] <pfischer> Hi, I would like to release our team report on behalf of Guillaume tomorrow and would appreciated your input. ebernhardson: would you want to add sth. to https://etherpad.wikimedia.org/p/search-standup or shall I summarise based on your tickets (which is fine, too)?
[21:28:07] <ebernhardson> pfischer: i suppose you can summarize, basically i've been working on the additional regex syntax for most of the week
[21:28:47] <ebernhardson> currently trying to understand how the highlighter works...it's a bit hard to follow...but i'll get there :)
[21:32:20] <ebernhardson> i suppose also released the update to mjolnir, but that's all tech debt
[21:38:07] <ebernhardson> apparently U+FDD0 to U+FDEF are "noncharacters", there are 66 total. Not sure if that would be better
[21:38:29] <ebernhardson> https://www.unicode.org/faq/private_use.html#nonchar1
[21:50:07] <pfischer> ebernhardson: thanks, is the highlighter you are looking into the OpenSearch one or our custom ES one?
And this is about getting the highlights for reg exps with ^ and $ rendered as expected?
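
Editorial note on the marker-character discussion above (20:43 to 21:38): a minimal Python sketch, not the cirrus-highlighter code, for telling Private Use Area code points apart from Unicode noncharacters and for checking whether candidate markers already occur in some text. The helper names `is_noncharacter`, `is_bmp_private_use` and `find_marker_collisions` are hypothetical and exist only for illustration; the ranges are the ones given by the Unicode standard (U+E000..U+F8FF for the BMP Private Use Area; U+FDD0..U+FDEF plus the last two code points of each of the 17 planes for noncharacters).

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the Unicode noncharacters."""
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF at the end of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_bmp_private_use(cp: int) -> bool:
    """True for the BMP Private Use Area, U+E000..U+F8FF."""
    return 0xE000 <= cp <= 0xF8FF


def find_marker_collisions(text: str, markers: str) -> dict:
    """Count how often each candidate marker character already occurs in text."""
    return {f"U+{ord(m):04X}": text.count(m) for m in markers}


# Count noncharacters across the whole code space (U+0000..U+10FFFF).
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)  # 66, matching the figure quoted in the log

# General categories: private-use is "Co", noncharacters are unassigned ("Cn").
print(unicodedata.category("\uE000"), unicodedata.category("\uFDD0"))
```

Something like `find_marker_collisions(page_text, "\uE000\uE001")` could be run over page content before committing to a marker pair; whether that fits the actual indexing pipeline is an open question, not something the log settles.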