[07:43:00] not sure I understand what's happening to sup-consumer@cloudelastic either
[07:43:25] it's pushing data to elastic but indeed nothing comes from the sanitizer
[07:44:17] seeing some "index_not_found_exception" for index otrs_wikiwiki_general (private), and I wonder where they're coming from if not from the sanitizer
[07:52:40] perhaps not
[07:55:57] eqiad.cirrussearch.update_pipeline.update.rc0 does have private wiki events
[07:57:06] they should perhaps go to a private stream or be properly flagged
[07:57:30] but that does not explain why the sanitizer is broken on cloudelastic...
[08:22:48] ah, could be because of the new wikiid filter that allows prefixing a "-" to exclude some? and somehow we never shipped the consumer to producer-(eqiad|codfw)
[08:25:55] yes, might be it, consumer-search shows a diff on the wikiid filter, consumer-cloudelastic only for the restartOne (manual restart)
[10:07:18] lunch
[13:12:59] o/
[13:23:51] ^^ should we alert on ES sink slowdowns?
[14:00:33] \o
[14:01:54] inflatador: i'm not sure, it had already alerted on the update rate being too low when that was happening
[14:04:49] o/
[14:04:58] dcausse: thanks! for some reason i thought the consumer was already using WikiFilter without having looked...
[14:05:10] np!
[14:11:49] * ebernhardson realizes it was a bit silly to return a tuple of include/exclude, just to immediately unwrap it in another .map() call... but whatever
[14:12:16] it seemed conceptually separate pieces :P
[14:16:00] :)
[14:17:45] was a bit surprised as well, but could not find solid arguments against the approach nor think too much about a possible alternative
[14:18:55] would just make it all one big map call, instead of .map() followed by .map() i guess. I suppose i was thinking they served two different purposes (parse / apply)
[14:19:19] but it's not like they are generic functions that can be separately invoked
[14:21:09] ah true, that'd make the lambda a bit bigger
[14:21:26] but yeah, no real need to use the tuple anywhere else...
[14:22:39] dcausse: separately, would there be value in adding some sort of node visitor to limit the number of deepcat's in a single query? Realized you can do `deepcat:a deepcat:b deepcat:c deepcat:d deepcat:e ...`
[14:22:41] when chaining lambdas with streams you tend to refrain from writing big blocks
[14:23:02] maybe post-expansion and limiting the total category count (not sure how viable)
[14:23:52] \o
[14:24:01] maybe it's unlikely enough that a user would actually do that, but i could imagine them mixing include and exclude deepcat's
[14:24:10] ebernhardson: haven't thought about that, but yeah... might be easy to create a crazy query...
[14:25:20] I wonder if it's worse to chain plenty of insource:// or plenty of deepcat
[14:25:41] in my testing they worked ok, could actually issue 100k categories in a single query and get a result in ~20s. It actually only really showed up in metrics when repeated over and over
[14:26:43] which also seemed odd, and i can't explain it. I wouldn't expect a single query (fanned out to 32 shards) to be able to push overall metrics around much. Should just be 32 threads + 1 coordinator thread
[14:27:13] the way we assess the cost is quite simple today... inspect the query with some heuristic and assign a pool counter
[14:27:57] ebernhardson: yes... I don't really know why
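For reference, a rough Lucene-level sketch of the two query shapes being compared in this thread: a bool query with one SHOULD term clause per expanded category, versus the single terms query that comes up just below. This is not the actual CirrusSearch deepcat implementation; the "category" field name and the terms are placeholders. Per the later part of this log, Lucene only rewrites a TermInSetQuery back into a plain boolean query when the term count is at or below the 16-term BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD.

```java
// Illustrative only; field name and query construction are assumptions, not CirrusSearch code.
import java.util.List;
import java.util.stream.Collectors;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

public final class CategoryQuerySketch {

    /** One SHOULD clause per category; subject to the configured max boolean clause count. */
    static Query asBooleanOfTerms(List<String> categories) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (String c : categories) {
            b.add(new TermQuery(new Term("category", c)), BooleanClause.Occur.SHOULD);
        }
        return b.build();
    }

    /** A single terms query over the same values; constant-score rather than per-term scoring. */
    static Query asTermsQuery(List<String> categories) {
        return new TermInSetQuery("category",
            categories.stream().map(BytesRef::new).collect(Collectors.toList()));
    }
}
```

The two are not interchangeable for scoring (TermInSetQuery is constant-score), which may or may not matter for a filter-style keyword like deepcat.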
[14:28:23] unsure if it's the boolean logic that gets too crazy when pushed too much
[14:28:50] even that though, conceptually i would have expected it to simply spin the coordinator thread more, shuffling data around.
[14:28:57] was meaning to test the terms query at some point to see if there's a big diff, but haven't had time yet
[14:29:37] maybe some oddity in elastic not expecting many kb's of serialized query
[14:30:56] I was thinking that the query nodes could do some crazy work trying to optimize the chain of nested booleans
[14:31:37] hmm, i suppose i was thinking that kind of thing would be limited to a single thread, where serialization might go through some inter-node shared work thread
[14:32:43] i mean the boolean optimization would be limited to the coordinator thread, or maybe the per-shard threads, but shouldn't be able to affect the unrelated queries
[14:32:51] indeed, could also be the coordinator rewriting the es query into a lucene query (calling the field analyzers 10k times)
[14:33:14] ahh, yea perhaps
[14:34:02] would have to check, but it's possible that new instances have to be created per analyzer call
[14:35:00] maybe if i have time i'll adjust my script to pre-analyze and see if it has the same behaviour with terms queries. Not super important, but an interesting curiosity
[14:41:22] working on T359423 ... as far as external services, staging rdf-streaming-updater just needs kafka-main-eqiad and flink-zk-eqiad, right?
[14:41:23] T359423: Migrate charts to Calico Network Policies - https://phabricator.wikimedia.org/T359423
[14:43:00] ebernhardson: sure, also randomly interesting: some comments I saw in the lucene code that discourage users from trying to "workaround" some limitations: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java#L330
[14:44:48] interesting. And that limit is 16
[14:45:06] (which we might accidentally pass in other places?)
[14:45:57] oh wait, i'm looking at a different constant with the same name
[14:48:24] and apparently i haven't looked at lucene code locally in a while, i still have lucene 7 checked out, we are on 8.7.0
[14:52:19] ahh, it is the same, in 8.7.0 that references TermInSetQuery.BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD, which is also 16
[14:56:30] yes, not sure that this particular comment is affecting us in any way, but I'm thinking that lucene devs have some expectations about how users build their queries, and it's probable that the way we use the bool query is hitting some bottlenecks
[15:03:51] ebernhardson: triage https://meet.google.com/eki-rafx-cxi?authuser=1
[15:04:23] pmw
[15:04:29] pfischer: ^
[15:04:53] pfischer: disregard
[16:12:51] ebernhardson: I see you launched a release for the sup, will you deploy or should I?
[16:13:25] something I noticed is that we seem to disable the sanitizer everywhere for private wikis, should we only disable it for cloudelastic?
[16:15:23] dcausse: i can deploy it
[16:15:29] thx!
[16:15:38] dcausse: oh, yea i suppose we can turn saneitizer back on for private wikis too, will do.
[16:29:57] FYI, I'm working on migrating the flink-app network policies to the newer calico standard. No action at the moment, just wanted to give a heads-up https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1071648
[16:53:58] dinner
[17:07:00] fix looks to have worked for the cloudelastic saneitizer, shipping everywhere
[17:23:07] i wonder if SanitySource should avoid clearing unnecessary wiki state for some predefined time (2 weeks?) and flag those wikis as disabled instead. Use case would be that the cloudelastic saneitizer started over, which means the 16 week cycle restarts today and the later pages will be delayed from their usual cycle
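A very rough sketch of that retention idea in plain Java, not the actual SanitySource/Flink state code, and with made-up names and structure: state for wikis that drop out of the filter is kept for a grace period and marked disabled, so a wiki re-enabled within that window resumes its cycle instead of starting over.

```java
// Hypothetical sketch only; class, field, and method names do not reflect the real SanitySource.
import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.Map;

public final class WikiStateRetentionSketch {
    static final Duration GRACE_PERIOD = Duration.ofDays(14); // the "2 weeks?" floated above

    /** Per-wiki saneitizer position plus bookkeeping for the retention idea. */
    static final class WikiState {
        long loopPosition;  // where in the 16-week cycle this wiki currently is
        boolean disabled;   // true while the wiki is filtered out
        Instant lastSeen;   // last time the wiki matched the filter
    }

    /** Instead of clearing state for filtered-out wikis, flag it and only age it out later. */
    static void reconcile(Map<String, WikiState> states, Iterable<String> activeWikis, Instant now) {
        for (String wiki : activeWikis) {
            WikiState s = states.computeIfAbsent(wiki, w -> new WikiState());
            s.disabled = false;
            s.lastSeen = now;
        }
        Iterator<Map.Entry<String, WikiState>> it = states.entrySet().iterator();
        while (it.hasNext()) {
            WikiState s = it.next().getValue();
            if (s.lastSeen == null || s.lastSeen.isBefore(now.minus(GRACE_PERIOD))) {
                it.remove();        // only drop state once the grace period has passed
            } else if (s.lastSeen.isBefore(now)) {
                s.disabled = true;  // recently filtered out: keep the position, mark disabled
            }
        }
    }
}
```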
[17:56:10] lunch, back in ~1h
[18:19:00] dcausse: realized the plausible cause of private rerenders in the public stream is my patch to cirrus-rerender; i ran a set of rerenders sourced from the dumps and it always produces to the regular rerender stream.
[18:22:25] ebernhardson: how do linkrecommendations make it into elasticsearch these days?
[18:22:52] is it SUP? or in airflow somehow?
[18:23:21] ottomata: streaming updater, i would have to check which topic. We are trying to migrate them all to a unified event stream for all of our external-tags
[18:23:29] right
[18:23:49] probably mediawiki.revision.recommendation_create
[18:24:14] so, an mwmaint cron job asks elastic for articles without recommendations, then gets recommendations from the linkrecommendation service, then produces mediawiki.revision.recommendation_create ?
[18:24:35] (reading https://phabricator.wikimedia.org/T268803 )
[18:24:52] ottomata: tbh i'm not sure how the first half works before it gets to us. That seems plausible
[18:25:07] oh, but all elastic has is whether a page has link recs? right?
[18:25:29] ottomata: right, we just have a tag on the pages that says if they should be returned in a link recommendation query
[18:26:23] can generically find them with https://en.wikipedia.org/w/index.php?search=hasrecommendation%3Alink&title=Special:Search or can add additional search terms to get a narrower list
[18:27:30] okay, i think I see. so I think: refreshLinkRecommendations.php handles both updating ES with the flag, and also writing the results from the linkrecommendation service to a MariaDB table
[18:27:35] i think...
[18:28:20] hmm no. hmm? something
[18:28:22] anyway, thank you...
[18:28:23] i'll figure it out
[18:28:30] that seems about right :) sorry i can't help more
[18:40:13] hmm, UIDGenerator (which we use in logging) was removed a couple days ago, deprecated in 1.35 apparently
[18:40:29] looking for what is supposed to replace it ...
[18:46:09] oh, it was fixed. Just not in the patch i was looking at
[18:47:13] sorry, been back
[19:54:47] security update on my laptop... rebooting
[19:57:19] ...or not. I said it was OK to reboot, then nothing. Well, if I disappear you'll know why!
[20:03:00] * ebernhardson has mixed feelings on "cute" error messages like this... but it's just developer tooling, it doesn't really matter: Oh no! 💥 💔 💥
[20:04:07] dr0ptp4kt: ping for meeting
[20:04:34] ebernhardson: coming! never start writing in a Doc after getting a gcal alert
[20:07:06] the emoji stuff can get annoying when you're dealing with bugs
[20:23:58] ebernhardson dr0ptp4kt any objections to me deploying staging-commons rdf-streaming-updater today? I was going to test what happens when we remove the legacy network policies in favor of calico policies, ref T373195
[20:23:58] T373195: Migrate Search Platform-owned helm charts to Calico Network Policies - https://phabricator.wikimedia.org/T373195
[20:24:25] inflatador: should be fine, monitor dashboards, roll back if necessary
[20:26:26] inflatador: would you please back up the .jnl from each of full, main, and scholarly, and plop them into hdfs beforehand? not too worried, but just in case. do you have a means to validate this in a lower environment?
[20:31:46] dr0ptp4kt we'll only be touching staging commons, so the WDQS environment won't be touched. Worst-case scenario is that the updater loses connection to kafka and/or zookeeper... which I don't think is a problem? ebernhardson does the staging updater actually write anywhere?
[20:33:09] inflatador: i don't actually know where it writes :*
[20:33:10] inflatador: ok, no worries then. ebernhardson and i are on a meet right now, so we just read your irc and reply
[20:33:11] :(
[20:33:22] i was thinking maybe wdqs1009? but could be old memories
[20:34:42] i think ryankemper may know the topology for staging (just off the top of my head) from the recent migrations - i see y'all have a meeting to cover that, in case you want to see that any sinks are healthy after application to staging
[20:35:06] yeah, we can review that. It's a good idea to know regardless ;)
[20:35:29] heh, yes. i reckon it's in puppet or some other repo, just trying to work on a few things
[20:36:08] yeah, my guess is there are no streaming updater processes on any w[cd]qs hosts that actually subscribe to the staging updates, but never hurts to check
[20:41:31] The only true test host atm is wdqs2025
[20:44:55] d-causse made a good point on the last CR just now... I'm going to remove the changes I made to helmfile.d in my last patches, just to keep the rdf-specific stuff away from the generic flink-app chart changes
[21:44:14] network policy test was a success! The first time we did it, we had to delete the new pod to get the policies to apply, but I've seen that happen before