[08:39:40] dcausse: While working on `CirrusSearch` so it can forward weighted tags via EventBus instead of Elastica, I noticed that `CirrusSearch` retrieves its dependencies via `MediaWikiServices::getInstance()`. According to `MediaWikiServices::getInstance()`’s doc block, it should not be used explicitly in constructors. Instead, constructors should get passed the parameters they need. So what’s the preferred way for 3rd parties to obtain an instance of `CirrusSearch`? Should they use `new` or should they obtain it via `MediaWikiServices`?
[08:53:17] pfischer: services should generally be injected in the constructor; seeing MediaWikiServices::getInstance() directly in the codebase is generally legacy code where injection was not properly refactored
[08:55:21] CirrusSearch is a SearchEngine so it's obtained via MediaWikiServices::getInstance()->getSearchEngineFactory()->getCirrusSearch() by third parties
[08:55:44] err: MediaWikiServices::getInstance()->getSearchEngineFactory()->create()
[08:56:02] dcausse: Thanks for clarifying. GrowthExperiments still instantiates CirrusSearch directly.
[08:56:31] then the client generally tests with `instanceof CirrusSearch` to call Cirrus-specific methods
[08:56:33] ah
[08:56:49] How about the CirrusSearch code base? Should Hooks.php call the constructor or use `MediaWikiServices`?
[08:57:19] pfischer: you mean for creating a CirrusSearch instance?
[08:57:27] dcausse: yes
[08:57:56] I think it depends...
[08:59:38] pfischer: if it's new code we should not add new stuff to Hooks.php
[09:01:24] No, it’s not new code, but if I refactor CirrusSearch and make SearchConfig a mandatory parameter, it has to be passed in all places where the constructor is called. If we hide that in ServiceWiring.php, instances could be obtained via `MediaWikiServices::getInstance()->getService('CirrusSearch')`
[09:03:07] pfischer: I think it's too dangerous sadly, the CirrusSearch constructor should be considered as a stable interface
[09:03:44] but I agree with you, it should have been considered "internal" in the first place
[09:04:29] changes you make to the CirrusSearch constructor should be forward compatible
[09:04:47] because it's used by other extensions
[09:05:29] I checked with codesearch and according to that, it’s just GrowthExperiments
[09:06:28] Oh, and Wikibase((Lexeme)?CirrusSearch)
[09:06:57] https://codesearch.wmcloud.org/search/?q=new+CirrusSearch%5C%28&files=&excludeFiles=&repos=
[09:07:09] introducing breaking changes is a bit painful sadly
[09:07:47] Sure, alright. I’ll roll that back and try to keep my changes minimally invasive.
[09:13:59] yes, there's clearly a design issue here: CirrusSearch is a kind of god object, and the `new CirrusSearch()` in fixLinkRecommendationData is only used to call resetWeightedTags
[09:14:54] Cirrus could perhaps expose a kind of CirrusSearchWeightedTagService
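For reference, a minimal sketch of the acquisition pattern described above, i.e. what a third-party extension would do instead of `new CirrusSearch()`: get core's SearchEngineFactory injected, call create(), and guard Cirrus-specific calls such as resetWeightedTags() with an instanceof check. The class name, method names, and the resetWeightedTags() argument below are illustrative assumptions, not actual GrowthExperiments or CirrusSearch code.

```php
<?php
// Illustrative sketch only; not actual GrowthExperiments or CirrusSearch code.
// SearchEngineFactory is core's factory class (its namespace may vary by MW version).

use CirrusSearch\CirrusSearch;

class WeightedTagResetter { // hypothetical class name
	private SearchEngineFactory $searchEngineFactory;

	// The factory is injected (wired up in ServiceWiring.php or an ObjectFactory
	// spec) rather than fetched with MediaWikiServices::getInstance() in here.
	public function __construct( SearchEngineFactory $searchEngineFactory ) {
		$this->searchEngineFactory = $searchEngineFactory;
	}

	public function resetTag( string $tagPrefix ): void {
		$engine = $this->searchEngineFactory->create();
		// create() only promises a SearchEngine, so Cirrus-specific methods are
		// guarded by an instanceof check before being called.
		if ( $engine instanceof CirrusSearch ) {
			// Hypothetical call; the real resetWeightedTags() signature may differ.
			$engine->resetWeightedTags( $tagPrefix );
		}
	}
}
```

The `MediaWikiServices::getInstance()` lookup then lives only at the wiring layer (ServiceWiring.php or the hook entry point), which is what the doc block quoted at the top asks for.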
[12:51:42] \o
[12:54:22] o/
[12:58:40] o/
[13:56:20] * ebernhardson tries to deciph how AbuseFilter works...
[13:56:23] decipher
[15:22:28] the sup producer is having issues
[15:23:40] hmm, looking
[15:27:01] no change in incoming rerender or page change event rates. Nothing stands out in the taskmanager logs
[15:27:05] Failed to invoke mw api at /w/api.php?action=cirrus-config-dump&prop=replicagroup%7Cnamespacemap&format=json&formatversion=2
[15:27:14] in CirrusNamespaceIndexMap
[15:27:19] InvalidMWApiResponseException: Not Found
[15:27:45] hmm
[15:27:57] of course it does not tell the wiki...
[15:31:49] where do you see that? Not seeing it in logstash or kubectl logs
[15:32:02] I have seen that before, but it was intermittent and seemed to solve itself
[15:32:16] ebernhardson: https://logstash.wikimedia.org/goto/174492371d4993a1ac6ebe6cf2227b76
[15:32:43] oh, i have to select flink ecs....
[15:33:19] started at 15:00...
[15:33:27] unsure if related to a deploy
[15:38:01] a 200 not found probably means it didn't even find /w/api.php?
[15:38:44] a broken domain name?
[15:39:28] maybe... poking around
[15:39:40] err, a non-200, since /w/api.php basically always returns 200
[15:41:32] seems we're not very flexible here and do not allow failing
[15:41:50] in testing, i can get a 404 if no host is provided, same if a random host is provided
[15:41:56] host header
[15:42:55] the code has `if (settings == null) {` but not sure we ever enter this
[15:46:55] we should catch IllegalStateException I think?
[15:47:26] and let the event be ignored later with proper logging
[15:48:01] hmm, but what can it do without the namespace map?
[15:49:16] the event should be thrown away
[15:51:21] `// With an incomplete target, this event will end up being dropped at the output stage.`
[15:51:56] i suppose plausibly. I worry that it will do that for some instance simply giving a bad response, but perhaps it's better than failing everything
[15:52:29] one silly option to see what's happening: i have a hacked up vendor that has envoy write request/response to /tmp inside the envoy container
[15:52:38] actually it will fail :)
[15:52:39] put it together to figure out the NetworkSession bits
[15:52:47] `throw new IllegalArgumentException("TargetDocument incomplete");`
[15:52:51] heh
[15:53:27] yes I think we need to know what's happening
[15:53:42] if it's a corrupted event we need to find a way to ignore those
[15:53:46] ok, will deploy to eqiad producer. sec
[15:53:56] (it's from a deployment-charts directory in my home dir)
[15:57:28] meh... i forgot the response body is base64 encoded, so it's tedious to check :P
[15:59:05] :/
[16:00:10] and there is no jq inside here :P will just page over these by hand..
[16:00:43] the 404 is against wikitech.wikimedia.org
[16:01:23] yea, we are not supposed to be getting events from wikitech, i thought?
[16:02:11] but that's the one, we have multiple 404's (probably retries), all against wikitech
[16:02:55] annoyingly, we have an include wikis filter, but not an exclude
[16:04:04] maybe if we added an http route for wikitech?
[16:06:50] checking the input topics w/ kafkacat + jq to verify these haven't come in before
[16:15:49] so far not even turning up the event that triggered the current problem...
[16:23:10] i suppose, adjust WikiFilter::createFilter to recognize wikis prefixed with - and use those as an inverse filter?
[16:25:29] ahh, found some. we started seeing events from wikitech at 2024-09-05T14:58:14Z in the page_change stream
[16:36:25] oh
[16:37:07] I saw some movement on this side with T371359 but was not aware that they started to ship events
[16:37:07] T371359: Migrate Wikitech's Jobqueue - https://phabricator.wikimedia.org/T371359
[16:37:26] i'm working up a patch to support exclusions in the wiki filter, just writing tests
[16:37:34] makes sense, thanks!
[16:38:13] need to prepare dinner, back in ~20mins
[16:45:59] https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/160
[16:50:49] i suppose the event is already in flink buffers, probably have to destroy and start back up from a few minutes before the first wikitech event?
[16:52:08] just saw an alert for Saneitizer, is this expected?
[16:56:02] hmm, no, the saneitizer should be on the non-broken side :P looking
[16:56:59] inflatador: oh, the fix-rate-too-high alert, that one is from prometheus? I forgot about that, we will have to fix it. Basically i changed the way data is collected there
[16:58:07] ebernhardson np, will silence for now. If I can help with a patch LMK
[16:58:36] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?viewPanel=35&orgId=1&from=now-6M&to=now here's the dashboard link from the alert
[17:03:42] inflatador: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1071004
[17:08:51] ebernhardson: thanks, merged
[17:30:16] dcausse are your puppet patches ready for review? Happy to look them over if so. Re https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070955
[17:34:00] yup, couldn't simply update the container. flink already had the event in the checkpoints. destroyed the eqiad producer and re-deploying with kafka-source-start-time ~15 min before the first wikitech event
[17:35:28] also looks like the saneitizer dashboard panel is still using graphite, I'll see if I can fix that with the prom metrics from the alert
[17:35:54] the saneitizer dashboard from prom will ideally want to split things out by the problem field
[17:35:58] (that's why i added it)
[17:36:32] ACK, will send you a link to the panel once I'm further along
[17:37:55] looks like the sup producer is flowing events again, at least the update_pipeline.update.rc0 topic has events. will see when it catches up
[17:50:56] consumers are still a bit backlogged, but should catch up soon enough
[18:23:54] Updated Saneitizer panel, let me know what you think: https://grafana.wikimedia.org/goto/zyOZoMeSR?orgId=1
[18:25:43] not sure how to show the total results of `sum by (problem) (increase(mediawiki_CirrusSearch_sanitization_total{action="fixed"}[1w]))` as "all" in the graph... checking
[18:26:28] inflatador: mostly good, in terms of showing "all" i think better might be to stack the lines?
[18:27:03] ACK, can do
[18:30:15] i also wonder if we can make it respect the `exported_cluster` config var we use in the dashboard. hmm
[18:31:04] i guess it's not exported cluster, what we care about is eqiad/codfw/cloudelastic
[18:32:23] * inflatador checks if we have a var for that already
[18:32:37] in the prom metrics that's recorded as search_cluster, but we don't have a direct mapping from the dashboard variables
[18:42:53] https://grafana.wikimedia.org/goto/TiunAM6SR?orgId=1 OK, slight update to use the newer panel type, stacking, and search_cluster var
[18:43:06] * inflatador crosses fingers that it works
[18:44:32] getting nearly the same results for `oldDocument` regardless of `search_cluster` value, is that expected?
[18:45:14] inflatador: yes, old document is not based on any state in elastic or sql. We simply flag 1 in n pages as old document each time we pass them, to ensure they have been re-rendered in the last n passes
[18:45:50] basically old document isn't actually a problem, it's correct operation of the process, but that's just kinda how it fits in
[18:46:22] makes sense
[18:48:42] otherwise, seems reasonable
[18:49:02] it might be nice if it could have a default-off behaviour for the oldDocument line, did we figure out a way to do that?
[18:49:36] I think so? ;P Will look at the dashboards again
[18:50:08] seems like "all" isn't useful either
[18:51:02] the set of things that don't have a problem field is indeed not useful, but i'm not sure where that comes from. Suggests a data collection problem
[18:51:13] when action="fixed" there should always be a problem set
[18:51:39] might be a misunderstanding on my part with how the prom stat collection works in mediawiki, will review
[18:53:11] Specifically I meant adding up "oldDoc" and all the other problems
[18:54:48] that doesn't seem to be a useful thing to put on the graph
[18:56:11] adding up all the problems other than oldDoc seems reasonable, at least that's what the alerts are based on. It also gives a top-level view of how much the updater is wrong
[18:56:23] but simply having them stacked is probably close-enough to adding them up
[18:57:15] looks like there is a way to hide oldDoc a la https://grafana.wikimedia.org/goto/2HxgJM6SR?orgId=1
[19:02:19] nice
[19:02:31] `sum by (problem) (increase(mediawiki_CirrusSearch_sanitization_total{action="fixed", search_cluster="$search_cluster", problem!="oldDocument"}[1w]))` is giving me weird results
[19:03:18] as in, the total query result is ~8000 but the `oldVersion` and `redirectInIndex` together only add up to ~160
[19:08:03] my assumption is that is for ones where problem="", which looks like a data collection issue
[19:11:33] I'm not sure yet how though :(
[19:12:23] there was one potential path, but david noticed and shipped a fix
[19:14:04] maybe the fix appears in the newer metrics... will try dropping down to 1h instead of 1w maybe?
[19:14:50] i've been looking at the 1h metrics in grafana explore, they are still coming in
[19:15:19] ;(
[19:17:25] i suppose missing from the graph is pageInWrongIndex, but it seems plausible it's missing because we haven't run into one yet
[19:17:42] and the code there is the same as everything else, no reason it would be recorded as problem=""
[19:21:36] maybe I just don't understand the expected output of `sum by (problem) (increase(mediawiki_CirrusSearch_sanitization_total{action="fixed", search_cluster="eqiad", problem!="oldDocument"}[1h]))`? Do the query results for specific problems seem reasonable?
[19:25:19] inflatador: everything looks reasonable except the extra line that doesn't have a problem, but instead has the full query as the name
[19:31:25] yeah, that's annoying. Working on hiding that
[19:33:49] well, that's the bit that i think is a data collection issue :)
[19:38:42] ohh... transform makes this easy. Amazing how much I forget each time I don't use grafana for a wk ;( https://grafana.wikimedia.org/goto/XWoO-MeIg?orgId=1
[19:39:32] inflatador: i would probably avoid hiding the ones with problem="" until we understand what they are. I can't yet rule out they are real problems
[19:39:35] and... just fixed the rate back to 1w
[19:43:48] I called the ones with problem="" `uncategorized`... LMK if that works, happy to rename if not https://grafana.wikimedia.org/goto/4xnL-M6Sg?orgId=1
[19:44:03] yea that's ok until we understand what they are
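As an aside on the problem="" mystery above: one way such a series can appear, sketched here with MediaWiki's Wikimedia\Stats counters, is a call site that increments the counter without ever setting the problem label. This is only a guess at the data collection issue, not the actual Saneitizer code, and the metric name, labels, and export behaviour below are assumptions; the Stats API usage is written from memory.

```php
<?php
// Illustrative sketch only; not the actual Saneitizer code, and the
// Wikimedia\Stats API usage below is from memory, so names may not match exactly.

use MediaWiki\MediaWikiServices;

$stats = MediaWikiServices::getInstance()->getStatsFactory()
	->withComponent( 'CirrusSearch' );

// Expected path: every "fixed" increment carries a problem label, so
// `sum by (problem)` splits cleanly into oldVersion, redirectInIndex, etc.
$stats->getCounter( 'sanitization' ) // assumed to export as mediawiki_CirrusSearch_sanitization_total
	->setLabel( 'search_cluster', 'eqiad' )
	->setLabel( 'action', 'fixed' )
	->setLabel( 'problem', 'oldVersion' )
	->increment();

// Hypothetical buggy path: an increment that never sets the problem label would
// surface as a series with an empty problem value, i.e. the "uncategorized"
// bucket on the dashboard.
$stats->getCounter( 'sanitization' )
	->setLabel( 'search_cluster', 'eqiad' )
	->setLabel( 'action', 'fixed' )
	->increment();
```

If that is what is happening, making every action="fixed" code path set a problem label would collapse the uncategorized series, but again this is only one possible explanation for the issue discussed above.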
[19:55:09] hmm, now that i think about it... i wonder if the lack of contextual per-language profiles is why the results came back as inconclusive when we tried re-running the wbsearchentities optimization
[19:55:24] i guess i'll have to check how that was run... but seems plausible