[13:22:14] \o
[13:26:43] o/
[13:30:48] ebernhardson: I checked the recommendations created since midnight August 19th and all page_ids that show up in the stream show up in cloudelastic enwiki_content with the expected weighted_tag set. However, since some of the _docs have timestamps prior to August 19th, they might have that tag already.
[13:31:10] s/have/have had/
[13:31:22] pfischer: that's at least a little promising.
[13:32:22] inflatador: meeting with Tajh running late
[13:32:31] gehel ACK, np
[13:34:57] inflatador: and I'm there!
[13:39:14] pfischer: have you gotten ahold of the growth team yet about running their script? I suspect they wouldn't mind us running it, but maybe check with them first. It's `mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=dewiki --search-index --verbose`
[13:43:22] * pfischer finally found the issue with the kafka record counting and feels stupid…
[13:43:44] i do the same regularly, don't worry about it :)
[13:44:06] hehe, thank you, that helps a bit
[13:57:40] yet another dashboard question: do we still use https://grafana.wikimedia.org/goto/MM00wljSR?orgId=1 (cirrussearch dedupe)? If so,
[13:59:08] inflatador: let me check
[13:59:42] - guessing the metric we want is `mediawiki_CirrusSearch_result_file_duplicates_total` ?
[14:02:22] another question: does this need to be its own dashboard, or could we maybe put this into "Elasticsearch Percentiles"?
[14:04:01] the initial ticket was https://phabricator.wikimedia.org/T341227, it's not really a problem to respond to so probably doesn't need to be on the percentiles dashboard. As for whether it's needed, not sure. Guessing Peter has better ideas
[14:05:16] inflatador: I’m not sure we still need this. What it tells us is that we get duplicate file results (and how many), see https://phabricator.wikimedia.org/T341227
[14:07:02] Since the observation period of one month is long over, and we didn’t get any complaints, we might get rid of that metric after all. I’ll create a ticket.
[14:07:45] pfischer ACK, sounds good. Seems like we could resurrect with the prom metric in the future if necessary
[14:08:00] sure
[14:25:14] i emailed a link to the draft roadmap, reviews appreciated.
[14:26:46] Trey314159:
[14:26:56] gehel:
[14:27:09] :)
[14:27:12] fat fingers!
[14:27:46] no problem
[15:14:53] ebernhardson: I looked at the code of the GrowthExperiments maintenance script: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/GrowthExperiments/+/refs/heads/master/maintenance/fixLinkRecommendationData.php#158 - They call `cirrusSearch->resetWeightedTags`, which in turn uses the job queue to call `DataSender#sendUpdateWeightedTags`. But I thought they already leverage the `mediawiki.revision-recommendation-create` stream. After all, we do see events coming through it. Would we discard the job queue code in favour of publishing events via EventBus?
[15:15:59] pfischer: hmm, i thought it was going via eventbus, but indeed that code looks to talk to elastic directly. looking
[15:18:38] That script only resets weighted tags via CirrusSearch, I can’t see any call to `updateWeightedTags`
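(Editor's note: a minimal sketch of how one might spot-check the `mediawiki.revision-recommendation-create` stream discussed above. The `kafkacat` tool, the datacenter-prefixed topic name, the broker host, and the `page_id` field name are assumptions, not confirmed from the log.)
```
# Sketch: consume the last ~200 events from the recommendation-create stream
# and list the page_ids they carry. Assumes kafkacat and jq are installed,
# the topic uses the usual "eqiad." prefix, and events expose a page_id field.
kafkacat -C -b kafka-broker.example:9092 \
  -t eqiad.mediawiki.revision-recommendation-create \
  -o -200 -e \
  | jq -r '.page_id' | sort -n | uniq
```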
[15:19:07] pfischer: oh, i remember. We had the resets go directly to elasticsearch because the delay was too long for the use cases, but it used to be a several hour delay. Not sure if the new <10min would make it work better
[15:19:20] i would ideally like to keep things all flowing through the same codepaths, instead of having different bits for different purposes
[15:21:49] indeed i'm not seeing where they send the updates :S
[15:25:21] pfischer: looks like i was perhaps mistaken on which scripts they use, fix looks to reconcile and clear out recommendations that are no longer needed
[15:25:47] pfischer: there is also maintenance/refreshLinkRecommendations.php and maintenance/revalidateLinkRecommendations.php which use the LinkRecommendationUpdater service, which followed a few levels down uses EventGateSearchIndexUpdater
[15:52:10] workout, back in ~40
[16:16:53] ebernhardson: Thank you. That looks more like it. Since `SearchIndexUpdater` / `EventGateSearchIndexUpdater` only supports adding weighted tags, we/they’d have to make that interface capable of clearing weighted tags too before a migration away from `CirrusSearch#resetWeightedTags` becomes possible.
[16:19:25] ebernhardson: T366253 does not list ores/liftwing as a source of weighted tags, but I assume they should flow through the new stream in the end, right?
[16:19:26] T366253: Create a generic stream to populate CirrusSearch weighted_tags - https://phabricator.wikimedia.org/T366253
[16:20:12] pfischer: as a random other thought (perhaps a few days late now), it might be useful if the weighted tags stream could indicate if the updates are "revision related" or not, with the idea being that updates unrelated to revision updates are unlikely to be deduplicated and could skip that step
[16:20:55] err, s/deduplicated/merged/
[16:21:32] ebernhardson: Sounds good, that could bring down the latency. Since the schema is under development, changes are less problematic, so I’m happy to add that flag.
[16:21:33] mostly just thinking about if the best-effort could be much less than the 10 minute SLO
[16:41:49] * ebernhardson kind of wants to rename TargetDocument::pageTitle to TargetDocument::prefixedPageTitle to indicate it has the namespace prefixed to it, but not sure it's really worth the trouble mucking with serde
[16:44:04] it would better mimic mediawiki though, which accesses that value through Title::getPrefixedDBKey or Title::getPrefixedText depending on the format
[17:29:18] * ebernhardson separately wishes we had used a single format for titles everywhere in the search index and did any changes with analysis chains instead of having 3 or 4 different forms of title..
[17:37:32] lunch, back in ~40
[18:17:13] back
[18:27:40] Little late to pairing
[18:28:10] ACK
[20:16:04] * ebernhardson is not having great luck understanding why the redirects array isn't clearing out bad redirects when we do a full-document update
[20:40:59] how weird... thought i would verify what is landing on the elasticsearch servers to see if anything silly is there, so captured 60s of localhost:9200 traffic on cloudelastic1005 (since update requests are less spread out in a small cluster). Not seeing a single _bulk request in 60s
[20:41:29] but it does have the saneitizer queries (which come from mw side)
[20:48:09] oh, duh. It's because we batch things into large bulks. I was still thinking old-style where we sent a bulk per update. Capturing another 60s found one
[20:53:41] oh the answer was also painfully obvious, but i feel like this must have been an intentional decision... we don't turn on the flag to collect redirects during a normal document rebuild.
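(Editor's note: a minimal sketch of the kind of 60-second capture described above, assuming plain HTTP on the loopback interface and that tcpdump and timeout are available on the host; the pcap path and duration are illustrative.)
```
# Sketch: capture ~60s of localhost:9200 traffic, then count how many
# bulk indexing requests appear in the captured payloads.
sudo timeout 60 tcpdump -i lo -s 0 -w /tmp/es-9200.pcap 'tcp port 9200'
sudo tcpdump -r /tmp/es-9200.pcap -A 2>/dev/null | grep -c '_bulk'
```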
[20:57:36] ryankemper wdqs2024 failed mid-reimage even after FW updates. Best guess is that it's a disk issue, as I found T345542
[20:57:39] T345542: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542
[21:01:12] aww, it almost lasted a whole year
[21:14:16] ahh, sadly my answer wasn’t so simple. We do request the redirects, and they make it into the raw fields, afaict. But the bulk request with 22 updates doesn’t have a single "redirect" field, even though many are page rerenders... :(
[21:39:48] :q
[21:43:33] ?!
[22:14:34] ryankemper Second reimage failed, I sent the ticket above back to DC Ops to look at the HW
[22:34:26] inflatador: ack. do we see any indications in the logs or is it just an educated guess from that last ticket
[22:38:26] ryankemper just an educated guess. I checked the SEL from the DRAC web ui and didn't see anything new, but that doesn't completely rule it out
[22:38:49] I did see the installer start and get pretty far along, then randomly freeze
[22:39:00] so it's not the typical PXE boot failure
[22:39:05] makes sense
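(Editor's note: a minimal sketch of how the suspected RAID degradation on wdqs2024 (T345542) could be checked from a shell, assuming root access on the host; the md device path and the member disk path are assumptions.)
```
# Sketch: quick checks for a degraded software RAID array.
cat /proc/mdstat                # look for [U_] or missing members in the summary
sudo mdadm --detail /dev/md0    # per-member state; watch for "removed"/"faulty"
sudo smartctl -H /dev/sda       # SMART health of a suspect member disk (path assumed)
```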