[08:01:40] * pfischer will only be able to join after lunch
[10:25:45] errand+lunch
[12:07:39] dcausse when you have a moment, could you have a look at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1186 ? (no rush!)
[12:51:13] gehel: sure
[12:51:46] err meant gmodena ^
[12:51:54] dcausse thanks :)
[12:58:20] I'm perhaps missing something but we don't have any sensors on query_clicks_daily?
[13:06:51] gmodena: lgtm, feel free to "resolve" the thread I started, that's probably a question to be addressed in a separate patch
[13:17:04] dcausse ack. I'll adjust the exec time before merging.
[13:17:04] dcausse mmm actually maybe I don't follow. Let me reply in thread
[13:19:32] gmodena: basically instead of training between [now-30days, now] we train between [now-7days-30days, now-7days]
[13:20:19] dcausse yeah, sorry. Was having problems context switching :)
[13:20:25] np! :)
[13:21:07] gmodena: I think this is a broader problem due to the lack of sensors, I have no objections to merging this patch and addressing this issue later
[13:21:48] I added a schedule_period (7 days) to the start-date calculation
[13:21:52] something like this:
[13:21:53] (execution_date - macros.timedelta(days=schedule_period + training_window)).strftime("%Y-%m-%d")
[13:22:18] yeah, lack of sensors is a bit iffy (good catch!)
[13:22:30] gmodena: yes that would be the period I expect
[13:23:02] this is what fixtures now report: --start-date 2022-11-29 --end-date 2023-03-06
[13:23:25] (i find these time macros in airflow also a bit iffy sometimes)
[13:23:38] :/
[13:25:10] if it's ok with you i'll merge with this change and monitor what happens on the two failed runs
[13:25:27] i'll need a f/up anyway to bump mjolnir 2.7.0.dev -> 2.7.0
[13:26:47] dcausse just to clarify, if I look at historic runs, the interval looked ok: https://airflow-search.wikimedia.org/dags/mjolnir_weekly/grid?dag_run_id=scheduled__2025-03-13T00%3A00%3A00%2B00%3A00&task_id=query_clicks_ltr
[13:27:15] the issue here is that we might not have the most recent 7 days of data, because of the lack of sensors?
[13:30:34] gmodena: the lack of sensors will definitely make this a bit unreliable if we want the most recent data; the new filtering you added slightly changes the previous behavior of mjolnir, which trained on whatever most recent data was available
[13:32:38] I think it's fine (but would prefer ebernhardson to ponder), and this could wait until we have proper sensors if we want to train with the most recent 7 days
[13:34:28] dcausse ack. I'll hold off on testing till later today/tomorrow.
[13:34:52] and I mean, I won't vanish next week... should be around if work spills over :)
[13:36:15] gmodena: yes, thanks and sorry for the confusion!
[13:36:23] no worries at all!
[13:39:58] lunch+meetings
[13:51:55] inflatador: perhaps you missed my ping on -operations but we're still serving search traffic from codfw (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1131359 has not been merged yet)
[13:54:32] dcausse: FYI your rdf stream config change is for eventgate-main. Usually, eventgate-main would need a restart. But in this case the configs are about consumer stuff only, and eventgate ignores that part.
[13:58:49] ottomata: yes, thanks, that's what I was assuming
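To make the 13:19–13:23 exchange concrete, here is a minimal sketch of the shifted training window. This is not the actual airflow-dags code; the function and parameter names are illustrative only, with schedule_period=7 and training_window=30 as discussed above.

```python
# Sketch only -- not the real airflow-dags code. Shows the effect of adding
# schedule_period to the start-date calculation: without sensors, the DAG
# trains on [now - 7d - 30d, now - 7d] instead of [now - 30d, now].
from datetime import datetime, timedelta

def training_window_bounds(execution_date: datetime,
                           schedule_period: int = 7,
                           training_window: int = 30) -> tuple[str, str]:
    """Return (--start-date, --end-date) strings for the training job."""
    end = execution_date - timedelta(days=schedule_period)
    start = end - timedelta(days=training_window)
    return start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")

# training_window_bounds(datetime(2023, 3, 13)) -> ("2023-02-04", "2023-03-06")
```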
[14:03:29] dcausse ACK, sorry was in pairing. I need to create/merge more puppet patches before actually doing any migration stuff, so no worries there. I can ping you before I start the rolling operation if you like
[14:04:04] I can also shut off CODFW using confctl if you'd like, but I didn't think it'd be necessary
[14:10:57] \o
[14:11:30] .o/
[14:11:57] o/
[14:13:07] inflatador: totally my bad, the first patch I wrote to switch search traffic to eqiad was completely broken
[14:13:50] the windows are all awkward for california deploys these days, but i can ship the patch to move traffic at 1pm (~6 hours from now)
[14:13:56] it's basically the only window in california hours :P
[14:14:25] i think that gets better when daylight changes in eu next week, with the other window coming up to 7am
[14:14:29] it's all good...we have plenty of capacity in CODFW so I don't think it'd be a problem. If things go really badly we can always depool via confctl
[14:15:06] First I need to figure out how to do find/replace for a single file only in RubyMine ;P
[14:15:16] sed :P
[14:16:34] sometimes the simple tools are best
[14:17:05] I can focus on the debian pkg build instead if y'all'd prefer
[14:17:31] we should probably release that before installing more opensearch instances, if only so they don't have to be restarted to get the new package
[14:17:37] although the new package is really more for testing than prod
[14:17:47] (it's vector support)
[14:18:32] cool, let me give that a whirl
[14:19:44] * ebernhardson notices now that david wrote the same patch for WriteClusters in beta...i just didn't notice it
[14:48:41] np! I think I was a bit overcautious with all this :)
[14:59:57] the redirects question is curious....personally i really dislike the idea of `excluderedirects:true` or something along those lines, not from the search perspective but the design of a query language perspective. But maybe i'm being overly pessimistic :P
[15:00:22] but i see where users come from, because the things we wanted to do (like proper parens and bools) don't exist
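As a thought experiment on the `excluderedirects:true` idea above: if redirects were ever indexed as standalone documents (see the discussion that follows), a keyword like that would roughly compile down to an extra bool clause. This is purely hypothetical; the `is_redirect` field below does not exist in the current cirrus mappings.

```python
# Hypothetical sketch only: what an "excluderedirects:true" keyword might add
# to the compiled query IF redirects were indexed as their own documents.
# "is_redirect" is an invented field name, not part of today's cirrus schema.
def apply_redirect_filter(query: dict, exclude_redirects: bool) -> dict:
    """Wrap an existing query dict so redirect docs are filtered out."""
    if not exclude_redirects:
        return query
    return {
        "bool": {
            "must": [query],
            "must_not": [{"term": {"is_redirect": True}}],
        }
    }

# apply_redirect_filter({"match": {"title": "misspelling"}}, exclude_redirects=True)
```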
[15:04:11] ebernhardson: yes... there's also the possibility to have a "god" keyword like "local:" that must appear at the beginning of the search query, e.g. "onlyredirects: intitle:this hastemplate:mispelling"
[15:04:25] but the hard part is indexing redirects as first-class citizens
[15:04:42] yea that is a whole separate thing...i can imagine how it might be done but not sure of the consequences
[15:04:58] we would essentially index them both as pages and as part of the target page, then filter in most cases
[15:05:12] but that has knock-on effects that i can't imagine off the top of my head
[15:05:56] a prefixed keyword might be ok for redirects, i suppose it does save having to try and fit that all into the MW side of SearchEngine
[15:05:57] from the comments I feel that if we provide simple keyword variations like inredirect:sometext intitleonly:sometext that might solve only a tiny fraction of the issues they raised
[15:06:26] yea i think they want things like source or categories with the redirect pages
[15:06:31] yes
[15:06:54] it would also probably explode the content index doc count, unsure on size
[15:07:05] maybe acceptable
[15:07:18] but yes I agree, hard to guess what would happen, there are many places where we just assume that the results are directed at article pages
[15:08:59] I think we might need to clarify all this in the ticket, it's very probable that users don't realize how and why cirrus docs are shaped this way
[15:09:43] yea it's probably not obvious
[15:18:11] errand, back in ~15
[15:29:07] latencies are bad in codfw
[15:29:46] should we do an emergency deploy of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1131359 ?
[15:36:24] back
[15:36:34] dcausse I can depool via confctl if you like
[15:37:07] inflatador: what would you depool?
[15:37:52] https://wikitech.wikimedia.org/wiki/Conftool#Depool_all_nodes_in_a_specific_datacenter
[15:38:14] inflatador: I'm not sure that works for search?
[15:38:20] inflatador: i don't think that would help, mw in codfw would still be trying to talk to the LVS endpoint, there just wouldn't be any nodes to talk to
[15:38:41] the problem is we don't have a global discovery endpoint, i started work on that some time ago but it got sidelined for some reason (maybe it wasn't working right? i forget)
[15:39:23] oh I remember, it was blocked by translate but I think that got fixed in the meantime
[15:39:38] so is the problem that mw in EQIAD is only talking to CODFW?
[15:39:42] oh, maybe. That should probably come back on the list
[15:39:47] +1
[15:39:55] inflatador: no, eqiad talks to eqiad, codfw to codfw. If we depool all the codfw nodes then codfw talks to nothing
[15:40:10] we have a patch that will make codfw talk to eqiad, but it has to go through a mediawiki deploy
[15:40:20] i think we should check with releng and ship it
[15:40:37] cirrus does not use any discovery mechanism because it needs to talk to both clusters
[15:40:42] I guess I don't understand why latency is bad in CODFW then. Maybe we have another host that's bogged down with hot shards?
[15:41:09] I banned row A in CODFW, let me unban it
[15:41:10] inflatador: you banned a full row
[15:41:19] i'm not sure either, but since we already prepped the patch it seems easiest
[15:41:24] inflatador: no I think that mw-config patch will solve it
[15:42:56] dcausse ACK. I banned at 13:54, looks like the latency started going up at 15:20 or so?
[15:43:12] inflatador: that's the time it takes for the shards to move out
[15:43:15] I'm gonna unban regardless since it doesn't seem like we'll be ready to start anytime soon
[15:45:43] syncing out now to move traffic to eqiad
[15:45:50] nodes are unbanned now
[15:46:23] i think you can keep them banned, the traffic should disappear in a few minutes
[15:47:02] yes, it's fine to keep them banned for when you want to start the upgrade
[15:49:06] I need to push a pretty major patch before I get started anyway, I'll ban again once we're closer
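For readers unfamiliar with the "ban"/"unban" mentioned above: excluding a row (or node) is normally done through Elasticsearch/OpenSearch shard-allocation filtering. A rough sketch follows; the endpoint URL and the `row` node attribute are assumptions here, and in practice this is wrapped by WMF tooling rather than called directly.

```python
# Sketch of banning/unbanning nodes via cluster-level shard allocation
# filtering. Assumes nodes expose a custom "row" attribute (node.attr.row)
# and that the cluster endpoint below is correct -- both are assumptions.
from typing import Optional
import requests

CLUSTER = "https://search.svc.codfw.wmnet:9243"  # assumed endpoint

def ban_row(row: Optional[str]) -> None:
    """Exclude a row from shard allocation; pass None to clear the ban."""
    settings = {"transient": {"cluster.routing.allocation.exclude.row": row}}
    resp = requests.put(f"{CLUSTER}/_cluster/settings", json=settings, timeout=10)
    resp.raise_for_status()

# ban_row("A")   # shards start draining off row A (this takes a while)
# ban_row(None)  # unban
```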
[15:51:41] * ebernhardson is going to have to unsubscribe from data-engineering-alerts...was curious for a bit but it's spammy. But it did make me feel like our problems with airflow jobs failing are minor :)
[15:52:10] :)
[15:55:42] curiously, lists.wikimedia.org shows me as subscribed in the list of things, but then clicking through it gives me the option to subscribe instead of unsubscribe :P
[15:56:26] but clicking another one like discovery-alerts, i have an unsubscribe button. It doesn't want me to leave :P
[15:57:53] deploy finished, cirrus traffic should be shifted now
[15:58:03] thx!
[15:58:22] qps is declining in the percentiles dashboard (it's a rate over 5m or something), so it looks to be working as expected
[15:58:41] yeah, sorry for slowing down the search, I didn't think losing a row would have that much of an impact, esp. in CODFW since we have 5 extra hosts ATM
[15:58:56] i didn't think a row would have that effect either, it's a bit surprising
[15:59:20] it does seem sensible still to migrate an entire row before allowing shards onto the opensearch instances, due to the primary->replica thing
[15:59:58] elastic2114 seemed to have struggled a bit so it's possible that only one machine caused the overall slowdown
[16:00:16] i really want to know how to solve that one-host-struggling issue :(
[16:00:41] yeah, definitely. ES/OS is so good at horizontal scaling in general
[16:00:53] but my only ideas involve a bunch of work to test and see if maybe it helps :P
[16:01:16] We might be able to do that once we have some larger VMs
[16:03:01] retro?
[16:04:10] oops
[16:36:43] FYI, looks like elastic2114 was CPU bound: https://grafana.wikimedia.org/goto/khOf6hTNR?orgId=1 I pushed up a patch to enable the perf governor, subject to DC Ops approval: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131775
[16:38:49] even still, a little more cpu might help the nodes hold a little more load before getting into trouble, but to me the real problem is that the cluster doesn't route around a single node that's in trouble
[17:05:27] completion indices are created, opensearch in deployment-prep now has 200 indices vs 201 in elastic, I think that's good enough
[17:05:40] awesome, yes that sounds close enough
[17:19:15] back
[17:19:44] it appears we do have some visibility into adaptive replica selection, it might be interesting to record http://search.svc.eqiad.wmnet:9243/_nodes/stats/adaptive_selection somewhere when we hit the situation where a node fills its search threadpool queue. Essentially to try and figure out if ARS is appropriately noticing the node is in trouble
[17:20:08] not sure that it would give any actionables, but ARS is the thing that is supposed to prevent the problem of one node getting overloaded, so it might allow some analysis into what's wrong
[17:20:16] NICE!
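A minimal sketch of how that adaptive_selection snapshotting could look: just fetch the endpoint quoted above and append a timestamped JSON line. Where such snapshots should really live (Prometheus, a cron dump, etc.) is left open; local JSONL output here is an assumption.

```python
# Sketch: snapshot _nodes/stats/adaptive_selection so its per-node rank /
# queue / service-time estimates can be compared later against the moment a
# node filled its search threadpool queue.
import json
import time
import requests

URL = "http://search.svc.eqiad.wmnet:9243/_nodes/stats/adaptive_selection"

def snapshot(path: str = "adaptive_selection.jsonl") -> None:
    stats = requests.get(URL, timeout=10).json()
    with open(path, "a") as fh:
        fh.write(json.dumps({"ts": time.time(), "stats": stats}) + "\n")

# Run periodically (cron/systemd timer), or ad hoc while a node is hot.
```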
[17:41:14] random things: For T344371, global-search would like to offer many/all of the keywords cirrus offers. They could probably get away with using &cirrusDumpQuery and extracting the __main__.query field, how terrible would that be?
[17:41:14] T344371: Add all CirrusSearch filters to Global Search - https://phabricator.wikimedia.org/T344371
[17:42:58] i suppose some keywords require a rescore (prefer-recent, maybe others), but i would want to avoid suggesting they bring rescores over
[17:43:26] ebernhardson: probably good enough? unsure if they'll need to manipulate the resulting query to drop anything that would filter too much based on the current wiki
[17:44:03] issue is that they need to target a specific wiki and some keywords are wiki specific
[17:44:33] hmm, are there wiki-specific queries other than wikibase?
[17:44:53] deepcat might be
[17:45:10] anything that hits the db at parse time
[17:45:16] oh, indeed
[17:45:26] I don't think there are many
[17:46:11] but it's probably easy enough to implement and see?
[17:46:38] might be a bit "fragile" since it relies on the shape of the msearch request
[17:47:06] yea, the queries aren't really intended to run everywhere but hoping it will be acceptable
[17:47:28] dinner
[17:47:29] i suppose namespace filters are also a thing
[17:47:34] indeed
[17:52:42] OK, the perf governor patch is merged. As Erik said, it's not really a solution. But it gives us a bit more headroom
[18:01:47] * ebernhardson makes some updates to the madvise c program...and is surprised to re-learn that you don't get booleans in C unless you include "stdbool.h"
[18:05:27] * ebernhardson separately wonders who is supposed to code review C :P
[18:12:11] lunch, back in ~1h
[18:40:20] random stats: enwiki contains just shy of 14 million redirects
[19:00:01] curiosities though...iterating the `allredirects` api generator gets ~7k redirects from mediawiki.org. sum(size('redirect')) from discovery.cirrus_index gives ~20k. `select count(1) from redirect` on the analytics replica dbs gives 80k
[19:00:27] i suppose the replica table has interwiki entries, but still
[19:04:25] back
[20:39:55] back from lunch
[21:02:01] who has two thumbs and should **not** review C? ;P
[21:02:10] :P
[21:02:22] just paste it into chatgpt, if it's happy +2. What could go wrong? :)
[21:02:41] inflatador: few mins
[21:03:20] ryankemper ACK
[21:03:37] * inflatador segfaults
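On the redirect-count discrepancy at 19:00, a sketch of what iterating the `allredirects` API can look like, using standard Action API continuation. The helper below is illustrative only; API defaults such as querying a single namespace are worth checking before comparing against the `redirect` table or the cirrus docs.

```python
# Sketch: count redirect pages on a wiki via the Action API (list=allredirects).
# Defaults (single namespace, no interwiki rows) are one plausible source of
# the ~7k vs ~20k vs ~80k gap noted above, but that is unverified here.
import requests

def count_redirects(api_url: str = "https://www.mediawiki.org/w/api.php",
                    namespace: int = 0) -> int:
    session = requests.Session()
    params = {
        "action": "query",
        "list": "allredirects",
        "arnamespace": namespace,
        "arlimit": "max",
        "format": "json",
    }
    total = 0
    while True:
        data = session.get(api_url, params=params, timeout=30).json()
        total += len(data["query"]["allredirects"])
        if "continue" not in data:
            return total
        params.update(data["continue"])

# count_redirects()  # main namespace only; loop over namespaces for a full count
```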