[09:11:27] o/
[09:15:35] o/
[09:16:36] gmodena: if you have a sec I have a quick artifact version bump: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1113
[09:21:10] err wait the build is failing, sorry, looking
[09:21:30] dcausse ack
[09:25:41] dcausse would you have time for a quick chat sometime today? I was looking at the board with gehel, and wanted to ask you for some input on what to pick up next
[09:25:52] gmodena: sure
[09:27:50] wondering if we could configure a gitlab project to force MRs to be on top of the main branch before merging
[09:27:58] dcausse thanks!
[09:28:15] dcausse do you mean like an auto-rebase?
[09:28:29] gmodena: yes auto-rebase but before merging
[09:28:57] to force CI to run after a rebase
[09:29:24] well... that won't solve all the cases I suppose...
[09:30:20] I mean the problem of test fixtures not being 100% guaranteed to be up-to-date if the rebase or fast-forward happens after CI
[09:30:22] we could add the rebase as a CI step/job
[09:31:04] ah! the fixture / snapshot test bits trip me up often :(
[09:31:17] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1114
[09:31:39] yes, it's hardly visible... perhaps a post-merge build is possible?
[09:31:45] at least to raise visibility
[09:33:06] Yeah, that'd be nice. This is probably something that the airflow platform should take care of for us
[09:44:33] finally green https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1114
[10:02:37] dcausse ack. Somehow Gitlab only shows changes to fixtures
[10:03:28] gmodena: oh sorry I was not very clear.. I made a separate MR for the fixture fix that I self-merged
[10:04:05] yep. Just saw
[10:04:08] I just rebased the search patch: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1113
[10:04:10] i need more coffee
[10:05:52] :)
[10:56:49] lunch
[13:08:40] ebernhardson: would you be able to have a look at T381909? ryankemper might be able to help
[13:08:41] T381909: WCQS updates for miscweb migration to k8s - https://phabricator.wikimedia.org/T381909
[13:08:55] o/
[13:20:54] lunch+errand
[13:39:15] o/
[13:59:08] \o
[14:04:23] .o/
[14:11:53] o/
[14:36:46] o/
[14:38:45] ebernhardson: I missed the `list of`/prefix discussion yesterday! We've talked about that before, and I think we predicted `John` might make the list, too—though there are a lot more names than I expected. Cool that you got a first pass at it!
[14:38:55] (In addition to there being fewer women with wiki pages, in the US at least, male names have been historically less varied than female names. Still, I'm surprised `Mary` didn't make the list! The page for the given name has almost 3K names that start with Mary, so it is up there.)
[14:39:00] It would be interesting to look for redirects that are suffixes of the actual title and compute the dropped prefix from that. I assume that would get rid of most of the names. It might not work on smaller wikis with a less robust set of redirects, though. Determining prefixes by any method in spaceless languages would be... interesting... Cool stuff!
[14:41:22] is https://global-search.toolforge.org/ an "official" search platform tool?
[14:42:28] i was talking with tchin earlier today about support for cross-wiki search, and the only path I could think of was using Wikidata's APIs
[14:42:36] then I found global-search
[14:49:13] gmodena: not the ui but it's powered by cloudelastic which we own
[14:49:39] cool :)
[14:52:01] we do some kind of limited cross-wiki search on wiki: cross-project searches to show results from "sister" projects (wikipedia, wikisource, ...) and cross-language as a "fallback" when there are not enough results and when the query string looks like another language
[14:56:52] Are cross-language fallbacks something end users can enforce, or are they purely determined backend-side?
[14:59:07] gmodena: no it's only determined by the backend as part of a "second try" strategy
[14:59:44] dcausse ack. Thanks for clarifying!
[15:02:40] workout, back in ~40
[15:04:38] fyi, logstash is now more restricted, there's a new logstash-access group that you've been added to automatically if you accessed logstash in january
[15:07:23] ya i saw the emails for that too, somehow i lucked into still having access
[15:07:32] although i'm sure i didn't use it in january
[15:08:19] did not see any emails (just a thread in slack) but I'm in there too...
[15:12:25] seeing people confused by the "ores" name in the weighted_tags, wondering if we should start thinking about migrating those to a better name...
[15:13:09] transition might be somewhat painful but perhaps not impossible
[15:14:43] hmm
[15:15:03] somehow i thought that was all replaced with articletopic / drafttopic / etc.
[15:15:49] the prediction actually comes from new models but we kept the same tag name as the destination
[15:16:40] I might file a task to think about a transition process for the rename
[15:16:47] yea i suppose we should
[15:44:28] cleaning up some datasets (to re-enable the drop-daily dag) but wondering if I should just wait for gmodena's patch to be merged, we have query_clicks since dec 9
[15:45:30] I mean since october 18 sorry
[15:45:52] oh wow, that does go back a decent bit. I wonder if that's part of why training needed more resources? I think it just reads the whole table
[15:46:03] oh
[15:46:06] could be
[15:46:58] one problem would be it doesn't have the redactions done there though
[15:46:58] dcausse the only problem i see is that that click data has not been anonymized for long-term storage
[15:47:04] yea
[15:47:06] yes :/
[15:47:07] :D
[15:47:45] hm... should we bother backfilling with the data we still have or simply wait again?
[15:48:10] it wouldn't be terribly hard to copy that data out to a temp table, then repopulate it with the filtered data i suppose?
[15:48:18] bit of a for loop and repeating per day
[15:48:41] the aggregation step, on day of data, is pretty fast
[15:48:53] *one day
[15:49:27] can you run an update with set q_by_ip_day = if q_by_ip_day < 50 then 1 else 50?
[15:49:47] I mean can spark-sql do that?
[15:50:00] in terms of backfilling, the source data is probably gone. I'm not sure if parquet data can be updated, but what you can do is `create table ebernhardson.foo as select * from source_table where year=x and month=y and day=z`
[15:50:10] and then a second query to `insert overwrite partition ....`
[15:50:21] ok
[15:50:33] there's no upsert in hive without iceberg
[15:50:46] q_by_ip_day should be easy to reconstruct
[15:51:53] the salted session id a bit less, I'm afraid. Salt is rotated every 8 hours and I don't think we have granular enough info in daily_clicks
[15:54:29] hmm, sounds like it might be possible with some effort, but not clear if it's worth the effort?
[15:54:44] well, the salting would perhaps even need some alternate approach
[15:54:52] (just for the historical data)
[15:54:55] ah forgot about the salt...
[15:55:20] i suppose that also means, the 13 month retention should start from the day the changed aggregation is applied, and not from what's in storage already
[15:56:25] or we need to develop the process to also apply the same restrictions to everything currently in the query_clicks table
[16:00:36] mmm... i wonder how much of an issue it would be to keep what we have in storage now (3 months cutoff) together with the new anonymized data, till it is eventually phased out. It should not be possible to correlate old <> new sessions
[16:07:07] Trey314159: curiously, sudachi 20250129 doesn't have the same hash for system_core.dic as latest. It's a binary file though so i'm not sure how to see what is different.
[16:07:27] (random oddity, probably not worth figuring out :P)
[16:08:56] gmodena: hmm, it's actually more than 3 months right now. david mentioned something with drop_old_data_daily wasn't working right and it currently goes back to mid october. In terms of re-anonymizing the data, i imagine we could do something simple like pull a random number and use that to salt everything in a one day partition? As long as it's discarded it should be fine
[16:09:20] ebernhardson: that is pretty weird. Maybe there's some metadata in the zip file that's different?
[16:09:58] thinking some script that copies a day of data to a temp table, then overwrites that day with the q_by_ip_day changes and hashes the session ids with a random number used for that day
[16:10:23] Trey314159: it's not even the .zip that's different though (probably that too), but the uncompressed .dic
[16:10:46] Trey314159: overall though i think you are right and we should be specific, not letting that randomly change
[16:11:31] ebernhardson: Using the "latest" link from the parent page, I got "sudachi-dictionary-20240409-core.zip", which is.. uh, not the latest!
[16:11:49] lol, i didn't even think to check that. makes sense :)
[16:11:57] ebernhardson mmm.. that could work! We would end up with a few sessions salted over a 24-hour period, but possibly it won't impact training too much
[16:12:46] session ids should differ every session_timeout_sec (1800sec) anyways no?
[16:13:19] dcausse: sorta, the session id is something like `year_month_day_identity_session-num`
[16:13:32] basically it's not random at all, except that identity is already a hash
[16:13:45] ebernhardson: shall we pick the 20250129 dict? I'm updating my Sudachi from 3.0.0 to 3.3.0 and I'll grab the 20250129 dict for my pre-OpenSearch baseline.
[16:14:02] Trey314159: yes that makes sense, i just updated the patch to be explicit about using that dict
[16:14:09] but re-hashing that with a random number should obfuscate everything no?
[16:14:15] dcausse: yes it should
[16:14:18] ebernhardson: Ahh, cool. Thanks!
[16:17:03] I can try to work on a quick notebook to salvage that historical data while we have it
[16:17:43] yea seems reasonable, that gets a couple months head start
[16:18:38] I'll add a note to the ticket, but we probably also want to adjust mjolnir to only read 3 months of data so it doesn't keep breaking on us while building up the additional data? Or do we want to let it break and scale it as necessary?
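(A rough sketch of the salvage script discussed above, one day per iteration: copy the partition out to a scratch table, bucket q_by_ip_day, and re-hash session ids with a throwaway per-day salt. The table and scratch names are placeholders guessed from the conversation, not the real schema.)

```python
import secrets

from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .enableHiveSupport()
    # overwrite only the partitions present in the data being written
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

SOURCE = "discovery.query_clicks"  # placeholder table name


def reanonymize_day(year: int, month: int, day: int) -> None:
    """Re-apply the retention restrictions to one day of historical click data."""
    # per-day salt, never persisted; once the job exits, the original
    # session ids are unrecoverable
    day_salt = secrets.token_hex(16)

    one_day = spark.table(SOURCE).where(
        (F.col("year") == year) & (F.col("month") == month) & (F.col("day") == day)
    )

    # 1. copy the day out to a scratch table (no upsert in hive without iceberg)
    backup = f"scratch.query_clicks_{year}_{month:02d}_{day:02d}"
    one_day.write.saveAsTable(backup)

    # 2. bucket the per-IP daily query count and re-salt the session ids,
    #    then overwrite the source partition
    cleaned = (
        spark.table(backup)
        .withColumn(
            "q_by_ip_day",
            F.when(F.col("q_by_ip_day") < 50, F.lit(1)).otherwise(F.lit(50)),
        )
        .withColumn(
            "session_id",
            F.sha2(F.concat(F.lit(day_salt), F.col("session_id")), 256),
        )
    )
    cleaned.write.mode("overwrite").insertInto(SOURCE)

    spark.sql(f"DROP TABLE {backup}")
```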
[16:19:47] Trey314159: i suppose i never asked, should we care about -core vs -small vs -full dict?
[16:19:56] i just randomly chose -core because that's what they document as the default
[16:25:16] ebernhardson dcausse +1 for having bounds on training data. I can do that as part of T360536 if you'd like
[16:25:17] T360536: Increase retention of training data - https://phabricator.wikimedia.org/T360536
[16:25:22] ebernhardson: we somehow need to adjust the partitions read by mjolnir on a per-wiki basis?
[16:25:36] 3 months for enwiki, more for the rest?
[16:25:55] gmodena: thanks!
[16:26:43] or perhaps it was just for new wikis?
[16:26:59] I mean hewiki could certainly benefit from more data
[16:27:06] dcausse: hmm, yea perhaps discarding enwiki data older than 90 days is enough? Part of my goal with the longer retention was to bring in more wikis, but it seems likely existing wikis could also benefit from more data
[16:27:30] ebernhardson: yeah, core is the right dictionary
[16:27:42] and actually thinking about it
[16:27:50] https://people.wikimedia.org/~ebernhardson/T377128/T377128-AB-Test-Metrics-WIKI=hewiki.html is with 3 months of data
[16:28:30] https://people.wikimedia.org/~gmodena/search/mlr/ab/2025-02/T385972-AB-Test-Metrics-WIKI=hewiki-EXPERIMENT=mlr-2025-02.html is with 5 months of data if I'm not mistaken
[16:29:09] more data is already showing improvements at least on hewiki
[16:29:27] dcausse: indeed, a quick look at engagement and the interleaved both suggest improvements in the 5-month data
[16:30:52] errand+cooking, back later tonight
[16:58:43] dinner+kiddos. I'll be back online later
[17:10:27] starting on cloudelastic1011... it's a brand-new Supermicro host, so hopefully no hardware gremlins
[17:27:50] * ebernhardson wonders if moving completion to a dedicated service that lives in k8s would also potentially allow easier decisions around scaling up the size of completion indices via subphrases
[17:28:45] as in, maybe instead of providing (List of Foo, Foo) as variants, we could enable subphrases directly and get (List of Foo, of Foo, Foo)
[17:42:27] (and maybe stopwords could drop `of`)
[17:44:52] there's also the defaultsort option we never enabled that would inject "Trigonometric identities" as a candidate for "List of trigonometric identities" (https://en.wikipedia.org/wiki/List_of_trigonometric_identities?action=cirrusDump)
[17:45:11] only enabled on a couple wikis where it was explicitly requested
[17:45:25] cloudelastic1011 looks like it's working... heading to lunch, back in ~40
[17:45:40] \o/
[17:46:17] dcausse: oh interesting, it does seem like defaultsort should be reasonable to include everywhere
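(For illustration, a toy version of the subphrase + stopword idea above; a hypothetical helper with an assumed stopword list, not how the completion suggester actually builds its candidates.)

```python
# assumed stopword list, purely for illustration
STOPWORDS = {"of", "the", "a", "an"}


def subphrase_variants(title: str, drop_stopwords: bool = True) -> list[str]:
    """Emit each word-suffix of a title as a completion candidate."""
    words = title.split()
    variants = [" ".join(words[i:]) for i in range(len(words))]
    if drop_stopwords:
        # drop candidates that merely start with a stopword, e.g. "of Foo"
        variants = [v for v in variants if v.split()[0].lower() not in STOPWORDS]
    return variants


print(subphrase_variants("List of Foo", drop_stopwords=False))
# ['List of Foo', 'of Foo', 'Foo']
print(subphrase_variants("List of Foo"))
# ['List of Foo', 'Foo']
```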
[17:46:54] ebernhardson: sadly not that simple... some wikis use defaultsort with some strings that have nothing to do with the title :/
[17:47:02] enwiki I'm pretty sure that would work
[17:47:06] * ebernhardson hadn't realized how many redirects exist on list of trig identities
[17:47:25] ahh, hmm :(
[17:47:53] T145427#3515817
[17:47:54] T145427: Testing needed for the add of DEFAULTSORT keys to wiki search autocomplete - https://phabricator.wikimedia.org/T145427
[17:48:20] "Chinese is not a phonetic writing system and there's a tradition on zh.wp to use the first few letters of the phonetic transcription of the title as a sort key"
[17:49:35] I mean the code is there and working, I'd be for enabling it wherever we think it's acceptable
[17:50:18] yea i see that, i'm not sure how to know which wikis it would be reasonable to enable on. Probably fine on enwiki, but that's narrow
[17:50:42] although, i'm sure my common prefix finding would also do interesting things on zhwiki that we can't even interpret without help :P
[17:51:49] yes... always dangerous to enable these kinds of things everywhere...
[17:56:07] iirc the defaultsort thing was mainly requested for cases like searching by lastname first, not really for "list of", but could also cover that
[17:56:48] T331719
[17:56:49] T331719: When searching by keyword, results sorted by relevance should prioritize family names in the title: please improve search results for articles with DEFAULTSORT - https://phabricator.wikimedia.org/T331719
[17:57:01] yea that makes sense, i wonder how often defaultsort is even set. I suppose part of the problem is defaultsort isn't guaranteed by much, although i wouldn't be surprised if bigger wikis had some bot-ish things that go around looking for them and setting them
[17:58:31] no clue :/
[18:21:01] * ebernhardson is separately getting nowhere in tracking down the PCRE JIT failure... but the script does run 8 times per hour and only triggered once on march 9th
[18:21:31] but maybe those executions don't always trigger the queries that then invoke the pcre jit
[19:04:06] resolution: wasn't related to JIT, the host simply ran out of memory
[19:05:26] {◕ ◡ ◕}
[19:06:50] ugh.... getting puppet failures related to systemd tmpfiles again. Seems to be a race condition during the initial puppet run
[19:09:51] `Mar 11 18:48:53 cloudelastic1011 systemd-tmpfiles[1013]: /etc/tmpfiles.d/opensearch-cloudelastic-chi-eqiad.conf:1: Failed to resolve user 'opensearch': No s>`
[19:13:23] sigh, yea looks like we need some extra ordering dependencies somewhere
[19:14:58] Our puppet systemd module claims to run `systemd-tmpfiles --create` as root ( https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/systemd/manifests/tmpfile.pp#38 ), but it doesn't happen every time
[19:16:20] I wonder if there's a way to just tell it to do that action every time and not report a change
[19:17:05] inflatador: i think that's what `refreshonly => true` does, it only triggers if something else tells it to
[19:17:19] probably via the subscribe? Although i would have to double check puppet docs
[19:17:54] yeah, I've had this pain with Ansible in the past. You set a notifier (or I guess it's called 'subscribe' in puppet world) and it makes the change, but never runs the notifier because a different task fails
[19:18:26] then when you run it again, it doesn't detect a change and it never runs your notifier
[19:19:34] also i would guess that it does always run as root, the problem is those conf files refer to the opensearch user
[19:20:09] whatever defines the content of those conf files to point at the opensearch user likely needs a dependency on User['opensearch']
[19:20:27] yeah, it's weird though... I can run puppet after the package installs and the user exists, and it still fails
[19:21:15] FWIW I think our unit files are also trying to create those tmpfiles, but running as user opensearch, which doesn't work
[19:21:24] so yeah, a few things to fix
[19:21:38] hmm, i suppose i would start with adding the user dep at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/opensearch/manifests/instance.pp#305 but indeed that might not be enough
[19:21:57] really that whole class depends on the user though, maybe that's not the right place
[19:25:25] rebooting is enough to "fix" it, since the systemd-tmpfiles stuff runs automatically at boot
[19:27:05] unsatisfying fix :P
[19:28:14] Agreed... working on it ;)
[19:28:37] also found T328674 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/885372
[19:28:37] T328674: Revise elastic/open search and its /run + tmpfiles creation - https://phabricator.wikimedia.org/T328674
[19:52:29] ryankemper ebernhardson here's my first crack at trying to set the rundirs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126643 . Happy to look at other approaches if y'all have suggestions
[19:53:05] Can look in 15
[19:57:54] no hurry, this is not going to block the migration, just make things a little less ugly. We need a followup patch to remove the `ExecStartPre` line from `modules/opensearch/templates/initscripts/opensearch_1@.systemd.erb`, but since that touches other teams' servers we can wait
[20:05:58] i suppose i would have hoped to fix the other part, instead of having the things created twice. But we don't have nice integration environments that let us re-image a machine 20 times testing what is going to work
[20:13:36] yeah, or ansible --check
[20:14:30] but yeah, feel free to offer a better way, I just threw that up there as a first draft
[20:27:28] by better, i guess i mean if it's failing because the user doesn't exist, we need a dependency somewhere that ensures the user exists before that tries to run. The main question would be where that belongs. I'm not even sure where the user gets created :
[20:30:08] best guess would be it comes from the package installation via debian, since we don't seem to declare it anywhere
[20:33:46] but then i'm not really sure about best practices there, it could be to depend on the package, or it could be to explicitly declare the users
[20:37:27] i suppose i'm thinking something along these lines: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126653
[20:39:24] yeah, it def comes from the package
[20:39:57] since we already depend on the package in half a dozen places in instance.pp, it seemed reasonable to move the requirement up a level so instance.pp doesn't run until the package is installed
[20:39:58] and that definitely looks cleaner
[20:52:50] OK, +1'd
[20:52:58] will reach out to 0lly as well
[20:59:18] 4’ late to pairing
[20:59:45] ACK
[22:00:48] that was quick, debian-glue is already deployed for opensearch/plugins
[22:06:36] Nice