[08:45:28] inflatador: can you move the phab task to in progress when you start working on it? ^
[08:55:41] o/
[08:55:51] checking the failing mjolnir dag
[09:01:07] o/
[09:31:55] o/
[09:33:17] dcausse: Regarding your comment https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1129871/comment/302780b2_b248fe10/ - Do you expect full tag replacement? Because right now it’s implemented to replace prefixes only.
[09:35:09] pfischer: what do you mean by "full tag replacement"? The way you implemented the logic is what I was expecting, so perhaps my review comment is misleading?
[09:36:35] Alright, the suggested settings.txt comment suggested it’s replacing tags, but the tag names would remain unchanged. So I’ll just make the documentation more explicit. Thanks!
[09:37:26] oh I see, my example "old.tag.name", yes, it should be more like "old.tag.prefix"
[09:39:42] and "replacement tag" -> "new tag prefix name", yes, definitely what I suggested is confusing, sorry about that!
[09:39:48] np, fixed, just wanted to make sure we expect the same thing here.
[09:41:29] I also often refer to "tag" as "classification.prediction.articlecountry"; it's more accurate to say "tag group"/"tag prefix"...
[09:59:37] dcausse: Understood. But what’s the glossary terminology? /| ? (https://wikitech.wikimedia.org/wiki/Search/WeightedTags)
[10:03:23] pfischer: technically yes, but when speaking outside the team I'm not sure, saying "articlecountry prefix" might not be very explicit
[10:04:39] dcausse: Ah, that makes sense, I forgot about the non-tech context. ^^
[10:04:39] or "articlecountry weighted tags" might be ok, the *s* in tags suggests some sort of grouping
[10:09:25] I think the problem might just be me not using precise terminology in a technical context; outside of the team, using "weighted tags" was generally not so ambiguous
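To make the prefix-vs-full-replacement distinction above concrete, here is a small illustrative sketch (Python, not the actual CirrusSearch implementation). It assumes weighted tags are strings shaped like `prefix/name|score`, as I understand the format on the wikitech page linked above; only the prefix is rewritten, tag names and scores stay untouched.

```python
# Illustrative sketch only; the real logic lives in the CirrusSearch patch above.
# Assumes weighted tags look like "prefix/name|score", with the "/name|score"
# part optional (format per https://wikitech.wikimedia.org/wiki/Search/WeightedTags).

def replace_tag_prefix(tags, old_prefix, new_prefix):
    """Rewrite only the tag prefix; tag names and scores stay unchanged."""
    replaced = []
    for tag in tags:
        prefix, _, rest = tag.partition("/")
        if prefix == old_prefix:
            replaced.append(f"{new_prefix}/{rest}" if rest else new_prefix)
        else:
            replaced.append(tag)
    return replaced

print(replace_tag_prefix(
    ["classification.prediction.articlecountry/FR|800", "recommendation.link/exists|1"],
    "classification.prediction.articlecountry",
    "classification.prediction.articlecountry.v2",  # hypothetical new prefix
))
# -> ['classification.prediction.articlecountry.v2/FR|800', 'recommendation.link/exists|1']
```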
[10:40:05] i made a bit of a mess with param passing in the data retention mjolnir changes :(
[10:40:13] patches incoming
[10:53:30] lunch
[12:08:28] workout+lunch
[13:13:18] o/
[13:19:56] thanks for the review on that wdqs alerts patch...will get a new one up that uses your approach shortly
[13:21:05] thx!
[13:55:06] gehel our 1x1 is double-booked with the DPE deep dive, what do you want to do?
[13:57:02] inflatador: let's cancel our 1:1 for today
[13:57:05] gehel ACK
[14:12:05] \o
[14:12:36] .o/
[14:13:48] o/
[14:55:52] gehel are we doing the DPE retro in 5m, or triage in an hr, or both?
[15:51:52] hmm, i didn't know about https://www.mediawiki.org/wiki/Object_cache#Main_stash . Perhaps MLT should be cached there instead of MainWANObjectCache
[15:52:36] Behaviour: may involve disk read (1-10ms), semi-persistent, shared between application servers and replicated across data centers.
[15:53:41] ask your doctor about Main_stash!
[16:34:43] * ebernhardson wonders how to convince chrome to go to the "right" place when i type phab...it's been trained over years to open https://phabricator.wikimedia.org/tag/discovery-search-sprint/
[16:55:05] ebernhardson: re main_stash you're mostly interested in the cross-dc replication?
[16:55:36] dcausse: also that it stores on-disk instead of in memcached
[16:56:25] i remember before we considered longer cache times, but didn't want to set it too long because it was going to be 50GB+ that had to be held in memory
[16:58:20] memcached looks to be 2.7TB in both eqiad and codfw, so it's not like the 50GB is the end of the world, but still seems a bit much. Machines are 128GB of memory each, which means almost half a server just to cache MLT
[16:59:31] makes sense, was never really sure what drives evictions, if it's almost always ttl or sometimes capacity
[17:00:32] I guess my last analysis was https://phabricator.wikimedia.org/T264053#6507373, which says a 1 day ttl is ~28.2GB, and a 7 day TTL would be ~41G. If somehow everything was cached it would be ~400GB (sounds dubious?)
[17:02:58] i have no clue what drives evictions either, and not sure how to find out
[17:03:02] everything is all content pages getting a morelike?
[17:03:22] yea, everything is taking the number of docs in *_content and multiplying
[17:03:28] ack
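The "everything cached" figure above is a simple multiplication of the content doc count by an average cached entry size; a back-of-envelope sketch with placeholder numbers (the real analysis is in T264053#6507373):

```python
# Back-of-envelope sketch of the "everything cached" estimate discussed above.
# Both inputs are placeholders, chosen only to show the shape of the calculation.

content_docs = 60_000_000     # assumed doc count summed across the *_content indices
avg_cached_bytes = 7 * 1024   # assumed average size of one cached morelike entry

total_gb = content_docs * avg_cached_bytes / 1024**3
print(f"if every content page had a cached morelike entry: ~{total_gb:.0f} GB")

# The ~28 GB (1-day TTL) and ~41 GB (7-day TTL) figures above are smaller because
# they only count the distinct pages actually requested within the TTL window,
# not every content page.
```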
[17:04:12] the morelike cache does not use getWithSet so we don't get much of the usual metrics
[17:04:22] i suppose those numbers also assume our cache keys work as expected and there is only 1 key per content page. Hopefully true :)
[17:05:25] yes... not sure that's true, mobileapps might possibly ask for more than 3 hits?
[17:05:52] hmm, i suppose this could be migrated to getWithSet, should be possible
[17:08:49] unrelated, but should we switch deployment-prep to opensearch?
[17:09:04] umm, almost certainly yes
[17:09:45] ok, filing a task, seems odd to move prod before the beta cluster :)
[17:09:56] yea
[17:12:43] oh yeah, I've kinda forgotten about that environment completely ;(
[17:13:31] filed T389971
[17:13:31] T389971: Migrate deployment-prep elasticsearch cluster to opensearch - https://phabricator.wikimedia.org/T389971
[17:17:15] re: write isolation, per docs/settings.txt it currently works as intended, with all clusters isolated by default. But i suppose it opens the question, is that an appropriate default?
[17:17:53] Who should I talk to before I update anything in deployment-prep?
[17:18:04] inflatador: that's the magic of deployment-prep, no-one owns it :P
[17:18:17] inflatador: it's partly in puppet and partly in horizon :)
[17:18:51] ebernhardson: perhaps I'm missing something but I don't think it adds much to have this extra hop
[17:19:12] The magic of un-reproducible environments ;p
[17:19:42] dcausse: i would generally agree, in particular if there is only a single configured cluster. I suppose there is a wider question: if WMF is the only place with multiple cirrus clusters, and we do all the writes from SUP now, should all this be simplified away?
[17:21:03] ebernhardson: true, does that mean we revert back to classic jobqueue retries and remove ElasticaWrite?
[17:21:23] I don't think we support the frozen index bits anymore
[17:21:34] dcausse: i think we still need ElasticaWrite for retries, that was its original intent. But we probably don't need write isolation, job queue partitioning, maybe more
[17:22:48] CirrusSearchPrivateClusters is also perhaps superseded by having groups in CirrusSearchWriteClusters now...a few things that are kinda vestigial
[17:23:16] ebernhardson: I'm all for simplifying all this
[17:25:03] I wonder if i remember enough about how it works to simplify without breaking things :P
[17:39:57] sigh... https://github.com/opensearch-project/opensearch-build/issues/184
[17:40:30] it's solved for opensearch 2 I guess? but it's a bit of a maze to find the url of a plugin
[17:40:43] :S
[17:41:07] do y'all consider deployment-prep a blocker for rolling out to production?
[17:41:46] inflatador: I'm not sure, we already moved cloudelastic
[17:41:56] inflatador: in principle, probably? But that's a weak yes. In theory we want deployment-prep to be able to recreate problems, but in practice we don't use it
[17:42:52] inflatador: perhaps we should do it at least before rolling search traffic to opensearch once codfw is done?
[17:43:07] * ebernhardson notes that part of the problem of finding the zips is that the usage of S3 for so many things means simple things like directory listings to browse around and find your .zip don't tend to exist
[17:43:26] dcausse ACK, I'm happy to do it first if that gives us more confidence
[17:43:27] no directory listing is a huge pain
[17:43:51] speaking of S3, that's going to break on deployment-prep for sure
[17:44:04] ?
[17:44:41] inflatador: you mean s3 secrets?
[17:44:47] I mean, the opensearch instances won't start without their S3 config...unless the puppet is different there? I guess I don't know for sure
[17:45:40] I don't think there are any s3 repos set up in deployment-prep
[17:47:11] yeah...unfortunately the Openstack Ceph is not compatible with Horizon projects that have a hyphen in their name
[17:47:36] I can create a user/bucket on the search project, but where to put the secret?
[17:48:32] i think the theory goes that there are no secrets in deployment-prep? But that could be 10 years old and out-of-date
[17:48:35] inflatador: I don't think we need anything related to s3 in deployment-prep
[17:49:02] the plugin requires a default set of secrets?
[17:49:44] dcausse not sure yet...I was just thinking that if it applies the cirrus::opensearch role it would need something
[17:49:58] but that could be 100% wrong. Let me do some digging and I'll get back to y'all
[17:50:01] inflatador: i would put the string `not-really-a-secret` and call it a day :P
[17:50:39] oh yeah, I was more concerned that the plugin would try to connect to the S3 endpoint and refuse to start if it wasn't there
[17:50:58] hmm, i suppose my hope would be that it doesn't do anything until the snapshots api is invoked, but not really sure
[17:50:59] But I'm piling assumptions on top of assumptions at this point ;P
[17:51:06] we all do :)
[17:51:10] it's the only way to survive
[17:51:25] :)
[17:52:25] No worries, will dig thru deployment-prep horizon once I get back from lunch
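One low-effort way to settle the "does the s3 plugin matter before a snapshot repository exists" question above would be to ask the cluster directly. A sketch using the standard `_cat/plugins` and `_snapshot` REST endpoints; the hostname and plain-HTTP-on-9200 assumption are hypothetical:

```python
# Quick check of what's actually configured. Assumes the cluster is reachable
# over plain HTTP on port 9200; the hostname below is hypothetical.
import requests

BASE = "http://deployment-elastic09.deployment-prep.eqiad1.wikimedia.cloud:9200"

# Installed plugins; repository-s3 shows up here if it's loaded.
print(requests.get(f"{BASE}/_cat/plugins?v", timeout=10).text)

# Registered snapshot repositories; if this comes back empty, the s3 plugin has
# nothing to connect to and should stay inert until a repository is created.
print(requests.get(f"{BASE}/_snapshot/_all", timeout=10).json())
```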
[18:08:25] dinner
[18:24:38] back
[18:29:30] DPE SRE has a flapping alert for `AirflowDeploymentUnavailable: airflow-scheduler.airflow-search is unavailable on dse-k8s-eqiad ` ... any idea what might be causing that?
[18:30:56] Ben and I are looking, so NBD if not
[18:31:09] i have some vague memory that it occurred on another team's thing too, but i'm forgetting the name
[18:31:52] https://wikimedia.slack.com/archives/C055QGPTC69/p1742325253186869 is what i was thinking of
[18:32:09] but i guess that's slightly different
[18:32:19] in general, i don't know how the new k8s variant works :P
[18:38:16] * inflatador is learning now ;P
[18:47:43] we've just increased scheduler resources, exactly as in your link :)
[18:48:35] spoke too soon...we're still getting 503s from the scheduler
[18:48:42] :(
[18:53:13] oof
[18:53:54] i wanted to release mjolnir 2.5.1, but I'm afraid ci already bumped to 2.6.0.dev https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/blob/main/.bumpversion.cfg?ref_type=heads
[18:54:20] * gmodena wonders how to trigger major and patch releases with wmf_workflow_utils
[18:54:32] gmodena: there is a button on the pipelines page, sec
[18:55:07] gmodena: the play icon to the right that opens a dropdown, on commits to main, has a 'trigger_release' option
[18:55:16] https://gitlab.wikimedia.org/repos/search-platform/mjolnir/-/pipelines
[18:55:37] i think it might not be available until the CI finishes after the merge
[18:56:08] ebernhardson ack on that. But do you think setting a VERSION variable in there would work?
[18:56:09] i think we have that same thing in most of our repos, i find it easier than pushing a tag (although pushing a tag is of course pretty easy)
[18:56:34] gmodena: hmm, no clue. I've always just let it version things. I know it's not semver, but personally i could have dates as the version :P It's not like we have external users
[18:56:34] i would like to release the current main, with the MR you accepted (thanks!), as 2.5.1
[18:56:40] but I think I've missed the boat
[18:57:05] ebernhardson eh
[18:57:19] ok with you if I release as 2.6.0 then?
[18:57:22] ea
[18:57:24] yea
[18:57:37] feels a bit silly, but i don't want to manually mess with tags / file configs
[18:57:49] ebernhardson thanks
[18:57:50] feels totally normal to me :) Each release gets a +1
[18:58:03] ok!
[19:18:05] Re: Airflow deployment alerts, it was actually due to a misbehaving ceph node. I'm writing some docs to connect the dots
[19:26:36] OK, extremely rough writeup here, but the TLDR is that some pod crashes can be traced to misbehaving ceph volumes https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Ceph/Troubleshooting
[19:29:33] makes me more worried about the elastic on ceph ideas :P
[19:30:54] * ebernhardson is going to have a hard time writing not-elastic :P
[19:32:45] agreed on both points ;). But I feel like these are growing pains...pretty much everyone else in the world is using crazy hyperconverged storage. Hopefully we can work the kinks out
[19:33:48] i suppose my worry is that other places also have networks at a scale that's 10x what we do
[19:35:27] like, i've read that many places are running 50GBE within racks, and 25GBE between racks. With 100GBE used in ML dedicated clusters
[19:36:22] some quick searching suggests 200GBE and 400GBE are not unheard of for between racks
[19:37:36] I dunno...I think we need some work on QoS, but we had entire DBaaS products running on bonded 10 Gbps connections between racks
[19:38:06] 10 Gbps or even 1 Gbps between racks. Granted, that was built 10 years ago and I haven't seen it in 5 years
[19:39:01] sorry, 1 or 10 Gbps within racks, between 10 and 20 Gbps between racks
[19:41:39] Backport deploy in 20 minutes, after which we'll have mediawiki querying categories from `wdqs-main` instead of `wdqs-full`
[19:41:50] {◕ ◡ ◕}
[19:42:05] i dunno, it's certainly possible, i suppose i just worry we never built the DC networks out for high-bandwidth applications. But we'll see how it works
[19:43:05] yeah, I think we'll need to do QoS, bonding, maybe beefier network gear like you're saying
[20:24:57] File does not exist: hdfs:/wmf/cache/artifacts/airflow/search/mjolnir-2.6.0.conda.tgz
[20:24:58] mmm
[20:27:52] hmm, that is odd. Usually the artifact handling just works™
[20:28:51] i think it's done as part of `scap deploy` from the deployment host?
[20:29:05] (not sure if that's even a thing in the new k8s stuff)
[20:30:18] i think we still need to scap
[20:33:09] Thoughts on how to test https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1124535/ ? The scap deploy is finishing up now
[20:33:27] Probably exec into a mediawiki pod and try to curl? Although curl might not be installed on the container
[20:33:40] Can also look at https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs-internal-main&var-graph_type=%289103%7C9194%29 although the qps graph is hard to read
[20:33:43] ryankemper: issue some deepcat queries, see if it doesn't fail
[20:33:51] https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#DAGs_deployment does not mention scap. I'll ping the broader group in slack
[20:34:07] you could check the access logs on the internal-main hosts too, see if stuff is showing up there
[20:34:13] ryankemper: https://www.mediawiki.org/wiki/Help:CirrusSearch#Deepcategory
[20:34:49] ryankemper: essentially, use the browser plugin to route requests to the appropriate mwdebug host, then issue a deepcat query like the example
[20:36:28] curious that 'artifact' doesn't show up on the k8s or airflow pages anywhere :S
[20:36:33] (on wikitech)
[20:37:12] and https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Developer_guide still references scap :P
[20:38:55] ebernhardson: so if I got appropriate results for `deepcat:"musicals"` on `en.wikipedia.org` with `X-Wikimedia-Debug:backend=k8s-mwdebug` via the browser tool then I should be good?
[20:39:21] ryankemper: assuming scap is at the point where it tells you it's been deployed to mwdebug and is waiting for confirmation that it's ok, yea that should be it
[20:39:26] yeah it is
[20:39:29] great
[20:39:32] ryankemper LGTM assuming the above as well
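The manual check above can also be scripted. A sketch that sends a deepcat query through the standard MediaWiki search API with the `X-Wikimedia-Debug` header value mentioned in the chat; treat it as an illustration rather than the exact check that was run:

```python
# Scriptable version of the manual deepcat check above. The header value is the
# one mentioned in the chat; the query parameters are the standard search API.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "search",
        "srsearch": 'deepcat:"musicals"',
        "format": "json",
    },
    headers={"X-Wikimedia-Debug": "backend=k8s-mwdebug"},
    timeout=30,
)
resp.raise_for_status()
hits = resp.json()["query"]["search"]
# A handful of sensible titles, with no error in the API response, is the signal
# that the deepcat -> category lookup -> wdqs-main path is working.
print([hit["title"] for hit in hits[:5]])
```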
[20:42:36] * ebernhardson takes minor offence to the statement (in a commit message) "The default CirrusSearch configuration does not work out of the box". The problem is they don't have elasticsearch installed and are complaining that it gives errors :P
[20:42:49] imo that's what cirrus should do
[20:43:15] Yeah, I dunno what the alternative would be
[20:43:40] they asked for the default to be set up so cirrussearch doesn't attempt to write anywhere when installed unless otherwise configured
[20:44:49] it's just a weird CI setup where they have Cirrus because of extension dependencies, but they don't actually want to use cirrus
[20:45:04] So I guess they just want a silent failure when it doesn't exist or something?
[20:45:09] maybe a warning?
[20:45:30] i gave them some config settings that match prod, where cirrus doesn't attempt to write (and assumes some external application will do that)
[20:48:02] interesting, I guess that's how it works now due to the new SUP?
[20:48:28] yes, for SUP we added a way to disable direct writes
[20:49:27] welp. discolytics is not happy with the query_clicks_ltr schema.
[20:49:35] * ebernhardson is mostly commenting here to avoid complaining on the patch...they can do what they want i guess :P
[20:49:46] :S
[20:52:29] gmodena: the `raise ValueError('schema does not match any known datetime partitioning')` error?
[20:52:54] yes - I'm just checking what's up
[20:53:35] hmm, query_clicks_ltr is partitioned by year/month/day. should work :S
[20:53:42] err, query_clicks_daily
[20:54:48] that error message should probably include the table name just to make things more obvious
[20:54:58] i guess the function doesn't have the name though
[20:56:37] did i do something silly like drop those columns earlier?
[20:57:51] gmodena: sigh, yes, the problem is mjolnir.cli.helpers: in `require_daily_input_table` it drops year/month/day. I suppose the proper integration would move that condition into the input param helper
[20:58:32] ah!
[20:58:35] good catch
[20:59:52] i suppose that would also move the start/end date arguments into the helper as well
[21:01:22] or alternatively, i suppose a more complete integration would be to delete the HivePartition class from mjolnir and migrate everything to the HivePartition class in discolytics (it started in mjolnir, then we moved it but never got around to using discolytics in mjolnir)
[21:02:00] i should probably have remembered all this last week :(
[21:02:24] np
[21:02:34] i should have tested better
[21:02:36] inflatador: anything in mind for pairing today?
[21:02:43] and not just believe integration tests :)
[21:02:53] i'll sleep on it and see how to refactor tomorrow morning
[21:02:53] integration testing spark has always been hard
[21:03:01] and slow...
[21:03:01] yeah, airflow too
[21:03:31] just spotted a typo (variable name) in the MR i just merged. Sigh. Will do a quick f/up on that too.
[21:03:42] ryankemper was gonna look at T389971
[21:03:43] T389971: Migrate deployment-prep elasticsearch cluster to opensearch - https://phabricator.wikimedia.org/T389971
[21:04:35] ebernhardson for the sake of wrapping up this data retention bit of work, I'd be keen on refactoring the filtering logic into the helpers class
[21:05:02] and eventually refactor everything to discolytics at a later stage (which I won't get to this week :()
[21:05:05] what do you think?
[21:05:07] gmodena: agreed, the simple answer for now is fine. Maybe a ticket for a followup to use HivePartition from discolytics
[21:05:15] across the whole package
[21:05:18] ack
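The fix discussed above — applying the start/end date condition while the year/month/day partition columns are still present, and only dropping them afterwards — might look roughly like the following. Names are hypothetical; this is a sketch, not the actual mjolnir.cli.helpers code.

```python
# Minimal sketch of the shape of the fix: filter on the partition columns first,
# drop them second. Function and column handling are hypothetical.
from datetime import date

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def load_daily_partitions(spark: SparkSession, table: str, start: date, end: date) -> DataFrame:
    """Read a year/month/day partitioned table restricted to [start, end]."""
    df = spark.read.table(table)
    # Assumes numeric (or numeric-castable) year/month/day partition columns.
    part_key = F.col("year") * 10000 + F.col("month") * 100 + F.col("day")
    lo = start.year * 10000 + start.month * 100 + start.day
    hi = end.year * 10000 + end.month * 100 + end.day
    return (
        df.where(part_key.between(lo, hi))
        # Drop the partition columns only after the date range has been applied,
        # so downstream schema checks never see a half-stripped table.
        .drop("year", "month", "day")
    )
```

Longer term, per the discussion above, the same logic would presumably live in the shared HivePartition handling in discolytics rather than in each CLI helper.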
[21:26:21] i published a couple more examples of llm output at https://people.wikimedia.org/~gmodena/search/T388549/llm-judge/
[21:31:16] for now I've only been using random pages (this is just a sample), holler if there's some query you'd be interested to try! cc / Trey314159 ;)
[21:31:24] does not have to be only on enwiki
[21:38:39] gmodena: i might suggest trying a few with only the top 3, since that's what the UI will show
[21:39:15] ahh i see, cycling has the top 3
[21:40:34] ebernhardson ack. dcausse made the same point :D
[21:40:47] here I went for 10 to see if more context might help
[21:40:52] but I don't think it does
[21:41:02] (based on the few experiments)
[21:42:24] welp. It got late here. Talk to you tomorrow!
[21:42:27] * gmodena waves