[09:30:48] dcausse no rush, but would you maybe have 15-30 mins to chat about vector search sometime today?
[09:31:18] gmodena: sure
[09:35:59] dcausse thanks! Would after the unmeeting work for you?
[09:36:12] gmodena: yes
[09:36:40] dcausse terrific, thx
[13:01:53] dcausse do you have a feel for which profile I should use when querying morelike (https://www.mediawiki.org/wiki/API:Search)? Are classic or classic_noboostlinks reasonable?
[13:06:58] gmodena: should be classic_noboostlinks to mimic what mobile web is doing
[13:07:08] o/
[13:09:21] inflatador: any objections to completing the migration of relforge? I'd like to import a couple indices there
[13:09:51] dcausse ack.
[13:11:36] o/
[13:12:33] dcausse you mean reimage relforge1003 to OpenSearch?
[13:13:56] unrelated, Observability has asked us to look at this patch about kafka logging topic changes. I'm not sure of all the implications, so passing it on: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1128793
[13:21:25] inflatador: relforge1003 yes
[13:24:13] dcausse ACK, can maybe get to it this afternoon. It doesn't work with our reimage cookbooks b/c it's so old, so it might take a bit. We should also have some new relforge hosts within the next 2 wks
[13:35:56] :(
[14:08:05] \o
[14:08:30] o/
[14:14:47] * ebernhardson is pondering some way to tell the saneitizer to go backwards x hours with all the states...but not sure if worthwhile
[14:15:13] also not sure how :P i know there is some way to add api calls to the streaming app, but not sure if that's really viable or reasonable
[14:16:50] it also wouldn't really go backwards x hours in reality, it would just be a calculation of what the page_id probably was then
[14:19:22] if we have the chunk of ids we could do something
[14:19:36] perhaps querying the _general indices?
[14:19:59] but unsure that's worth the effort
[14:20:27] hmm, yea i suppose we can get a set of id's by searching for the namespace that shouldn't be there...it's about 300k docs so it's something, but indeed not sure if it's worth worrying about a ton. They will be fixed in 2 weeks
[14:20:50] i was thinking if it was generic functionality we could use in the future it might be worth putting together, but as a one-off probably not
[14:21:30] sure
[14:21:31] it's a bit tedious to search for because it's all wikisources though
[14:21:45] yes...
[14:22:51] I'm roll-restarting cloudelastic with the cookbook now, seems to be working
[14:24:43] trying to verify that is properly consumer a topic, seeing "checkpoint 27674 as completed for source Source: mediawiki.cirrussearch.page_weighted_tags_change.v1-source."
[14:25:10] which is encouraging, but not seeing any offsets moving in kafka burrow, which is concerning...
[14:25:39] s/that is properly consumer/that flink is properly consuming/
[14:26:18] hmm, indeed the offsets should move
[14:26:45] our setup in staging is a bit weird tho...
[14:28:13] dcausse ebernhardson CC'd you on the above observability patch just so it doesn't get lost. Pretty sure we are not affected though...
[14:30:42] inflatador: it mentions apifeatureusage, I suppose that's why, but I have no clue how to review this patch :/
[14:30:42] inflatador: i took a quick look, it seems reasonable
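A minimal sketch of the kind of morelike query discussed above (13:01-13:06), pinning the rescore profile via the `srqiprofile` parameter documented on the linked API:Search page. The wiki host and seed page title are hypothetical examples:

```python
import requests

# Sketch: a "morelike:" search with an explicit rescore profile, per the
# discussion above. Host and seed title are placeholders.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "search",
        "srsearch": "morelike:Albert Einstein",
        "srqiprofile": "classic_noboostlinks",
        "srlimit": 10,
        "format": "json",
        "formatversion": 2,
    },
    timeout=10,
)
resp.raise_for_status()
for hit in resp.json()["query"]["search"]:
    print(hit["pageid"], hit["title"])
```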
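And a rough sketch of the id-collection idea from 14:19-14:20 (finding docs in a namespace that shouldn't be in the index): a term query on the integer `namespace` field, paged with the scroll API. The endpoint, index name, and namespace id are placeholders; treating `_id` as the page_id is an assumption based on how cirrus docs are keyed:

```python
import requests

ENDPOINT = "http://localhost:9200"   # placeholder cluster endpoint
INDEX = "frwikisource_general"       # placeholder index name
NAMESPACE = 104                      # placeholder namespace id

# First page of results; only ids are needed, so skip _source.
resp = requests.post(
    f"{ENDPOINT}/{INDEX}/_search",
    params={"scroll": "1m"},
    json={
        "size": 1000,
        "_source": False,
        "query": {"term": {"namespace": NAMESPACE}},
    },
).json()

page_ids = []
while resp["hits"]["hits"]:
    page_ids.extend(hit["_id"] for hit in resp["hits"]["hits"])
    # Fetch the next batch using the scroll cursor.
    resp = requests.post(
        f"{ENDPOINT}/_search/scroll",
        json={"scroll": "1m", "scroll_id": resp["_scroll_id"]},
    ).json()

print(f"collected {len(page_ids)} page ids")
```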
[14:32:16] dcausse yeah me neither. ottomata says we're good, so I'm gonna trust him ;P
[14:32:50] it essentially keeps the existing input the same as it has been, afaict, and then adds an input for a k8s-mw-* topic pattern
[14:33:07] i mean, i don't know what the k8s-mw-* topic pattern is, but it looks plausible :P
[14:33:18] :)
[14:34:17] turns out there's no data in the topic we consume, staging is forcing the codfw.* topic
[14:36:30] unrelated, is there a way to run check_indices.py with the new k8s maintenance stuff?
[14:37:09] inflatador: hmm, probably...sec lemme try
[14:37:30] inflatador: oh, actually it's a python script, it doesn't need k8s
[14:37:40] but it does need mwscript...hmm
[14:37:50] inflatador: it should have moved there: https://gitlab.wikimedia.org/repos/search-platform/cirrus-toolbox
[14:37:58] looking to see if I changed it
[14:38:15] and it no longer uses mwscript
[14:38:40] oh nice! i didn't realize we moved that all in there
[14:39:20] annoyingly I forgot to clean up the CirrusSearch script folder so it's confusing
[14:44:52] dcausse NICE! How does it know what cluster or clusters to check?
[14:45:34] I guess it gets it from the MW endpoint arg?
[14:45:50] hm... can't remember
[14:46:55] I think it's hardcoded
[14:47:16] https://gitlab.wikimedia.org/repos/search-platform/cirrus-toolbox/-/blob/main/cirrus_toolbox/check_indices.py?ref_type=heads#L423
[14:48:05] you can't target a specific cluster, it runs all of them
[14:48:29] That's fine, thanks for the help
[14:50:09] still getting read timeouts/503s with the rolling-operation cookbook. checking...
[14:53:34] wish i had a better idea about those :(
[14:56:50] That's OK, we expected some bumps in the road. If we're lucky, it's just a problem with the `allow-yellow` logic
[15:14:26] cloudelastic's in the red now...checking
[15:15:00] if anyone has a minute for a quick patch: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124485 (parent patch just got deployed successfully)
[15:15:19] hmm, 0 shards assigned, some cluster block exception
[15:15:28] :/
[15:16:55] seems like maybe the first error: [2025-03-20T15:13:32,304][WARN ][o.o.t.TcpTransport ] [cloudelastic1009-cloudelastic-chi-eqiad] exception caught on transport layer [Netty4TcpChannel{localAddress=/10.64.32.30:9300, remoteAddress=/10.64.48.24:53582}], closing connection
[15:16:57] java.lang.IllegalStateException: transport not ready yet to handle incoming requests
[15:17:42] but it still seems to have a cluster formed
[15:18:18] odd that we aren't getting any alerts
[15:19:57] org.opensearch.action.NoShardAvailableActionException: null :/
[15:20:57] I re-enabled replica allocation and we're back to yellow
[15:21:16] still, the cookbook is clearly not ready for prime time
[15:21:30] curious, indeed shard allocation immediately jumped up to ~250 per instance, from 0
[15:22:06] was cloudelastic1009 reimaged?
[15:22:24] not super confidence-inspiring that elasticsearch always kinda just worked, and the first opensearch migration has...interesting things happening :P
[15:22:40] yes :/
[15:22:41] dcausse I thought it was! But let me double-check
[15:23:27] inflatador: nvm I thought you performed some action on cloudelastic1009
[15:23:30] 1009 also became master, i don't know how it decides but i would have expected something else to take that role if it was just reimaged (it was certainly restarted at 15:13 at least)
[15:23:42] I see a 12-day uptime on cloudelastic1009
[15:23:47] 1009-chi shows it booted up at 15:13, so if it wasn't reimaged, why did it restart?
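The re-enable at 15:20:57 above is normally done through the cluster settings API. A hedged sketch (the endpoint is a placeholder, and using a transient rather than persistent setting is an assumption):

```python
import requests

ENDPOINT = "http://localhost:9200"   # placeholder cluster endpoint

# Rolling operations commonly set cluster.routing.allocation.enable to
# "primaries" (or "none") before restarts; restoring it to "all" lets
# replicas allocate again, which matches the jump from 0 assigned shards.
resp = requests.put(
    f"{ENDPOINT}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.enable": "all"}},
)
resp.raise_for_status()
print(resp.json()["transient"])
```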
[15:24:02] weird
[15:24:49] based on `dpkg -l | grep -i elasticsearch`, all of 'em are on opensearch. You can also look at `cat /etc/wikimedia/contacts.yaml` to see what role's applied
[15:24:59] i guess i'm assuming a restart because i can see it loading plugins in the log at that point
[15:25:14] but there are so many stack traces it's hard to decide what's going on :S
[15:25:31] I was running the rolling-operation cookbook at 15:13, so that's why the service was restarted
[15:25:54] ahh yea found it, stopped at 15:13:12, restarted at 15:13:25
[15:26:11] but why did it restart 1009?
[15:26:36] did maybe the whole cluster get restarted at the same time?
[15:27:33] hmm, no, looking at 1010 logs i don't see it being restarted then
[15:27:51] 1010-chi last restarted at 14:29
[15:28:00] yeah, here are the restart times: https://etherpad.wikimedia.org/p/cloudelastic-mystery-red
[15:28:08] that's just for chi BTW
[15:28:37] i guess the curious thing to me right now is that 1009 wasn't being reimaged, but it was still restarted
[15:28:42] needless to say, this moves up the timetable for reimaging relforge1003 ;)
[15:28:43] i don't know if that's a cause, just curious
[15:29:09] I used the 'restart' option on the cookbook, as opposed to reimage
[15:29:22] ahh, so rolled a restart across the cluster? ok
[15:29:39] meeting time, but we can ponder this after if you want
[15:29:42] sure
[15:31:09] probably unrelated but seeing: "Caused by: java.lang.IllegalArgumentException: layout parameter 'type_name' cannot be empty" in journalctl
[15:31:21] logging config issue I suspect
[15:31:31] dcausse: meeting?
[15:31:36] dcausse: https://meet.google.com/axr-okqe-oht
[15:31:39] oops
[15:31:40] cc: gmodena ^
[15:32:37] gmodena: only if you are interested and have time!
[15:45:19] ebernhardson: I am testing a reindex script for ES on my local machine. I do have an index with mappings that I reindex. However, docs that exist in the source index with the weighted_tags array end up in the new index, but without the weighted_tags field. Feels like I am missing something. I manually added a mapping for weighted_tags to the dest index, too, but that didn't change anything.
[15:45:39] That happens even without the script.
[15:48:21] pfischer: hmm, in terms of adding the field manually i think that's expected. Basically you can add fields to elasticsearch indices but it doesn't do anything with the things already indexed, it only applies to things that get indexed going forward
[15:48:40] as for why it doesn't get put in initially, i'm double checking
[15:52:47] hmm, so they should be configured by default. Essentially extension.json defines 'wikimediatags' as a hook handler for 'SearchIndexFields', and that hook handler is supposed to add the field to the mapping
[15:55:17] pfischer: you should 'require_once "$IP/extensions/CirrusSearch/tests/jenkins/FullyFeaturedConfig.php";' in LocalSettings.php and reindex
[15:55:18] pfischer: i think i might not quite be following your problem, you're saying the _source doc has weighted_tags, you do the reindex operation, and then it doesn't have weighted_tags in the source field anymore? Or that the field is there, but it's not in the mappings and thus not searchable?
[15:56:03] i typically use $IP/extensions/CirrusSearch/tests/jenkins/IntegrationTesting.php, but FullyFeaturedConfig is a subset and probably works too
[15:56:17] maybe we should rename that at some point, jenkins might be misleading
[15:56:18] it's enabled via wgCirrusSearchWMFExtraFeatures = [ 'weighted_tags' => ... ]
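To illustrate the point from 15:48 (a mapping added to an existing index only applies to documents indexed afterwards), a sketch of the usual fix: add the field to the destination index's mapping, then reindex so the existing _source docs get indexed against it. The host, index names, and field type are assumptions for illustration, not CirrusSearch's actual mapping:

```python
import requests

ES = "http://localhost:9200"   # placeholder; any ES/OpenSearch endpoint

# 1. Add the new field to the destination index's mapping. Existing docs
#    are untouched: a mapping change alone never re-indexes anything.
requests.put(
    f"{ES}/testwiki_content_new/_mapping",
    json={"properties": {"weighted_tags": {"type": "keyword"}}},  # assumed type
).raise_for_status()

# 2. Reindex. Each doc's _source (which already carries weighted_tags) is
#    indexed fresh against the new mapping, making the field searchable.
requests.post(
    f"{ES}/_reindex",
    params={"wait_for_completion": "true"},
    json={
        "source": {"index": "testwiki_content"},
        "dest": {"index": "testwiki_content_new"},
    },
).raise_for_status()
```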
[15:56:22] ahh
[15:56:37] ebernhardson: oh, yes that's ambiguous
[15:57:25] i dunno what to call it instead though...i think it was called jenkins because the idea was that CI would use them to run tests.
[15:59:09] is Jenkins.php still referenced by CI?
[16:00:06] hmm, generally i doubt it but not sure how to check
[16:00:21] i've mostly used this either in the context of cindy or local dev
[16:02:21] not sure what to do, it's definitely confusing and I don't see a clear pattern, some jobqueue setup in Jenkins.php, some wfLoadExtension in both FullyFeaturedConfig and IntegrationTest...
[16:02:42] lunch, back in ~1h
[16:03:00] will merge/deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129877 when I get back
[16:03:33] some of it is unnecessary, i suspect deleteBrowserTestPages.php, nukeAllIndexes.php, and resetMwv.sh are all unnecessary now
[16:03:58] they were all related to running under vagrant and getting a partial reset without rebuilding the env (since that often didn't work)
[16:05:48] i think the idea of IntegrationTesting.php is that it pulls in everything needed for tests/integration to work, except there is more config in the LocalSettings.d of the cirrus-integration-test-runner repo
[16:06:24] i suppose part of the question is, where does this config belong? Should we move the things needed for integration testing into the cirrus-integration-test-runner repo? Or should cirrus know how to configure the wikis for the integration tests?
[16:06:59] it seems useful to have some sort of "turn most things on" config available in the main cirrus repo, although i don't know that other extensions do that
[16:15:01] yes definitely useful... I wonder how annoying it would be to have a single include, by annoying I mean forcing the dev to install a bunch of extensions they don't want, like SiteMatrix for instance
[16:15:34] it was a bit tedious tracking things down when i first set up the integration env, but after that i just type `./create-env.sh` and it works
[16:15:35] :P
[16:16:52] never went the route of using the same mwcli env for local dev but perhaps I should...
[16:17:25] but there'll be devs working with simple mw-docker setups still willing to ship simple patches to cirrus I think
[16:23:21] yea seems plausible, but they can probably write patches without the extra bits configured. FullyFeaturedConfig only requires Elastica, which is certainly required, IntegrationTesting.php brings in everything else which is needed for cross-wiki testing and such
[16:23:47] i also wonder if we really need pool counter configured in FullyFeaturedConfig.php. I guess it ensures that the code is run somewhere other than prod
[16:24:58] the confusing part about poolcounter is that it just works without the actual poolcounter service :)
[16:25:29] hmm, actually that's a good point. I don't remember actually running the pool counter service anywhere post-vagrant
[16:25:30] last I debugged this it just ignores connection issues and adds a warning
[16:26:58] ahh, that makes sense why it keeps working then. I guess that works fine for our purposes, we mostly want to see the code get invoked. Whether it fakes getting a ticket from the service is irrelevant
[16:28:55] true
[16:29:24] i guess the summary is, no clue what to do with this jenkins dir :P
[16:30:25] perhaps we can try to at least remove one file? :)
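The PoolCounter behaviour described at 16:24-16:25 is a fail-open pattern: if the lock service is unreachable, log a warning and proceed as though a slot was granted. A hypothetical Python illustration of that pattern, not MediaWiki's actual client code (the wire command is made up; 7531 is PoolCounter's usual port):

```python
import logging
import socket

logger = logging.getLogger("poolcounter")

def acquire_slot(host: str = "localhost", port: int = 7531) -> bool:
    """Try to take a PoolCounter-style slot; fail open when the service
    is down, matching the behaviour described above."""
    try:
        with socket.create_connection((host, port), timeout=0.5) as conn:
            conn.sendall(b"ACQ4ANY search 10 5 2\n")  # hypothetical command
            return conn.recv(64).startswith(b"LOCKED")
    except OSError as exc:
        # No service running: warn and pretend the slot was granted, so
        # dev/test environments keep working without PoolCounter.
        logger.warning("poolcounter unreachable, failing open: %s", exc)
        return True

if acquire_slot():
    pass  # run the rate-limited work here
```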
[16:31:36] and the few cleanup scripts possibly
[16:32:08] i suspect everything but FullyFeaturedConfig.php and IntegrationTesting.php can go
[16:32:18] Jenkins.php is definitely useless
[16:32:20] yes
[16:32:43] hmm, i'll submit a patch and see what cindy thinks
[16:32:47] thanks!
[16:39:00] cindy hates it :P looking into why
[16:42:16] :)
[16:47:42] looks like we still need cleanSetup.php. I guess that's been going so long i thought we had another bit that set up the indices during update.php
[17:12:53] back
[17:20:40] cirrus-reindex-orchestrator appears to work with mwscript-k8s, tested testwiki but now struggling to allocate replicas on testwiki_general...
[17:20:43] in cloudelastic
[17:22:16] sigh... took forever
[17:22:37] it's only 500mb...
[17:22:43] :S
[17:24:51] still some issues with my k8s integration, logs are only visible in e.g. cloudelastic/testwiki_content.reindex.log once the script is done... not particularly useful to monitor progress
[17:25:06] certainly me not understanding how to write to a file in python :)
[17:25:29] plausibly needs to be regularly flushed?
[17:25:37] yes most probably
[17:25:49] working on the relforge1003 reimage now
[17:25:53] inflatador: thanks!
[17:26:02] heading out, back later tonight
[17:33:38] * ebernhardson submitted a line buffering patch, plausibly it works
[17:34:02] some things are much easier than others...that took 10 minutes, but i've been looking at this growth_underlinked thing for like 2 hours now and not getting very far :P
[17:41:15] spoke too soon, we explicitly have a comment about this, once i found the right place: Not doing anything will result in a Cirrus error about a non-existent function type, which seems like a reasonable way to handle the case of using underlinked weighting on a wiki with no link recommendation task type.
[19:13:34] pondering the right option...we could move the GrowthExperiments profile from extension.json to a CirrusSearchProfileService hook handler. Then it could be conditionally registered only when valid. My minor dislike for that is that we would then go through the process of instantiating a variety of GrowthExperiments things on every search query to determine if this profile that isn't being used
[19:13:36] is allowable
[19:15:15] Alternatively we have fallback profiles, tempted to add a param to the profile, maybe `is_conditionally_supported`. When that's true we could fire a new `CirrusearchCheckRescoreProfileSupported` hook with the profile name and let GrowthExperiments only load its code and do checks at that point
[19:15:36] and do that as part of the normal fallback checks
[19:34:37] OK, relforge1003 is on opensearch. Note that I didn't reimage because of the manual problems, although I did disable all elastic units, ban/unban, and reboot several times. Everything looks OK to me, but LMK if you're noticing issues
[19:35:39] break, back in ~15-20
[19:52:26] back
[19:53:32] inflatador: you're good for the backport window right? I forgot it's at 1 not 2 so I'll still be out with the dog for awhile
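On the 17:25/17:33 buffering exchange: Python block-buffers regular file writes by default, so log lines may only hit disk when the buffer fills or the file closes. A minimal sketch of the two usual fixes (the log path is borrowed from the example above):

```python
# Fix 1: open the file line-buffered (text mode only); buffering=1 flushes
# on every newline, so `tail -f` sees progress immediately.
log = open("cloudelastic/testwiki_content.reindex.log", "w", buffering=1)
log.write("reindex starting\n")

# Fix 2: keep default buffering but flush explicitly after each write.
log.write("still running\n")
log.flush()

log.close()
```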
[19:54:42] ryankemper yeah, sounds like they will do the deploying for us, but I'll be around
[19:55:06] yea they do the deploy, but you have to be around to verify it works
[19:55:14] and to approve the deploy
[20:18:21] sigh, writing tests takes longer than writing the code...conditional profiles on hooks is easy, wiring up everything to test it not so much :P Not terrible, just lots of bits copied from other tests, and then wondering about sharing between tests
[20:48:53] is https://gitlab.wikimedia.org/repos/search-platform/cirrus-toolbox an appropriate place to add convenience scripts for looking at the production cirrus clusters? Or is that too much scope creep?
[21:31:41] inflatador: that's basically its purpose. It's cirrus in the context of wmf prod, but mostly operational things that aren't generally relevant to external cirrus users