[07:46:58] inflatador: thanks for relforge1003!
[09:08:08] dcausse: Is there a script to forcefully reindex? I see Reindexer and that it’s used by UpdateOneSearchIndexConfig, but only if there’s an alias that points to a different index than the expected one (the one identified via --indexSuffix). So it’s somewhat cumbersome to create a scenario where the Reindexer is actually used.
[09:14:37] pfischer: the options should be: "--reindexAndRemoveOk --indexIdentifier now --ignoreIndexChanged" to force a reindex even if it's not required
[09:16:54] "--reindexAndRemoveOk --indexIdentifier now" is usually what we run; --ignoreIndexChanged helps to skip an optimization that detects whether any mappings/settings change would justify a reindex
[09:44:09] dcausse inflatador could you give me a heads up when relforge is ready for testing? No rush at all, right now I'm looking at something else with tchin. Just wanted to make sure I'm following (I got a bit lost backscrolling :D)
[09:44:49] gmodena: should be ready, will schedule a copy of enwiki, frwiki & itwiki soon today
[09:51:14] dcausse thanks for the heads up!
[10:49:32] lunch
[11:01:37] started to import enwiki in relforge@gmodena_enwiki_content_20250321, might take ages... relforge is way slower than I had expected: time curl localhost:9200/gmodena_enwiki_content_20250321/_count -> real 3m8.972s
[11:02:01] I used 15 shards like prod, which might perhaps explain it, but still...
[11:02:18] lunch
[11:28:33] dcausse welp.
[13:12:05] Is relforge still going slow? It's really old hardware... probably the last 1G hosts we have. We should have new relforge hosts within the next week or so
[13:32:32] o/
[13:36:53] inflatador: ack, yes somewhat slow, not sure what the bottleneck is at this point https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-cluster=relforge&var-instance=All&var-site=eqiad
[13:38:17] but no worries, it's making progress: indexed 3.1M articles, which should be ~50% of enwiki
[13:39:50] Cool... don't see any obvious resource starvation https://grafana.wikimedia.org/goto/iU-MSehNR?orgId=1
[13:40:59] CR to reduce noisy alerts on the RDF streaming updater. I need to fix the cleanup script, but I haven't had time: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1130114
[13:50:10] ^^ no need to review the above, it's been merged
[14:06:56] \o
[14:07:07] o/
[14:07:24] realized... the cloudelastic name is almost as bad as logstash.wikimedia.org :P
[14:07:28] it doesn't run elastic
[14:07:51] yes... :)
[14:08:25] ebernhardson see T387028
[14:08:25] T387028: Decide on a new name for Elastic hosts - https://phabricator.wikimedia.org/T387028
[14:08:51] seems a decent start :)
[14:15:52] although... does renaming require a reimagine? maybe we should have chosen before migrating
[14:15:56] reimage
[14:17:12] I think you do, but that's OK https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging
[14:23:36] worth consideration
[14:25:28] * ebernhardson wonders if cirruscloud is a better or worse name
[14:25:56] :)
[14:28:19] could possibly be painful to search for from a web search engine, but not sure that's important; cloudelastic points to Elastic Cloud
[14:44:56] errand
[15:13:58] I updated the ticket with a link to this etherpad: https://etherpad.wikimedia.org/p/elastic-rename-suggestions-T387028 . Add your name suggestions to the etherpad and maybe we can vote on them at a future standup
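For context on the forced-reindex flags discussed at 09:14–09:16, here is a minimal sketch of what such an invocation could look like. It assumes the standard mwscript wrapper and CirrusSearch's UpdateOneSearchIndexConfig.php maintenance script; the wiki name and --indexSuffix value are illustrative assumptions, not taken from the log.

```sh
# Hedged sketch only: force a reindex even when no mappings/settings change
# would require one. Wrapper, wiki and suffix below are illustrative.
mwscript extensions/CirrusSearch/maintenance/UpdateOneSearchIndexConfig.php \
  --wiki=enwiki \
  --indexSuffix=content \
  --reindexAndRemoveOk \
  --indexIdentifier now \
  --ignoreIndexChanged
```

Per the 09:16 message, --reindexAndRemoveOk with --indexIdentifier now is the pair that actually drives the reindex; --ignoreIndexChanged only skips the shortcut that would otherwise decide no reindex is needed.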
[15:13:58] T387028: Decide on a new name for Elastic hosts - https://phabricator.wikimedia.org/T387028
[15:27:22] image suggestions are resuming, this is going to push 24M tags to commonswiki :/
[15:28:13] hmm
[15:28:29] i forgot how we expect that to work, it's rate limited in production and will produce for quite some time?
[15:30:06] can't remember how search-loaders are throttling their work...
[15:30:14] might run for quite some time indeed
[15:30:31] 100/s at 86,400 s/day would be like 3 days. It's probably higher though?
[15:31:06] Matthias implemented a new rounding technique to limit the volatility of the scores, hopefully the next runs will be a lot smaller
[15:36:19] it's already at 300/s
[15:37:08] hm, and mjolnir is complaining now...
[15:37:51] hmm, the fail rate isn't that high.
[15:38:06] no, 2.5/s, the alert threshold must be quite low
[15:38:55] wondering what the failure is, looking at logs
[15:39:46] mjolnir rarely fails (never in the last 90 days)
[15:40:11] looks like throttling in the mjolnir_bulk daemon is mostly just about parallelism limits, 10 threads with 100 items per bulk
[15:43:32] it does not report individual failures, only "POST https://search.svc.eqiad.wmnet:9243/_bulk [status:200 request:0.556s]"
[15:44:27] hmm, the code makes it look like it should. Both places that use Metric.FAILED.inc() log str(result)[:1024]
[15:44:30] hm... I'm perhaps not on the right loaded
[15:44:38] s/loaded/loader
[15:44:45] was looking at the head of the logs
[15:45:19] * dcausse should look at logstash, not at journalctl
[15:45:37] funny, because most of the time i find journalctl / kubectl logs to be better than logstash :P
[15:45:52] (for narrow debugging, not for general finding of things)
[15:47:30] not finding anything in logstash...
[15:48:04] ah, they're not ecs logs even though they look like they should be
[15:48:13] dcausse: https://logstash.wikimedia.org/goto/1bf4066a46ad94a58a92b3216b5f50e9
[15:48:39] thanks!
[15:48:39] mostly version_conflict_engine_exception
[15:49:06] dcausse: i also am never quite sure... i'm mildly surprised that starting from the mediawiki dashboard, removing the type: mediawiki filter, and then searching for our hosts finds the logs
[15:49:31] yes...
[15:49:35] version_conflict... naively guessing we're updating the same ids from multiple threads?
[15:49:44] program:mjolnir-kafka-bulk-daemon helps but you have to know it
[15:50:08] but the seqNo jumps quite a bit... although i'm not sure how it's incremented
[15:50:16] ebernhardson: yes, we usually have retry on conflict, but surprising to see this here
[15:53:11] hmm, should we enable retry_on_conflict here? it doesn't look referenced anywhere in the loader
[15:53:36] probably
[15:54:44] looks like it would have to be in the source files as part of the action line, or we would have to mutate the action lines. But we try to not decode them and just pipe things through
[15:56:33] ah, makes sense... the loader does not want to understand what's in there...
[15:57:07] we should have seen such errors in the past tho, esp for inclinks
[15:57:30] looking at the graph, it's the first time I see the error count going above 0
[15:57:51] indeed, it is a bit surprising. I suppose i would have to dig into the files, but it makes me think the same (wiki, pageId) pair is updated multiple times
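To make the retry_on_conflict discussion at 15:49–15:56 concrete: in the Elasticsearch/OpenSearch bulk API that setting lives on each update's action metadata line, which is why the loader would either need it already present in the source files or would have to rewrite action lines it otherwise pipes through untouched. A rough sketch follows; the index and id are borrowed from the example discussed later in the log, and the document body is made up for illustration (real image-suggestion updates are produced upstream and forwarded unmodified by the daemon).

```sh
# Hedged sketch: a bulk update whose action line carries retry_on_conflict.
# The payload is illustrative only; only the _bulk endpoint is from the log.
cat <<'EOF' > /tmp/bulk-sample.ndjson
{"update": {"_index": "maiwiki_content", "_id": "14248", "retry_on_conflict": 3}}
{"doc": {"some_field": "illustrative value only"}}
EOF
curl -s -H 'Content-Type: application/x-ndjson' \
  --data-binary @/tmp/bulk-sample.ndjson \
  https://search.svc.eqiad.wmnet:9243/_bulk
```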
[15:57:52] wondering if the source data has dups
[15:58:00] yes
[15:59:01] * ebernhardson pokes around to find an example file
[16:02:27] it does have duplicates
[16:02:49] ah, I think there's one update per tag instead of per article
[16:03:13] there are two tags now, article-level & section-level image
[16:03:21] and way more for commons
[16:03:22] ahh, so they need to do some groupby
[16:03:26] hm...
[16:03:43] their table schema might not allow that, looking
[16:04:09] i'm not sure what their code does, but it must be possible to reshape the data from their table?
[16:04:56] yes, we have some glue code on our side so there must be a place
[16:06:12] some i'm seeing are literal duplicates. As a random example, in file 600, (maiwiki_content, 14248) has "recommendation.image_section/__DELETE_GROUPING__" in one update, and then "recommendation.image_section/__DELETE_GROUPING__" again later in the file
[16:06:13] yes, it's wikiid,namespace,pageid,tag,array of values
[16:06:42] :/
[16:06:47] sounds like a bug
[16:07:04] from: curl https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/search_updates/2025-03-21T15:33:17.149191Z/20250310T0/freq=weekly/image_suggestions/image_suggestions/part-00600.gz | zcat | grep -A 1 14248
[16:09:38] filing a task or two
[16:27:24] workout, back in ~30
[16:48:00] time to start the weekend! Have fun!
[16:53:51] enjoy!
[16:54:00] * ebernhardson suspects GrowthExperiments doesn't run CI properly? I added tests to unit/HomepageHooksTests and the other tests in the file fail because the function that mocks all the arguments to HomepageHooks doesn't match the actual constructor signature
[16:54:05] filed T389643 & T389639
[16:54:06] T389643: Adapt or transform image_suggestions_search_index_delta to allow creating one update per article - https://phabricator.wikimedia.org/T389643
[16:54:06] T389639: Duplicate tags in analytics_platform_eng.image_suggestions_search_index_delta - https://phabricator.wikimedia.org/T389639
[16:54:30] :/
[16:55:19] also, this constructor has double-digit arguments... always fun :P
[16:56:47] :)
[17:08:46] back
[17:09:12] heading out, have a nice weekend
[17:11:03] .o/
[17:11:25] I'm sure they disabled linting too many args ;P
[17:14:58] **for** too many args, that is
[17:52:48] hmm, so CI passes the GrowthExperiments test suite even though i think it shouldn't... not sure how to check it actually ran the tests in tests/phpunit/unit
[17:58:00] lunch, back in ~40
[18:10:17] and the answer is... can you guess from the name? HomepageHooksTests.php
[18:10:25] The test suite only runs files named *Test.php
[18:13:05] and poking at the prod deployment, at least 5 test files are probably not being run. maybe also three more
[18:17:49] hmm, no, some were intentional. RevisionRecordTests in core is a trait that then gets included elsewhere
[18:36:41] back
[20:25:49] * inflatador just remembered that our 'ban' cookbook can take a datacenter row as an argument
[20:26:56] so we could reimage pretty easily without worrying about the rolling operation code. Of course, we won't have an easy way to reboot the whole cluster until we fix it
[20:27:21] hmm, seems plausible
[20:50:48] the next question is: rename before reimaging, or punt that to a future date? I think it makes sense to do it now, but I'm afraid of what Puppet weirdness we might uncover
[20:51:19] hmm, i suppose i would probably just do it now. reimaging takes a good bit of time
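As a follow-up to the duplicate hunt around 16:02–16:07, here is a hedged sketch of how that spot check could be made systematic. It assumes the part files are gzipped Elasticsearch bulk NDJSON whose action lines carry _index and _id, and that jq is available; the URL is the one quoted at 16:07.

```sh
# Hedged sketch: list (index, id) pairs that appear in more than one update
# within a single part file. File-format assumptions as noted above.
curl -s 'https://ms-fe.svc.eqiad.wmnet/v1/AUTH_analytics/search_updates/2025-03-21T15:33:17.149191Z/20250310T0/freq=weekly/image_suggestions/image_suggestions/part-00600.gz' \
  | zcat \
  | jq -r 'select(.update != null) | "\(.update._index)\t\(.update._id)"' \
  | sort | uniq -d
```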
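Regarding the 18:10 finding that the suite only picks up files named *Test.php, one way to spot stragglers could be something like the following; the repository path is an assumption, not taken from the log, and (as noted at 18:17) some hits such as trait files may be intentional.

```sh
# Hedged sketch: list phpunit files whose names end in "Tests.php" and thus
# would not match a *Test.php file-name filter. Path is illustrative.
find extensions/GrowthExperiments/tests/phpunit -name '*Tests.php'
```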
[20:52:06] tell me about it ;(. And I guess if we're going to run into Puppet problems, it may as well be while one DC is depooled
[20:56:18] I guess we can figure out a name at Monday standup and go from there
[21:02:29] +1
[21:06:25] have you ever used tmuxp (https://tmuxp.git-pull.com/configuration/examples.html#id3)? I'm using it to refactor my awful tmux maintenance viewer and it seems pretty nice so far
[21:57:50] we ran the rolling-operation again in cloudelastic btw, same results... the cluster went red. Still log-diving, but we're pretty sure it recovered on its own
[22:03:22] looks like we have 2 units for the same thing? opensearch_1@cloudelastic-chi-eqiad.service loaded active running OpenSearch (cluster cloudelastic-chi-eqiad)
[22:03:23] ● opensearch_1@cloudelastic-eqiad.service loaded failed failed OpenSearch (cluster cloudelastic-eqiad)
[22:05:32] in other words, we have both an opensearch_1@cloudelastic-eqiad.service and an opensearch_1@cloudelastic-chi-eqiad.service
[22:22:46] ryankemper Added my notes to T383811 before I forget everything ;P Feel free to add anything I missed
[22:22:47] T383811: Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811
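For the duplicate-unit observation at 22:03–22:05, a hedged sketch of how the opensearch_1@ template instances could be enumerated on the affected host; the commands are standard systemctl usage, and the failed instance name is the one pasted in the log.

```sh
# Hedged sketch: show every instance of the opensearch_1@ template, including
# inactive/failed ones, then inspect the failed instance named in the log.
systemctl list-units --all 'opensearch_1@*'
systemctl status 'opensearch_1@cloudelastic-eqiad.service' --no-pager
```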