[01:22:31] One consequence of us automatically setting cluster seeds is that they can briefly fail during rolling restarts (`SystemdUnitFailed: push_cross_cluster_settings_9600.service on elastic1083:9100`)
[01:23:12] We might need to have the cookbook downtime every host in the cluster rather than the few it's operating on, although that comes with its own issues. Or perhaps we just downtime the specific alert
[07:30:32] was about to ship the config patch to disable writes from the jobqueue
[07:30:55] but need to re-enable the SUP consumer in eqiad first
[07:38:01] sigh... might not be able to get that done this morning, and this afternoon window is already almost full
[07:56:26] ebernhardson: I could not get it deployed, moved it to the late backport window, should be the first one
[08:35:33] dcausse: have you re-enabled the consumer already or should that be done right before the config deployment?
[08:36:12] pfischer: I think it should be done a couple hours before disabling jobqueue writes
[08:36:19] so probably this afternoon?
[08:37:10] Sure, I'll do that.
[08:37:16] thanks!
[09:32:02] hm.. https://www.wikidata.org/wiki/Wikidata:Report_a_technical_problem/WDQS_and_Search#Item_not_searchable_by_initial_part_of_Label_or_its_alias
[09:32:47] Q125918173 does indeed seem to be missing from codfw
[09:50:16] searching for traces of it in eqiad.cirrussearch.update_pipeline.update.rc0 & eqiad.mediawiki.page_change.v1 from kafka@codfw
[09:53:09] found 6 revisions in eqiad.cirrussearch.update_pipeline.update.rc0
[09:53:49] including rev_id: 2153616599 (the latest one)
[09:57:48] Hm, so they never made it into one of the indices?
[09:59:00] pfischer: apparently no... not sure where they got lost yet, but the producer is not to blame so it must be somewhere in the consumer
[09:59:54] found something in the error stream
[10:01:34] but only from the consumer of cloudelastic
[10:03:47] …looking at the logs
[10:04:47] pfischer: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2024.20?id=J1ZhdI8Bzo4wKbLAJ1WQ
[10:06:34] Hm, sounds like the schema is too restrictive
[10:07:33] pfischer: https://phabricator.wikimedia.org/P62377
[10:09:13] Thanks, I guess I’ll broaden the range of rev_id
[10:11:43] pfischer: not sure I see where this range is set, from the schemas it's min: 1, maximum: 9007199254740991
[10:12:59] That’s weird, we include the definition of that property from /fragment/mediawiki/state/entity/revision/1.0.0#/properties/rev_id and this is bounded by maximum: 9007199254740991, minimum: 1
[10:14:05] seems like jackson is willing to parse an int and that's a long
[10:14:29] Jep, just figured that
[10:16:47] It’s declared as “type”: “integer” and happens when a row is encoded, so we do not have a say via annotation on how to interpret this “integer”
[10:19:11] but GzipJsonAsyncResponseConsumer is parsing a plain JsonNode without any knowledge of the schema yet?
[10:20:17] or I'm missing something
[10:22:20] perhaps we need DeserializationFeature.USE_LONG_FOR_INTS ?
[10:23:12] Yep, I’m looking at the EventStreamFactory and how we can get ahold of the ObjectMapper
[10:23:42] …and yes, the gzip consumer just parses, unaware of any schema
[10:24:21] I think it's org.wikimedia.discovery.cirrus.updater.common.graph.CirrusFetcher#getObjectMapper() that needs to be adapted
[10:26:56] Hm, sure? I thought that this happens during encoding of failures
[10:27:58] Or do we just see the message in that logger?
[10:28:22] logger: org.wikimedia.discovery.cirrus.updater.common.graph.RouteFetchFailures
[10:28:49] I’ll figure it out
[10:30:59] pfischer: see the full stack in https://phabricator.wikimedia.org/P62377
[10:31:29] it's a failure happening when parsing the cirrus doc
[10:32:00] the failure is properly encoded
[10:32:46] consumer@codfw is not populating its error queue but that's probably another problem
[10:33:46] Jep, reproduced it locally. It’s in the gzip consumer, you’re right
[10:41:38] Grr, that’s a limitation of the async json implementation (hc5-async-json)
[10:44:41] :(
[10:44:46] filed T364837
[10:44:47] T364837: Q125918173 missing from elastic@codfw - https://phabricator.wikimedia.org/T364837
[10:50:21] surprising that these errors started only from may 7th
[10:52:08] could we have changed something in the way we package the app that could explain why we never saw that error before?
[10:55:04] Well, all the integration tests use revision IDs below the int boundaries
[10:55:32] I just replaced one of the rev IDs with 2153599188 and boom: We see that exact failure
[10:55:41] Upstream fix is almost ready.
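For reference, the failure above can be reproduced with plain Jackson databind: a JSON integer above Integer.MAX_VALUE only blows up when it gets coerced into an int-typed target, which is why integration tests with small rev IDs never caught it. A minimal sketch, assuming nothing about the SUP's real event types (IntRev/LongRev are made-up stand-ins; the actual fix landed upstream in hc5-async-json as noted above):

```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class RevIdOverflowDemo {
    // Hypothetical stand-ins for illustration only, not the SUP's classes.
    public static class IntRev { public int rev_id; }
    public static class LongRev { public long rev_id; }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String json = "{\"rev_id\": 2153616599}"; // > Integer.MAX_VALUE (2147483647)

        try {
            mapper.readValue(json, IntRev.class); // int-typed target overflows
        } catch (JsonProcessingException e) {
            // "Numeric value (2153616599) out of range of int ..."
            System.out.println(e.getOriginalMessage());
        }

        // A long-typed target (or DeserializationFeature.USE_LONG_FOR_INTS for
        // untyped/Map bindings) accepts the same document without complaint.
        System.out.println(mapper.readValue(json, LongRev.class).rev_id);
    }
}
```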
[11:08:16] lunch
[12:23:06] dcausse: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/128 - Once that’s deployed I could re-inject all events from the failure_topic that were discarded due to the long parsing.
[12:38:54] pfischer: thanks! merged
[12:57:58] dcausse: Thanks! Should we be smart and try to filter only the latest revisions when re-running the failed events or should I simply forward them as they are?
[13:00:03] pfischer: not sure that's worth it, but we should make sure to re-ship in the right order and keep the proper message key so that they end up in the right partition
[13:02:48] still puzzled why we don't have any errors in the error queue for the codfw consumer
[13:03:01] Yeah, sadly, the failure_topic is not keyed. So we have to create the keys from the information in raw_event
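A sketch of what re-shipping with reconstructed keys could look like, assuming a plain Kafka producer. The target topic name comes from the discussion above, but the key layout (wiki_id + ":" + page_id pulled out of raw_event) is an assumption for illustration, not the real keying scheme; sending everything in order through one producer with the per-record key is what keeps per-partition ordering intact.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FailureForwarderSketch {
    private static final String TARGET_TOPIC = "eqiad.cirrussearch.update_pipeline.update.rc0";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder brokers
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // In the real forwarder these values would be read from each
            // fetch_error record's raw_event field.
            String wikiId = "wikidatawiki";
            long pageId = 123L;
            String rawEvent = "{\"wiki_id\":\"wikidatawiki\",\"page_id\":123}";

            String key = wikiId + ":" + pageId; // assumed key format
            // Same key => same partition as the original update events.
            producer.send(new ProducerRecord<>(TARGET_TOPIC, key, rawEvent));
        }
    }
}
```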
[13:13:55] ah consumer@codfw has: fetch-error-topic: codfw.cirrussearch.update_pipeline.fetch_error
[13:14:14] but should be codfw.cirrussearch.update_pipeline.fetch_error.rc0
[13:14:38] seeing the errors in codfw.cirrussearch.update_pipeline.fetch_error
[13:14:41] o/
[13:16:50] o/
[13:19:13] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1031441 is fixing the name of the error topic
[13:22:37] trying to find the right metric to use to detect such problems and alert us
[13:28:42] not seeing a dedicated metric for that, could use the rate of ingested messages on the error topic but that might include cloudelastic+prod@eqiad in eqiad but only prod@codfw in codfw
[13:41:06] ah we might have flink_taskmanager_job_task_operator_numRecordsOut{operator_name="rerender_route_fetch_failures"}
[13:44:05] recordsOut/recordsIn might do the trick
[14:27:04] actually not finding a good way to use those :/
[14:32:57] I have sum(rate(flink_taskmanager_job_task_operator_numRecordsOut{operator_name="route_fetch_failures"}[5m]))/ on (job_name) sum(rate(flink_taskmanager_job_task_operator_numRecordsIn{operator_name="route_fetch_failures"}[5m]))
[14:34:17] but that does not make sense, recordsOut should be successful events
[14:39:27] hm that works for wdqs: with sum(rate(flink_taskmanager_job_task_operator_numRecordsOut{job_name="WDQS_Streaming_Updater", operator_name="RouteFailedOpsToSideOutput"}[5m]))/ on (job_name) sum(rate(flink_taskmanager_job_task_operator_numRecordsIn{job_name="WDQS_Streaming_Updater", operator_name="RouteFailedOpsToSideOutput"}[5m]))
[14:40:05] where it's close to 1 meaning that ~100% of fetches are successful
[14:41:27] but for the sup with sum(rate(flink_taskmanager_job_task_operator_numRecordsOut{job_name="cirrus_streaming_updater_consumer_search_codfw", operator_name="rerender_route_fetch_failures"}[5m]))/ on (job_name) sum(rate(flink_taskmanager_job_task_operator_numRecordsIn{operator_name="rerender_route_fetch_failures", job_name="cirrus_streaming_updater_consumer_search_codfw"}[5m]))
[14:41:42] it's close to 0...
[14:43:12] or we fail most of the rerender events
[14:59:50] \o
[15:00:16] that sounds...not good :S
[15:00:32] could add a new metric or use the sink metric but we can't make a ratio using solely the sink metric
[15:01:13] sum(rate(flink_taskmanager_job_task_operator_numRecordsOut{job_name="cirrus_streaming_updater_consumer_search_codfw", operator_name="rerender_fetch_failure_sink:_Writer"}[5m]))
[15:01:21] sum(rate(flink_taskmanager_job_task_operator_numRecordsOut{job_name="cirrus_streaming_updater_consumer_search_codfw", operator_name="fetch_failure_sink:_Writer"}[5m]))
[15:01:37] we fail 6 rev based events/s and ~2 rerenders
[15:02:08] hard to tell what's acceptable for rerenders
[15:02:21] for rev based 6/s is definitely not good
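A sketch of the "add a new metric" option mentioned above: a dedicated attempted/failed counter pair on the fetch step would give a clean ratio from a single operator, instead of juggling numRecordsIn/numRecordsOut across operators. Illustrative only; the class and metric names are invented, not how the SUP operators are actually structured:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.util.Collector;

public class CountingFetchFunction extends RichFlatMapFunction<String, String> {
    private transient Counter fetchAttempts;
    private transient Counter fetchFailures;

    @Override
    public void open(Configuration parameters) {
        // Surfaced by the Prometheus reporter under the operator scope, e.g.
        // flink_taskmanager_job_task_operator_fetch_attempts / _fetch_failures.
        fetchAttempts = getRuntimeContext().getMetricGroup().counter("fetch_attempts");
        fetchFailures = getRuntimeContext().getMetricGroup().counter("fetch_failures");
    }

    @Override
    public void flatMap(String event, Collector<String> out) {
        fetchAttempts.inc();
        try {
            out.collect(fetch(event));
        } catch (Exception e) {
            fetchFailures.inc(); // the real job would also route the failure to the error sink
        }
    }

    private String fetch(String event) throws Exception {
        return event; // placeholder for the real fetch call
    }
}
```

An alert could then compare rate(fetch_failures) against rate(fetch_attempts) per job_name without guessing what recordsOut means for a given operator.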
[15:02:34] o/
[15:04:32] curiously not much coming out of logging if that many are failing, in the last 15m just some warnings that seem unrelated (statsd-exporter something about ratelimit)
[15:04:43] for eqiad
[15:26:55] ebernhardson: sadly it's a warn without a stack trace when looking at the log
[15:51:41] workout, back in ~40
[15:57:17] i didn't look super closely, but when talking with dcausse, had to look. the diff bearing the maximum int value occurred on may 7 ... https://www.wikidata.org/w/index.php?diff=2147483647 <-- look at that number
[16:12:13] anyway, pfischer's patch fixes it, it sounds like
[16:13:59] dcausse: finally deployed the patch to codfw. Did the cirrus config change for eqiad get deployed by any chance? I forgot to deploy the SUP for eqiad. :-(
[16:14:07] if rev_id went past Integer.MAX_VALUE just last week I expect a few more tools to hit that issue or start fetching negative revisions :(
[16:14:44] pfischer: no it's still scheduled for tonight's backport window
[16:15:02] it's in 4 hours or so
[16:15:12] Okay, I’ll deploy SUP on eqiad then, alright?
[16:15:18] yup
[16:15:23] pfischer: you'll need to set the kafka start timestamp
[16:15:55] yesterday i used: --set 'app.config_files.app\.config\.yaml.kafka-source-start-time=2024-05-13T17:00:00Z'
[16:17:01] Hm, sth. does not look right for eqiad: “saneitize-cluster-group: codfw"
[16:17:20] pfischer: doh, yea that would check the wrong things. thought i double checked those :(
[16:17:34] what i really want is a better templating method where i don't repeat a bunch of values all over the place... but for another day :P
[16:19:40] pfischer: did you replay failed events?
[16:20:11] dcausse: no, not yet. If the deployment can wait, I can focus on that replay.
[16:21:04] pfischer: oh sorry, no we should probably deploy the fix first to codfw & cloudelastic before reshipping those
[16:24:22] Sorry, I meant the SUP deployment to eqiad. I already deployed the fix to codfw.
[16:27:38] pfischer: either way, wikidata users hitting codfw are starting to complain so we should probably backfill "soonish"
[16:33:20] random unrelated autocomplete bug i just noticed, on en.m.wikipedia.org `EXA` will put the EXA page at top and not mention the Exa metric prefix. `exa` will put the metric prefix at the top and not mention EXA
[16:37:44] sigh... it's ranked pretty low: https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=exa&namespace=0&limit=100
[16:39:06] the mysql match is case sensitive and helps, but completion does not rank these matches higher; a partially matched string or a fully matched one gets only the score computed at index time
[16:40:01] i suppose in a way it makes sense, other than as an exact match the EXA page is a super niche thing
[16:40:26] although, i guess most of wiki is niche things :P
[16:40:33] I remember we talked about helping these "short titles" by using the title length in the score
[16:40:35] :)
[16:41:06] random fun fact, EXA apparently stands for "acceleration architecture with no well-defined acronym"
[16:41:53] can we add another level of autocomplete query that somehow only matches the full string? maybe some minimum match
[16:41:55] * ebernhardson pokes docs
[16:43:47] we could perhaps add an "end of title char" and have another match with it?
[16:44:05] hmm, that would perhaps work to pull them up to the top
[16:44:57] didn't look super closely, but no obvious way to provide a minimum number of character matches in the suggester
[16:45:15] but the marker would probably work
[16:45:23] yes, I can't remember seeing an option like that
[16:45:57] No hurry, but I created T364861 to discuss the elastic restart/SLO stuff
[16:45:58] T364861: Check SLO impact of Elastic cluster rolling restarts/mitigate if necessary - https://phabricator.wikimedia.org/T364861
[16:46:19] also there was some worry that some short searches might yield all the case variations, I would not be surprised if there are pages like aaa, aAA, aAa, Aaa, AAa
[16:47:00] hmm, yea i suppose that would be something to test. might not be too hard, maybe snapshot an index over to relforge and then maybe a reindex with a script could append the char to test with
[16:47:20] sure
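The "end of title char" idea is essentially string plumbing around the completion field: append a marker to a second suggest input at index time and to the query for an extra exact-title match, so complete titles can be boosted. A purely illustrative sketch, with an invented marker character and helper names; the actual matching and boosting would still live in the Elasticsearch suggester configuration:

```java
public class TitleTerminatorSketch {
    private static final char END_OF_TITLE = '\u0001'; // arbitrary marker choice

    // Extra suggest input stored at index time alongside the normal one.
    static String markedInput(String title) {
        return title.toLowerCase() + END_OF_TITLE;
    }

    // Prefix used for a second, exact-title completion query.
    static String exactTitleQuery(String userQuery) {
        return userQuery.toLowerCase() + END_OF_TITLE;
    }

    public static void main(String[] args) {
        // "exa" + marker fully matches the marked input of the page "EXA" ...
        System.out.println(exactTitleQuery("exa").equals(markedInput("EXA"))); // true
        // ... but is not a prefix of "Exabyte" + marker, so only complete
        // titles surface in the exact-match pass and can be boosted to the top.
        System.out.println(markedInput("Exabyte").startsWith(exactTitleQuery("exa"))); // false
    }
}
```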
[17:44:05] dinner
[17:53:58] lunch, back in time for pairing
[18:09:02] * ebernhardson finds gitlab docs odd... for example the prerequisite for the package download api is "You need to authenticate with the API"
[18:09:06] except, thats not a requirement anywhere
[18:09:29] its like they are documenting some specific case instead of how their software works
[18:17:17] back
[18:18:15] will be ~10mins late to pairing
[19:26:26] dcausse: forwarder is running in dry-mode now: roughly 4 million events would be forwarded from fetch_error.rc0 to update.rc0
[19:27:40] …more like 4.7 million until the fix was rolled out
[19:33:30] inflatador: ebernhardson: btw proposed patch for the envoy error rate alert: https://gerrit.wikimedia.org/r/c/operations/alerts/+/1031543 (lmk your thoughts, I think alerting when 1/200 requests fail is a good balance)
[19:33:40] ouch. But i suppose reasonable if we rate limit the push? Even if we don't i guess thats about an hour or two of lag?
[19:34:12] ^ was about the fetch_error forwarding
[19:35:02] historically i guess we would have simply pushed that all into the job queue and let it backlog
[19:40:45] I’m re-running the dry-run to get some insights into how the errors are distributed across wikis. If it’s limited to only wikidata, we could run a separate backfill with wiki-filter
[19:44:35] Well, it’s 99.9% wikidatawiki
[19:48:09] pfischer: seems reasonable to backfill wikidata
[19:52:54] pfischer: makes sense
[20:06:50] Alright. Since we are in control of the timestamp of the kafka-records we produce, we could set them all to the same value so we can narrow the backfill start/end boundaries. But it does not feel right to pollute the update.rc0 topic. Couldn’t we use a temporary topic for that?
[20:07:24] pfischer: i was thinking we could backfill from the initial update events? It basically needs to do everything wikidata
[20:08:17] Yeah, so no need to forward anything. That’s indeed cleaner
[20:09:32] shipping a new version of discolytics now btw, with the empty df fixes
[20:09:47] Thanks!
[20:09:59] this needs more automation, it's a bunch of steps :P
[20:10:25] Okay. What has to be done? Is this documented anywhere?
[20:17:44] pfischer: partially documented, it's really not that much but it's taken me a little while interspersed with other tasks. it's mostly release discolytics (button push), update dag_search.py and artifacts.yaml in airflow-dags, then scap deploy airflow-dags
[20:25:03] pfischer: about to turn off cirrus writes
[20:25:07] for 25%
[20:25:41] deploying the pods so both sides are running
[20:27:21] can we simply use the consumer-search-backfill helm release for replaying wikidata?
[20:27:47] yea
[20:27:52] with the script, hopefully
[20:31:55] I would deploy consumer-search-eqiad now, the config should be fixed.
[20:32:24] pfischer: it's deployed
[20:32:28] (like 5 min ago)
[20:33:04] i'm watching the cirrus logs to make sure nothing silly is coming out, there are a surprising number of (likely unrelated) couldn't connect to host problems
[20:33:12] still something wonky network/lvs/etc :S
[20:33:27] Oh, thanks. I’m too slow. - Sure
[20:34:15] Those connection errors are coming from the fetch operator?
[20:34:53] pfischer: no, from cirrus
[20:35:08] https://logstash.wikimedia.org/goto/17c39f692a574395979dfadc3643bccb
[20:35:14] :eyes
[20:35:26] this might not be working right...but watching
[20:35:36] it's expected to log spam a little on deploy, due to queued jobs that are no longer valid
[20:37:14] you mean "http_exception: Unknown error:52"?
[20:37:37] and "Search backend error during sending 1 documents to the slwiki_content index(s) after 2: http_exception: Couldn't connect to host, Elasticsearch down?"
[20:37:42] I see 55 even 56
[20:38:13] i suspect this didn't do the empty list as expected somewhere, logs are dying down
[20:38:22] oh nevermind, they did
[20:40:39] meh can't see consumer-search@eqiad logs in logstash
[20:41:14] i wrote a bash script that invokes kubectl logs, that i can call over ssh and show in a shell ... it works :P
[20:41:34] :)
[20:42:03] consumer-cloudelastic might need redeploy to get Peter's fix
[20:42:09] If I can help on the network/LVS side LMK
[20:42:15] basically: https://phabricator.wikimedia.org/P62389
[20:42:51] thanks :)
[20:43:31] i suppose i actually wrote that because i wanted to use fblog without going through installing it to the cluster, so this runs it locally as well
[20:45:24] as a random guess for why the logs went on longer than i expected for unwritable cluster... i guess the deploy takes long enough that some jobs were still being enqueued with the old while the new was deploying
[20:45:25] Okay, I’ll give it a try. Thanks!
[20:45:27] redeploying consumer-cloudelastic
[20:47:46] Thanks!
[20:48:07] "out of range of int" errors are gone but there are way too many document_missing_exception on cloudelastic :(
[20:48:25] :S
[20:48:36] ◉_◉
[20:48:38] i suppose this updater should have more of those, since we don't upsert by default anymore
[20:48:45] but it should have less, because the index should be correct :P
[20:49:51] ah document_missing_exception! somehow I read index_missing_exceptions :)
[20:50:46] oh, sadly there are a couple of those still in cirrus logs :( I suspect the cross-wiki config loading is breaking more
[20:50:56] but haven't looked deeply, just noticed the logs earlier today
[20:51:32] maybe we should have some sort of scheduled log review, there are various problems here that we ignore
[20:55:18] you mean the connectivity errors between cirrus and elastic?
[20:59:00] dcausse: mostly. also the "Failed to fetch config for nlwikiquote at <...>: timeout" ones
[20:59:40] although maybe those just happen randomly and they mostly work....but there are index_not_found errors that i suspect are related
[21:00:34] :/
[21:02:24] crossproject search still works somehow so it must be "transient"
[21:03:30] hmm, yea just looking at same and it must not be completely broken, it's still injecting the cross-cluster names
[21:04:29] ~10k errors in 15 days
[21:04:47] this api request should be mem only
[21:05:01] yea, any timeout makes me think firewall
[21:07:36] i suppose they at least look very consistent over the last 4 weeks :P
[21:07:39] oh well
[21:08:20] :)
[21:08:36] perhaps throwing a quick retry might help
[21:09:43] Yes, I was about to write that. Right now the sync clients do not retry at all
[21:10:55] The rate-limit PR introduces a strict retry strategy limited to 429s but we could expand that to handle timeouts too.
[21:13:04] pfischer: unfortunately these errors are cirrus side, in the app servers: if de.wikipedia.org wants to query de.wikibooks.org it has to request de.wikibooks.org/w/api.php?action=cirrus-config-dump to load the related config. This gets cached but has to refresh regularly. Those are failing
[21:13:15] it's always been a bit of a hack, wikis don't generally make api calls to each other
[21:13:47] mostly that config is to figure out what cluster the index lives in, but it also uses some query time stuff
[21:14:24] if we can't load the config we assume the other wiki config looks like ours and try sending a search query, that then fails with index_not_found because the index lives in a different cluster and needs the cross-cluster syntax
[21:15:04] users probably don't notice, but some of the sister-search results will be missing on Special:Search
[21:21:43] Okay, that makes more sense, I started to wonder when I couldn’t find the log messages, mentioned above, in the SUP code. Thanks. So any sister-query would 1) fetch the config of the sister project and 2) use that config to try a request to that sister project's backing ES cluster directly?
[21:23:26] Why not use an API call to that sister project to get the search results instead of just the config?
[21:23:36] pfischer: mostly, but step 1 has a server-local cache (iirc, it might be per-dc) of the config so it's not always requested, then querying the extra indexes is done through an msearch request, basically we send many requests at once to elastic and it routes them (even cross-cluster) appropriately
[21:24:33] Ah, so that would perform better then fanning out HTTP requests to other wikis.
[21:24:41] s/then/than/
[21:24:43] but the cross-cluster routing requires us to prefix the cluster name to the index name, like omega:eowiktionary_content
[21:24:59] yea
[21:28:04] Okay, got it. Thank you.
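To make the cross-cluster part above concrete: the sister wiki's cirrus-config-dump says which cluster group its index lives on, and the msearch entry then needs the cross-cluster prefix (e.g. omega:eowiktionary_content); when the config fetch times out, the fallback is a bare local index name, which is where the index_not_found errors come from. A hypothetical sketch only, not CirrusSearch's actual (PHP) code, and the config shape is assumed:

```java
import java.util.Map;
import java.util.Optional;

public class CrossClusterIndexName {
    // sisterConfig stands in for a cached action=cirrus-config-dump response;
    // the "cluster" key is an assumed shape for this sketch.
    static String msearchIndex(String wikiId, Optional<Map<String, String>> sisterConfig) {
        String indexName = wikiId + "_content";
        return sisterConfig
                .map(cfg -> cfg.get("cluster"))
                .filter(cluster -> !cluster.isEmpty())
                .map(cluster -> cluster + ":" + indexName) // e.g. "omega:eowiktionary_content"
                .orElse(indexName); // config fetch failed: bare name, index_not_found on the local cluster
    }

    public static void main(String[] args) {
        System.out.println(msearchIndex("eowiktionary", Optional.of(Map.of("cluster", "omega"))));
        System.out.println(msearchIndex("eowiktionary", Optional.empty()));
    }
}
```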
also the "Failed to fetch config for nlwikiquote at <...>: timeout", [20:59:40] although maybe those just happen randomly and they mostly work....but there are index_not_found errors that i suspect are related [21:00:34] :/ [21:02:24] crossproject search still works somehow so it must be "transient" [21:03:30] hmm, yea just looking at same and it must not be completely broken, it's still injecting the cross-cluster names [21:04:29] ~10k errors in 15days [21:04:47] this api request should be mem only [21:05:01] yea, any timeout makes me think firewall [21:07:36] i suppose they at least look very consistent over the last 4 weeks :P [21:07:39] oh well [21:08:20] :) [21:08:36] perhaps throwing a quick retry might help [21:09:43] Yes, I was about to write that. Right now the sync clients doe not retry at all [21:10:55] The rate-limit PR introduces a strict retry strategy limited to 429s but we could expand that to handle timeouts too. [21:13:04] pfischer: unfortunately these errors are cirrus side, the in the app servers if de.wikipedia.org wants to query de.wikibooks.org it has to request de.wikibooks.org/w/api.php?action=cirrus-config-dump to load the related config. This gets cached but has to refresh regularly. Those are failing [21:13:15] it's always been a bit of a hack, wikis don't generally make api calls to each other [21:13:47] mostly that config is to figure out what cluster the index lives in, but it also uses some query time stuff [21:14:24] if we can't load the config we assume the other wiki config looks like ours and try sending a search query, that then fails with index_not_fonud because the index lives in a different cluster and needs the cross-cluster syntax [21:15:04] users probably don't notice, but some of the sister-search results will be missing on Special:Search [21:21:43] Okay, that makes more sense, I started to wonder when I couldn’t find the log messages, mentioned above, in the SUP code. Thanks. So any sister-query would 1) fetch the config of the sister project and 2) use that config to try a request to that sister projects backing ES cluster directly? [21:23:26] Why not use an API call to that sister project to get the search results instead of just the config? [21:23:36] pfischer: mostly, but step 1 has server-local cache (iirc, it might be per-dc) of the config so it's not always requested, the then querying the extra indexes is done through an msearch request, basically we send many requests at once to elastic and it routes them (even cross-cluster) appropriately [21:24:33] Ah, so that would perform better then fanning out HTTP requests to other wikis. [21:24:41] s/then/than/ [21:24:43] but the cross-cluster routing requires us to prefix the cluster name to the index name, like omega:eowiktionary_content [21:24:59] yea [21:28:04] Okay, got it. Thank you. [22:36:24] hmm, just saw an alert recovery for 'MediaWiki CirrusSearch update rate - codfw ' but I never saw the alert [23:47:32] something about that metric is quite odd, it was basically 0 until a couple hours ago but it looks like it should be the number of writes [23:48:13] it went to on 5/9, and came back 5/14 [23:50:36] Pretty sure thats coming from DataSender::reportUpdateMetrics, seems impossible it was actually 0 for 5 days