[08:49:33] https://github.com/nomoa/flink-connector-elasticsearch/commit/4fb2b577cc5965cd9977e663116e996b37903190 is what I have so far, not very happy with it but it gives access to the context and lets the user create more metrics from a BulkItemResponseHandler supplied when creating the sink
[08:52:37] I'm not sure about the signature of the onResponse method returning SUCCESS, RETRY or ERROR, esp. RETRY (i.e. how do you track the number of retries for a particular item)
[09:05:00] retrying this way also changes the order of operations, which might be surprising for some users...
[09:09:02] wdqs1022 & wdqs1023 are a bit unstable, ssh & systemctl alerts flapping (seeing weird things like systemd-timedated.service: Unexpected error response from GetNameOwner(): Connection terminated)
[10:34:10] dcausse: cool, thanks. Just an idea to get more control over retries: if the `responseHandler` is in charge of building the action to be retried, it could keep track of which actions have been retried (and how often).
[10:38:02] pfischer: yes, that'd be up to the client code implementing BulkItemResponseHandler (which could be a class also implementing the Emitter) to keep a kind of Map retryCounter or the like
[10:38:41] providing a retry counter/limit from the sink seems difficult
[10:42:43] another thought was to use some hints/metadata in the elastic data classes (DocWriteRequest children) to keep and increment a counter, but I haven't found a place to store that
[10:51:30] lunch
[11:22:58] Good morning. Could you check on the status of your Airflow jobs on an-airflow1005 please? I have received some alert emails to discovery-alerts@lists.wikimedia.org with the message: `Executor reports task instance finished (failed) although the task says its queued. (Info: None) Was the task killed externally?`
[11:23:53] I have upgraded Airflow to version 2.7.3 and restarted the Airflow services, so it is possible that these tasks were killed externally.
[11:24:35] I paused all active DAGs about 15 minutes before the upgrade, in the hope that their tasks would run to completion. Then I unpaused them just after the upgrade.
[12:28:02] btullis: I wonder why all those alert mails stop at “Try 4 of 5”. According to the YARN UI, there have been successful runs (state finished, around 12:35 UTC) for all applications that I see warnings for.
[12:29:14] pfischer: Thanks for looking into it. 12:35 UTC hasn't happened yet; were these runs from yesterday?
[12:40:54] Maybe I got the time wrong, but I looked at the finished applications: start time is Wed Nov 29 12:35:14 +0100 2023. That would be 12:35:14 CEST, right? https://yarn.wikimedia.org/cluster/app/application_1695896957545_374146
[12:41:31] The detail view shows UTC, so 11:35:14, btullis
[13:30:20] dcausse: I wonder if we should build in support for retries after all. I thought of mapping the failed action back to an event (so we can route them to a side output), but due to the ES emitter this is not necessarily a 1:1 mapping.
[13:42:06] pfischer: true, not sure either... we might need to gain more operational knowledge on possible errors and see if there are cases where it might make sense
[14:15:10] o/
[14:26:15] o/
[14:58:26] depooling wdqs/wcqs eqiad in preparation for https://phabricator.wikimedia.org/T326409
[15:22:56] inflatador: We just got an alert about `RdfStreamingUpdaterFlinkJobUnstable search-platform (WDQS_Streaming_Updater rdf-streaming-updater k8s critical eqiad prometheus)` - this is expected, isn't it?
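To make the retryCounter idea from the morning discussion concrete, here is a minimal sketch. It assumes a hypothetical BulkItemResponseHandler whose onResponse returns SUCCESS/RETRY/ERROR as described above; the interface shape, BulkItemResult, and CountingResponseHandler are illustrative and not taken from the linked commit.

```java
// Hypothetical sketch: the interface shape and names below follow the IRC discussion,
// not the linked commit; retry bookkeeping lives entirely in client code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.elasticsearch.action.DocWriteRequest;
import org.elasticsearch.action.bulk.BulkItemResponse;

enum BulkItemResult { SUCCESS, RETRY, ERROR }

/** Assumed shape of the per-item callback discussed above. */
interface BulkItemResponseHandler {
    BulkItemResult onResponse(DocWriteRequest<?> action, BulkItemResponse response);
}

/** Client-side handler that caps retries by tracking a per-document counter. */
class CountingResponseHandler implements BulkItemResponseHandler {
    private final int maxRetries;
    // Keyed by index + doc id; a real job would need to bound/evict this map.
    private final Map<String, Integer> retryCounter = new ConcurrentHashMap<>();

    CountingResponseHandler(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    @Override
    public BulkItemResult onResponse(DocWriteRequest<?> action, BulkItemResponse response) {
        String key = action.index() + "/" + action.id();
        if (!response.isFailed()) {
            retryCounter.remove(key);
            return BulkItemResult.SUCCESS;
        }
        int attempts = retryCounter.merge(key, 1, Integer::sum);
        return attempts <= maxRetries ? BulkItemResult.RETRY : BulkItemResult.ERROR;
    }
}
```

Whether the counter is keyed by index/id or carried as metadata on the DocWriteRequest is exactly the open question from 10:42; this only shows where the bookkeeping could live on the client side.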
[15:25:44] btullis: yes, it's known; Brian is migrating the jobs to the k8s-operator deployment model
[15:26:36] Cool, I thought it probably was. Just checking. Best of luck, shout if I can help with anything.
[15:26:52] thanks btullis!
[16:06:57] fyi, David, Peter and myself are still doing the flink migration, so no Weds mtg for us
[16:39:44] muted the rdf-related alerts for 3 hours
[16:39:48] inflatador: ^
[16:44:43] thanks ebernhardson
[16:45:26] we're on the operator in eqiad, but codfw isn't there yet... maybe fw rules for zk?
[16:45:30] checking now
[16:50:26] dcausse: codfw zk hosts are still insetup ;( Patch to enable them here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/978634
[16:57:11] ^^ merged already, setting it up now
[17:00:56] gonna need a few more puppet things set up, one sec
[17:10:48] btullis, dcausse: another puppet patch for ZK if y'all have time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/978639
[17:11:32] looking now
[17:37:05] ZK's initscript is being stupid... kinda surprised they don't have a proper systemd unit file
[17:48:00] OK, wikidata is stable in codfw... just waiting on commons
[17:50:04] both apps are stable, merging the dp patch
[18:00:23] I guess we'll need to update or remove those `RdfStreamingUpdaterNotEnoughTaskSlots` alerts
[18:02:08] OK, migration is done! Lunchtime
[18:36:14] back
[19:01:42] hmm, fetch_failure is mostly RevisionNotFoundException, perhaps more than I would have expected (might write something to check their current state), but we do have a few timeouts that end up in fetch_failure
[19:03:15] gehel: will you make the alerting mtg?
[19:22:39] heh, picking a couple of random pages, they also fail to render in time via the regular path; hard to know when the timeouts are real and when they are transient
[19:23:58] first one is 1.3MB of wikitext generated by a bot :S
[19:46:18] appointment time, back in ~90
[19:48:39] * ebernhardson wonders if the php side should have some way to time out the parser cache fetch and report a proper error instead of just timing out
[19:50:02] we have a library available that makes it possible to invoke a php function with a timeout, although it looks like a transitive dependency that was never used in mediawiki directly
[19:50:43] ahh, yea, it comes in through the phpunit test suite
[21:22:22] back
[22:05:19] inflatador: \o/ well done!
[22:34:48] dcausse: thanks for all your help and patience!
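For the side-output idea floated at 13:30 and the fetch_failure breakdown at 19:01, here is a minimal Flink sketch of routing failed fetches to a side output with an OutputTag so they can be inspected (or retried) separately. FetchResult and FetchFailureRouter are hypothetical placeholders, not classes from the actual updater job.

```java
// Illustrative only: FetchResult is a placeholder, not the updater's real class.
import java.io.Serializable;

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/** Placeholder for whatever the fetch stage emits. */
class FetchResult implements Serializable {
    String pageTitle;
    String failureReason; // null on success, e.g. "RevisionNotFoundException" or "timeout"

    boolean isFailure() {
        return failureReason != null;
    }
}

/** Routes failed fetches to a side output so the main stream only carries successes. */
class FetchFailureRouter extends ProcessFunction<FetchResult, FetchResult> {
    // Anonymous subclass so Flink can capture the generic type at runtime.
    static final OutputTag<FetchResult> FETCH_FAILURES = new OutputTag<FetchResult>("fetch-failure") {};

    @Override
    public void processElement(FetchResult value, Context ctx, Collector<FetchResult> out) {
        if (value.isFailure()) {
            // Permanent vs. transient failures (RevisionNotFoundException vs. timeout)
            // could get their own tags here if that distinction matters downstream.
            ctx.output(FETCH_FAILURES, value);
        } else {
            out.collect(value);
        }
    }
}
```

Wiring it up would look like `stream.process(new FetchFailureRouter())` followed by `.getSideOutput(FetchFailureRouter.FETCH_FAILURES)` on the returned SingleOutputStreamOperator.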