[08:49:33] https://github.com/nomoa/flink-connector-elasticsearch/commit/4fb2b577cc5965cd9977e663116e996b37903190 is what I have so far, not very happy with it but it gives access to the context and lets the user create more metrics from a BulkItemResponseHandler supplied when creating the sink
[08:52:37] I'm not sure about the signature of the onResponse method returning SUCCESS, RETRY or ERROR, esp. RETRY (i.e. how do you track the number of retries for a particular item)
[09:05:00] retrying this way also changes the order of operations, which might be surprising for some users...
[09:09:02] wdqs1022 & wdqs1023 are a bit unstable, ssh & systemctl alerts flapping (seeing weird things like systemd-timedated.service: Unexpected error response from GetNameOwner(): Connection terminated)
[10:34:10] dcausse: cool, thanks. Just an idea to get more control over retries: if the `responseHandler` is in charge of building the action to be retried, it could keep track of which actions have been retried (and how often).
[10:38:02] pfischer: yes, that'd be up to the client code implementing BulkItemResponseHandler (which could be a class also implementing the Emitter) to keep a kind of Map retryCounter or the like
[10:38:41] providing a retry counter/limit from the sink seems difficult
[10:42:43] another thought was to use some hints/metadata in the elastic data classes (DocWriteRequest children) to keep and increment a counter, but I haven't found a place to store that
[10:51:30] lunch
[11:22:58] Good morning. Could you check on the status of your Airflow jobs on an-airflow1005 please? I have received some alert emails to discovery-alerts@lists.wikimedia.org with the message: `Executor reports task instance finished (failed) although the task says its queued. (Info: None) Was the task killed externally?`
[11:23:53] I have upgraded Airflow to version 2.7.3 and restarted the Airflow services, so it is possible that these tasks were killed externally.
[11:24:35] I paused all active DAGs about 15 minutes before the upgrade, in the hope that their tasks would run to completion. Then I unpaused them just after the upgrade.
[12:28:02] btullis: I wonder why all those alert mails stop at “Try 4 of 5”. According to the YARN UI, there have been successful runs (state finished, around 12:35 UTC) for all applications that I see warnings for.
[12:29:14] pfischer: Thanks for looking into it. 12:35 UTC hasn't happened yet; were these runs from yesterday?
[12:40:54] Maybe I got the time wrong, but I looked at the finished applications: start time is Wed Nov 29 12:35:14 +0100 2023. That would be 12:35:14 CEST, right? https://yarn.wikimedia.org/cluster/app/application_1695896957545_374146
[12:41:31] The detail view shows UTC, so 11:35:14, btullis
[13:30:20] dcausse: I wonder if we should build in support for retries after all. I thought of mapping the failed action back to an event (so we can route them to a side output), but due to the ES emitter this is not necessarily a 1:1 mapping.
[13:42:06] pfischer: true, not sure either... we might need to gain more operational knowledge on possible errors and see if there are cases where it might make sense
[14:15:10] o/
[14:26:15] o/
[14:58:26] depooling wdqs/wcqs eqiad in preparation for https://phabricator.wikimedia.org/T326409
[15:22:56] inflatador: We just got an alert about `RdfStreamingUpdaterFlinkJobUnstable search-platform (WDQS_Streaming_Updater rdf-streaming-updater k8s critical eqiad prometheus)` - this is expected, isn't it?
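To make the retryCounter idea from the morning discussion concrete, here is a minimal sketch. It assumes a hypothetical BulkItemResponseHandler whose onResponse returns SUCCESS/RETRY/ERROR as described above; the interface shape, BulkItemResult, and CountingResponseHandler are illustrative and not taken from the linked commit.

```java
// Hypothetical sketch: the interface shape and names below follow the IRC discussion,
// not the linked commit; retry bookkeeping lives entirely in client code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.elasticsearch.action.DocWriteRequest;
import org.elasticsearch.action.bulk.BulkItemResponse;

enum BulkItemResult { SUCCESS, RETRY, ERROR }

/** Assumed shape of the per-item callback discussed above. */
interface BulkItemResponseHandler {
    BulkItemResult onResponse(DocWriteRequest<?> action, BulkItemResponse response);
}

/** Client-side handler that caps retries by tracking a per-document counter. */
class CountingResponseHandler implements BulkItemResponseHandler {
    private final int maxRetries;
    // Keyed by index + doc id; a real job would need to bound/evict this map.
    private final Map<String, Integer> retryCounter = new ConcurrentHashMap<>();

    CountingResponseHandler(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    @Override
    public BulkItemResult onResponse(DocWriteRequest<?> action, BulkItemResponse response) {
        String key = action.index() + "/" + action.id();
        if (!response.isFailed()) {
            retryCounter.remove(key);
            return BulkItemResult.SUCCESS;
        }
        int attempts = retryCounter.merge(key, 1, Integer::sum);
        return attempts <= maxRetries ? BulkItemResult.RETRY : BulkItemResult.ERROR;
    }
}
```

Whether the counter is keyed by index/id or carried as metadata on the DocWriteRequest is exactly the open question from 10:42; this only shows where the bookkeeping could live on the client side.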
[15:25:44] btullis: yes, it's known; Brian is migrating the jobs to the k8s-operator deployment model
[15:26:36] Cool, I thought it probably was. Just checking. Best of luck, shout if I can help with anything.
[15:26:52] thanks btullis!
[16:06:57] fyi, David, Peter and myself are still doing the flink migration, so no Weds mtg for us
[16:39:44] muted the rdf-related alerts for 3 hours
[16:39:48] inflatador: ^
[16:44:43] thanks ebernhardson
[16:45:26] we're on the operator in eqiad, but codfw isn't there yet... maybe fw rules for zk?
[16:45:30] checking now
[16:50:26] dcausse: codfw zk hosts are still insetup ;( Patch to enable them here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/978634
[16:57:11] ^^ merged already, setting it up now
[17:00:56] gonna need a few more puppet things set up, one sec
[17:10:48] btullis, dcausse: another puppet patch for ZK if y'all have time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/978639
[17:11:32] looking now
[17:37:05] ZK's initscript is being stupid... kinda surprised they don't have a proper systemd unit file
[17:48:00] OK, wikidata is stable in codfw... just waiting on commons
[17:50:04] both apps are stable, merging the dp patch
[18:00:23] I guess we'll need to update or remove those `RdfStreamingUpdaterNotEnoughTaskSlots` alerts
[18:02:08] OK, migration is done! Lunchtime
[18:36:14] back
[19:01:42] hmm, fetch_failure is mostly RevisionNotFoundException, perhaps more than I would have expected (might write something to check their current state), but we do have a few timeouts that end up in fetch_failure
[19:03:15] gehel: will you make the alerting mtg?
[19:22:39] heh, picking a couple of random pages, they also fail to render in time via the regular path; hard to know when the timeouts are real and when they are transient
[19:23:58] first one is 1.3MB of wikitext generated by a bot :S
[19:46:18] appointment time, back in ~90
[19:48:39] * ebernhardson wonders if the php side should have some way to time out the parser cache fetch and report a proper error instead of just timing out
[19:50:02] we have a library available that makes it possible to invoke a php function with a timeout, although it looks like a transitive dependency that was never used in mediawiki directly
[19:50:43] ahh, yea, it comes in through the phpunit test suite
[21:22:22] back
[22:05:19] inflatador: \o/ well done!
[22:34:48] dcausse: thanks for all your help and patience!
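For the side-output idea floated at 13:30 and the fetch_failure breakdown at 19:01, here is a minimal Flink sketch of routing failed fetches to a side output with an OutputTag so they can be inspected (or retried) separately. FetchResult and FetchFailureRouter are hypothetical placeholders, not classes from the actual updater job.

```java
// Illustrative only: FetchResult is a placeholder, not the updater's real class.
import java.io.Serializable;

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

/** Placeholder for whatever the fetch stage emits. */
class FetchResult implements Serializable {
    String pageTitle;
    String failureReason; // null on success, e.g. "RevisionNotFoundException" or "timeout"

    boolean isFailure() {
        return failureReason != null;
    }
}

/** Routes failed fetches to a side output so the main stream only carries successes. */
class FetchFailureRouter extends ProcessFunction<FetchResult, FetchResult> {
    // Anonymous subclass so Flink can capture the generic type at runtime.
    static final OutputTag<FetchResult> FETCH_FAILURES = new OutputTag<FetchResult>("fetch-failure") {};

    @Override
    public void processElement(FetchResult value, Context ctx, Collector<FetchResult> out) {
        if (value.isFailure()) {
            // Permanent vs. transient failures (RevisionNotFoundException vs. timeout)
            // could get their own tags here if that distinction matters downstream.
            ctx.output(FETCH_FAILURES, value);
        } else {
            out.collect(value);
        }
    }
}
```

Wiring it up would look like `stream.process(new FetchFailureRouter())` followed by `.getSideOutput(FetchFailureRouter.FETCH_FAILURES)` on the returned SingleOutputStreamOperator.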