[09:41:20] could someone help me troubleshoot an issue with a flink container on main eqiad? the mw-page-content-change-enrich app died last night with https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.39?id=Xqro1IoBYI7nmYcaEY_P and is not coming back up
[09:41:27] I tried to undeploy on eqiad and re-deploy, but the app won't spin up PODs
[09:42:01] This log in the flink k8s operator container might be related https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.39?id=Mjrk1YoBYI7nmYca-j1I
[09:45:15] <_joe_> gmodena: uh
[09:49:37] gmodena: can't see your flinkdeployment, did helmfile end successfully after running apply?
[09:51:06] the "RejectedExecutionException: event executor terminated" is concerning and suggests the operator was in a weird/incoherent state
[09:51:24] dcausse no, that's part of the problem
[09:51:31] I applied, but PODs won't come up
[09:51:51] and I could not find much info in logs, other than what I linked
[09:51:55] the pod will come up only after the operator reads the flinkdeployment k8s resource
[09:52:12] dcausse helmfile did _seem_ to have ended successfully
[09:52:12] kubectl get flinkdeployment -> empty
[09:53:07] unless I'm in the wrong env... entered: kube_env mw-page-content-change-enrich eqiad
[09:53:12] dcausse I suspect something got stuck in the operator
[09:53:20] dcausse that's the right env
[09:53:35] helmfile should still create that flinkdeployment resource I think...
[09:53:46] I remember a similar issue on staging
[09:54:18] that required admin powers to force a deployment
[09:59:02] my assumptions are probably wrong, I was assuming that the operator was polling the flinkdeployment for changes to determine what action to take
[09:59:25] Brian can attempt a restart when he starts his day
[10:03:13] dcausse ack. I pinged Brian in Slack (#event-platform). Thanks for the help debugging.
[10:06:29] I do see some errors from the flink-operator in main-eqiad. Like this...
[10:06:34] {"@timestamp":"2023-09-27T09:06:41.027Z","log.level":"ERROR","message":"Failed to submit a listener notification task. Event loop shut down?", "ecs.version": "1.2.0","process.thread.name":"Flink-RestClusterClient-IO-thread-1","log.logger":"org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.rejectedExecution","error.type":"java.util.concurrent.RejectedExecutionException","error.message":"event executor terminated"}
[10:08:13] Not sure why I can't see the flink-operator namespace logs in logstash for now. https://logstash.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c?_g=h@c823129&_a=h@cf9f799
[10:08:50] I can also try a redeploy of the flink operator, but I'm still a bit tied up in dealing with this kafka incident at the moment.
[10:10:21] btullis ack. I think this can wait till after the kafka incident and/or Brian is available
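For reference, a rough sketch of the checks discussed above, as run from a deployment host. The kube_env invocation and the service namespace come from the conversation; the FlinkDeployment resource name, the flink-operator deployment name, and the exact flags are assumptions, not verified against the cluster.

    # select the service's namespace/credentials (as done above)
    kube_env mw-page-content-change-enrich eqiad

    # did helmfile actually create the FlinkDeployment custom resource?
    # (flinkdeployment is the short name for flinkdeployments.flink.apache.org)
    kubectl get flinkdeployment

    # if it exists, look for reconciliation errors reported by the operator
    # (resource name assumed to match the release)
    kubectl describe flinkdeployment mw-page-content-change-enrich

    # check the operator itself; the flink-operator namespace likely needs admin credentials,
    # and the deployment name below is assumed
    kube_env flink-operator eqiad
    kubectl get pods
    kubectl logs deploy/flink-kubernetes-operator --since=1h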
[12:16:49] Unfortunately I can't help rn as I'm stuck in another issue, but re-deploying the flink operator seems like an absolute last resort option to me. That would/could affect all flink deployments running, so we should not do that in general and instead take the chance to figure out what is wrong
[13:30:13] gmodena I have a meeting in 1m but will take a look once that's done. btullis the missing logs are a known issue, will link a task shortly
[13:40:32] logs should be in the "ecs" indices
[13:42:39] dcausse got it, I have this ticket to find the logs: https://phabricator.wikimedia.org/T345668 . Am I just looking at the wrong indices? I thought I checked ecs already but maybe not
[13:44:13] so orchestrator.cluster.url:"https://kubemaster.svc.codfw.wmnet:6443" AND orchestrator.namespace:rdf-streaming-updater AND "Completed checkpoint" is for the rdf-streaming-updater flink job running in wikikube@codfw
[13:45:06] That data is there, just checked
[13:45:25] so this ticket is invalid... good! I can close it
[13:45:40] yes, checked as well
[13:45:55] the dashboard shared by Ben is not using the ECS index, so that's probably why
[13:46:38] orchestrator.cluster.url:"https://kubemaster.svc.eqiad.wmnet:6443" AND orchestrator.namespace:flink-operator works well with the ecs indices
[13:47:24] but there's something not working well with the dashboard "App Logs - ECS (Kubernetes)": https://logstash.wikimedia.org/app/dashboards#/view/f3fefa60-f95a-11ed-aacf-e115c4d3fd2c?_g=h@c823129&_a=h@35e22a1
[13:47:32] log lines are not showing up
[13:48:19] (once you select some filters)
[13:49:33] So you can see the logs with `kubectl` but they aren't making it into OS?
[13:50:56] logs are in logstash but I'm guessing that it's a filtering problem in the OpenSearch dashboard
[14:00:21] inflatador ack
[14:06:58] I have the feeling there is no general rule about whether logs are in the ecs or logstash indices. I wondered about that at some point last week and forgot again. If it stays unclear, maybe involve o11y folks so we can get a shared understanding of what is/should be where
[15:03:12] fwiw I just deployed the flink-kubernetes-operator-2.3.3 -> flink-kubernetes-operator-2.3.5 update to all wikikube clusters as that was not done
[15:08:03] jayme ack
[15:10:47] mw-page-content-change-enrich is still stuck :(
[17:25:49] gmodena sorry, been out, looking at it now
[18:58:24] started https://phabricator.wikimedia.org/T347521 to track the work, haven't found much so far. I did notice the flink-operator in staging is also broken
[19:03:57] ^^ correction, the **flinkdeployment** in staging is broken, not the operator
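A sketch of how one might confirm that distinction (broken flinkdeployment vs. broken operator) from the staging cluster. The "staging" environment name and the resource name are assumptions, and the status fields referenced are the ones documented for the upstream FlinkDeployment CRD, not checked against this particular deployment.

    # look at the FlinkDeployment resource in staging (resource name assumed)
    kube_env mw-page-content-change-enrich staging
    kubectl get flinkdeployment
    kubectl get flinkdeployment mw-page-content-change-enrich -o yaml

    # the status block written by the operator usually distinguishes the two cases,
    # e.g. jobManagerDeploymentStatus (READY/MISSING/ERROR), jobStatus.state and any error text
    kubectl get flinkdeployment mw-page-content-change-enrich \
      -o jsonpath='{.status.jobManagerDeploymentStatus}{"\n"}{.status.jobStatus.state}{"\n"}{.status.error}{"\n"}'

    # recent events in the namespace often show why pods never came up
    kubectl get events --sort-by=.lastTimestamp | tail -n 20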