[09:41:20] could someone help me troubleshoot an issue with a flink container on main eqiad? the mw-page-content-change-enrich app died last night with https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.39?id=Xqro1IoBYI7nmYcaEY_P and is not coming back up
[09:41:27] I tried to undeploy on eqiad and re-deploy, but the app won't spin up PODs
[09:42:01] This log in the flink k8s operator container might be related https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.39?id=Mjrk1YoBYI7nmYca-j1I
[09:45:15] <_joe_> gmodena: uh
[09:49:37] gmodena: can't see your flinkdeployment, did helmfile end successfully after running apply?
[09:51:06] the "RejectedExecutionException: event executor terminated" is concerning and suggests the operator was in a weird/incoherent state
[09:51:24] dcausse no, that's part of the problem
[09:51:31] I applied, but PODs won't come up
[09:51:51] and I could not find much info in logs, other than what I linked
[09:51:55] the pod will come up only after the operator reads the flinkdeployment k8s resource
[09:52:12] dcausse helmfile did _seem_ to have ended successfully
[09:52:12] kubectl get flinkdeployment -> empty
[09:53:07] unless I'm in the wrong env... entered: kube_env mw-page-content-change-enrich eqiad
[09:53:12] dcausse I suspect something got stuck in the operator
[09:53:20] dcausse that's the right env
[09:53:35] helmfile should still create that flinkdeployment resource I think...
[09:53:46] I remember a similar issue on staging
[09:54:18] that required admin powers to force a deployment
[09:59:02] my assumptions are probably wrong, I was assuming that the operator was polling the flinkdeployment for changes to determine what action to take
[09:59:25] Brian can attempt a restart when he starts his day
[10:03:13] dcausse ack. I pinged Brian in Slack (#event-platform). Thanks for the help debugging.
[10:06:29] I do see some errors from the flink-operator in main-eqiad. Like this...
[10:06:34] {"@timestamp":"2023-09-27T09:06:41.027Z","log.level":"ERROR","message":"Failed to submit a listener notification task. Event loop shut down?", "ecs.version": "1.2.0","process.thread.name":"Flink-RestClusterClient-IO-thread-1","log.logger":"org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.rejectedExecution","error.type":"java.util.concurrent.RejectedExecutionException","error.message":"event executor terminated"}
[10:08:13] Not sure why I can't see the flink-operator namespace logs in logstash for now. https://logstash.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c?_g=h@c823129&_a=h@cf9f799
[10:08:50] I can also try a redeploy of the flink operator, but I'm still a bit tied up in dealing with this kafka incident at the moment.
[10:10:21] btullis ack. I think this can wait till after the kafka incident and/or Brian is available
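For reference, a rough sketch of the checks discussed above, as run from a deployment host. The kube_env invocation and the service namespace come from the conversation; the FlinkDeployment resource name, the flink-operator deployment name, and the exact flags are assumptions, not verified against the cluster.

    # select the service's namespace/credentials (as done above)
    kube_env mw-page-content-change-enrich eqiad

    # did helmfile actually create the FlinkDeployment custom resource?
    # (flinkdeployment is the short name for flinkdeployments.flink.apache.org)
    kubectl get flinkdeployment

    # if it exists, look for reconciliation errors reported by the operator
    # (resource name assumed to match the release)
    kubectl describe flinkdeployment mw-page-content-change-enrich

    # check the operator itself; the flink-operator namespace likely needs admin credentials,
    # and the deployment name below is assumed
    kube_env flink-operator eqiad
    kubectl get pods
    kubectl logs deploy/flink-kubernetes-operator --since=1h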
[12:16:49] Unfortunately I can't help rn as I'm stuck in another issue, but re-deploying the flink operator seems like an absolute last resort option to me. That would/could affect all flink deployments running, so we should not do that in general and instead take the chance to figure out what is wrong
[13:30:13] gmodena I have a meeting in 1m but will take a look once that's done. btullis the missing logs are a known issue, will link a task shortly
[13:40:32] logs should be in the "ecs" indices
[13:42:39] dcausse got it, I have this ticket to find the logs: https://phabricator.wikimedia.org/T345668 . Am I just looking at the wrong indices? I thought I checked ecs already but maybe not
[13:44:13] so orchestrator.cluster.url:"https://kubemaster.svc.codfw.wmnet:6443" AND orchestrator.namespace:rdf-streaming-updater AND "Completed checkpoint" is for the rdf-streaming-updater flink job running in wikikube@codfw
[13:45:06] That data is there, just checked
[13:45:25] so this ticket is invalid... good! I can close it
[13:45:40] yes, checked as well
[13:45:55] the dashboard shared by Ben is not using the ECS index, so that's probably why
[13:46:38] orchestrator.cluster.url:"https://kubemaster.svc.eqiad.wmnet:6443" AND orchestrator.namespace:flink-operator works well with the ecs indices
[13:47:24] but there's something not working well with the dashboard "App Logs - ECS (Kubernetes)": https://logstash.wikimedia.org/app/dashboards#/view/f3fefa60-f95a-11ed-aacf-e115c4d3fd2c?_g=h@c823129&_a=h@35e22a1
[13:47:32] log lines are not showing up
[13:48:19] (once you select some filters)
[13:49:33] So you can see the logs with `kubectl` but they aren't making it into OS?
[13:50:56] logs are in logstash but I'm guessing that it's a filtering problem in the OpenSearch dashboard
[14:00:21] inflatador ack
[14:06:58] I have the feeling there is no general rule about whether logs are in the ecs or logstash indices. I wondered about that at some point last week and forgot again. If it stays unclear, maybe involve o11y folks so we can get a shared understanding of what is/should be where
[15:03:12] fwiw I just deployed the flink-kubernetes-operator-2.3.3 -> flink-kubernetes-operator-2.3.5 update to all wikikube clusters as that was not done
[15:08:03] jayme ack
[15:10:47] mw-page-content-change-enrich is still stuck :(
[17:25:49] gmodena sorry, been out, looking at it now
[18:58:24] started https://phabricator.wikimedia.org/T347521 to track the work, haven't found much so far. I did notice the flink-operator in staging is also broken
[19:03:57] ^^ correction, the **flinkdeployment** in staging is broken, not the operator
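A sketch of how one might confirm that distinction (broken flinkdeployment vs. broken operator) from the staging cluster. The "staging" environment name and the resource name are assumptions, and the status fields referenced are the ones documented for the upstream FlinkDeployment CRD, not checked against this particular deployment.

    # look at the FlinkDeployment resource in staging (resource name assumed)
    kube_env mw-page-content-change-enrich staging
    kubectl get flinkdeployment
    kubectl get flinkdeployment mw-page-content-change-enrich -o yaml

    # the status block written by the operator usually distinguishes the two cases,
    # e.g. jobManagerDeploymentStatus (READY/MISSING/ERROR), jobStatus.state and any error text
    kubectl get flinkdeployment mw-page-content-change-enrich \
      -o jsonpath='{.status.jobManagerDeploymentStatus}{"\n"}{.status.jobStatus.state}{"\n"}{.status.error}{"\n"}'

    # recent events in the namespace often show why pods never came up
    kubectl get events --sort-by=.lastTimestamp | tail -n 20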