[07:40:43] o/
[07:41:43] dcausse same thing we see in mjolnir. skein keeps track of spark, airflow loses track of skein and re-spawns
[07:42:24] but we did not see this issue in other dags, right? import_wikidata_ttl had a similar job submission model, and that worked.
[07:43:03] I'm brewing some coffee, then I'll dig into this thing
[08:01:49] o/
[08:06:08] gmodena: correct but for import_wikidata_ttl I don't remember us doing anything other than switching to client mode...
[08:07:12] the "WARNING - State of this instance has been externally set to restarting. Terminating instance" makes me wonder if the task state in the DB is not forced to kill by something external to the k8s executor
[08:11:19] actual log line is "WARNING - State of this instance has been externally set to up_for_retry. Terminating instance"
[08:12:55] dcausse ack. I'm looking for airflow logs
[08:30:56] psycopg2.OperationalError: could not translate host name "postgresql-airflow-search-pooler-rw.airflow-search" to address: Name or service not known
[08:31:41] from the scheduler pod
[08:31:56] airflow-scheduler-584d8f8dfc-4tpvp 2/2 Running 62 (36m ago) 4d16h
[08:32:05] 62 restarts
[08:37:05] also https://codesearch.wmcloud.org/search/?q=scheduler_health_check_threshold&files=&excludeFiles=&repos=
[08:38:17] 360 from puppet, unsure what's the setup in k8s
[08:38:32] btullis, brouberol: can you help ^
[08:40:04] I can look in 30 minutes.
[08:41:01] I have seen these errors in all instances. I'm not 100% sure what the issue is, but I've seen it cause scheduler restarts. We recently found out that the liveness checks call out to the database, which caused other types of issues during the migration, so I think I'll resort to a tcpSocket check instead of an HTTP one, to at least bypass the issue
[08:41:01] while I dig
[08:41:28] let me create a ticket real quick
[08:44:27] thx!
[08:47:01] brouberol btullis (for when you are around) is there a way to retrieve container logs for airflow pods in a Completed state?
[08:49:18] I've created T383651
[08:49:19] T383651: Scheduler pods are restarted due to the PG pooler DNS name being non-resolvable - https://phabricator.wikimedia.org/T383651
[08:49:33] sure, you should be able to see the logs within the UI
[08:49:56] as they have been exported to s3
[08:53:03] brouberol: are logs from e.g. airflow-scheduler-584d8f8dfc-4tpvp exported to logstash?
[08:55:02] they should be, it's a regular pod, so the logs it emits should be handled by the global log pipeline
[08:55:50] ack
[08:58:40] I've opened https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1111162 that should mitigate the issue
[09:00:05] just to clarify something. How did the scheduler pod restarting impact you?
[09:00:22] brouberol ack. I wanted to compare container logs with UI, just to validate if I'm missing something.
[09:01:02] was it causing issues, or was this a "there's something fishy happening here" type of situation?
[09:01:15] brouberol: we're looking into issues happening since the switch to k8s, k8s executors losing track of their skein app, logs suggest that the task is invalidated by something else than the k8s-exec-pod
[09:01:28] Should we increase the number of coredns replicas, I wonder?
[09:02:41] so, that's the other thing. The patch mitigates the effect of whatever is happening, but I'm not 100% clear _why_ the service name would be un-resolvable.
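For context on the mitigation brouberol describes at 08:41, a tcpSocket liveness check on the scheduler pod would look roughly like the sketch below. This is only an illustration: the port (8974 is Airflow's default scheduler health-check port) and the probe timings are assumptions, not the values in the actual deployment-charts patch.

    livenessProbe:
      # check only that the health-check port accepts connections,
      # instead of calling /health (which goes through the database)
      tcpSocket:
        port: 8974
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 3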
I can think of 2 scenarios: CoreDNS can't respond to requests, or none of the pgbouncer pods are in a Ready state
[09:02:50] dcausse: I thought that issue had been largely addressed by the client mode change. Is that not the case?
[09:03:07] btullis: no I think that's unrelated actually
[09:03:22] we're seeing this issue on several dags now (all client mode)
[09:04:13] ack. Is joal aware of this, too?
[09:04:29] I'm still 15 mins from my desk.
[09:05:47] I can't find where we set scheduler_health_check_threshold in k8s (looked at the airflow-config configmap)
[09:07:10] I don't think we set it, so we must be using the default value. Let me check what it is
[09:07:56] https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#scheduler-health-check-threshold the default is 30s
[09:11:55] I'm redeploying the airflow-search scheduler with the new healthcheck config
[09:12:09] {{done}}
[09:12:46] we seem to have changed that in https://gerrit.wikimedia.org/r/c/operations/puppet/+/616631 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/663304 but again unsure if it's all related :/
[09:15:17] just to clarify something. There are 2 health_checks at play there. Kubernetes regularly calls the /health API handler to determine whether the pod is healthy (which we just updated to checking whether the API port was open). If that's not the case, the pod is restarted
[09:16:13] but the scheduler_health_check_threshold parameter you're referring to is something internal to airflow, that checks whether the scheduler has recently sent some heartbeats, so that airflow itself can consider it as healthy
[09:19:50] yes, trying to figure out if airflow zombie task detection could be causing what we see
[09:21:03] I suspect that this postgresql-airflow-search-pooler-rw.airflow-search happening when calling /health from the k8s livenessProbe might also happen during "normal" scheduler operations
[09:21:26] s/postgresql-airflow-search-pooler-rw.airflow-search/ns resolution issue/
[09:26:23] https://logstash.wikimedia.org/goto/63043c0cda0a150638fbed464b34ef5a
[09:28:40] we _should_ see that trend go down now that we've removed that dependency from the kube healthchecks of the scheduler pod
[09:31:10] What about if we increase the failureThreshold of the liveness checks, so that more sequential failures trigger a restart? Upstream set 5. https://github.com/airflow-helm/charts/blob/main/charts/airflow/values.yaml#L699
[09:34:13] also some weird things, the task retry counter jumps from 6 -> 8 -> 10 -> 12 when seeing that issue
[09:34:42] e.g. https://airflow-search.wikimedia.org/dags/import_cirrus_indexes_weekly/grid?dag_run_id=scheduled__2025-01-05T00%3A00%3A00%2B00%3A00&task_id=import&tab=logs
[09:35:10] task tries 1 to 6 are legitimate failures
[09:35:52] 8, 10 and 12 are failures with "WARNING - State of this instance has been externally set to up_for_retry. Terminating instance."
[09:36:39] wondering where are the runs 7, 9 and 11?
[09:39:30] let me check in the DB
[09:41:32] kubectl get pods -l dag_id=import_cirrus_indexes_weekly -o json | jq '.items[] | .metadata.annotations.try_number' -r
[09:41:40] 7, 11 and 9
[09:42:03] the scheduler should take care of cleaning pods after they're done ?
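For reference, the scheduler_health_check_threshold setting discussed above (30s by default, 360s in the puppet-managed setup) can be overridden through Airflow's standard environment-variable convention; a minimal sketch of what that could look like on the scheduler container follows. The surrounding structure is assumed here, not taken from the actual chart.

    env:
      # [scheduler] scheduler_health_check_threshold, in seconds;
      # 360 mirrors the value used on the puppet-managed instances
      - name: AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD
        value: "360"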
[09:46:01] somehow, they are not in DB https://phabricator.wikimedia.org/P72029
[09:48:20] back
[09:49:17] pod import-cirrus-indexes-weekly-import-p78es34z is tryNumber 9 and it submits the skein app at: "25/01/13 18:07:47 INFO impl.YarnClientImpl: Submitted application application_1734703658237_820732"
[09:50:21] but here it says: "WARNING - State of this instance has been externally set to queued. Terminating instance."
[09:50:42] note the "queued"
[09:51:57] then it marks the task as UP_FOR_RETRY at : [2025-01-13T18:18:00.619+0000] {taskinstance.py:1225} INFO - Marking task as UP_FOR_RETRY. dag_id=import_cirrus_indexes_weekly, task_id=import, run_id=scheduled__2025-01-05T00:00:00+00:00, execution_date=20250105T000000, start_date=20250113T181748, end_date=20250113T181800
[09:52:57] which is the time when try number 10 is failed with (from airflow logs not k8s): [2025-01-13, 18:18:08 UTC] {local_task_job_runner.py:346} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[09:53:08] there are concurrent runs
[09:54:17] concurrent runs of the same dag task?
[09:54:26] "this instance has been externally set to queued." -> up_for_retry -> "this instance has been externally set to up_for_retry"
[09:54:35] brouberol: yes that's my understanding...
[09:56:34] there were 6 attempts (tries no 7 to 12), only even tries are reported in the airflow ui, the odd ones are still visible in k8s with kubectl get pods -l dag_id=import_cirrus_indexes_weekly
[09:59:22] I'm indeed seeing the even and odd attempts in https://airflow-search.wikimedia.org/dags/import_cirrus_indexes_weekly/grid?tab=event_log&dag_run_id=scheduled__2025-01-05T00%3A00%3A00%2B00%3A00&task_id=import
[10:05:47] just wondering, do we have a replicated database, could replication lag cause some weird issues
[10:09:57] (link to the scheduler logs https://logstash.wikimedia.org/goto/7511157334de4d74b153cffbe468779d)
[10:11:54] hmm, we have 2 PG pods (a master and a failover) running on Kubernetes, and replication lag never peaked above 2s in the past week
[10:11:54] https://grafana-rw.wikimedia.org/d/cloudnative-pg/cloudnativepg?forceLogin=&from=now-7d&orgId=1&refresh=30s&to=now&var-DS_PROMETHEUS=eqiad+prometheus%2Fk8s-dse&var-cluster=postgresql-airflow-search&var-instances=All&var-namespace=airflow-search&var-operatorNamespace=cloudnative-pg-operator&viewPanel=16)
[10:12:12] ok "[2025-01-13T18:01:43.024+0000] {scheduler_job_runner.py:495} INFO - DAG import_cirrus_indexes_weekly has 0/6 running and queued tasks" is false, it just said otherwise two mins before:
[10:12:16] [2025-01-13T17:59:36.508+0000] {scheduler_job_runner.py:764} INFO - Received executor event with state queued for task instance TaskInstanceKey(dag_id='import_cirrus_indexes_weekly', task_id='import', run_id='scheduled__2025-01-05T00:00:00+00:00', try_number=7, map_index=-1)
[10:13:04] we're over in https://meet.google.com/ort-mznr-eeu if you want to join
[11:40:28] lunch
[13:37:18] still seeing the scheduler being restarted and the import cirrus indexes is starting to fail the same as before
[13:37:39] the try "13" has now disappeared
[13:41:09] https://usercontent.irccloud-cdn.com/file/hGAuVWl3/image.png
[13:41:25] yes same behavior...
[13:41:29] Agreed. I didn't touch anything.
[13:42:09] so the scheduler pod restart (because of ns resolution issue) happened at 13:28 UTC
[13:42:26] The yarn job is still running.
https://yarn.wikimedia.org/proxy/application_1734703658237_831883/jobs/job/?id=0
[13:43:03] try 14 happened at 13:28:38
[13:43:14] Now, presumably, dag run 15 will hit an error because of concurrent runs and accessing parquet files at the same time as #13 ?
[13:45:06] no I think it'll fail because airflow might attempt to start attempt 16
[13:45:48] Should we kill this task run and its yarn job and pause the DAG, while we investigate the scheduler crash?
[13:46:57] btullis: also HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"import-cirrus-indexes-weekly-import-p78es34z\" is forbidden: User \"system:serviceaccount:airflow-search:airflow\" cannot patch resource \"pods\" in API group \"\" in the namespace
[13:47:00] \"airflow-search\"","reason":"Forbidden","details":{"name":"import-cirrus-indexes-weekly-import-p78es34z","kind":"pods"},"code":403}
[13:47:35] wondering if when the pod is running and it's failing to "adopt" it it might kill the task
[13:47:44] Hmm. patch verb is interesting there.
[13:49:13] We do not allow patch here: https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/airflow/templates/rbac.yaml#L11 cc: brouberol
[13:50:20] the scheduler seems to try adopting all pods in its namespace but fails because of this issue
[13:50:40] this is when starting
[13:50:55] INFO - Adopting or resetting orphaned tasks for active dag runs
[13:50:57] OK, good find. It looks like the upstream chart does allow patching, too. https://github.com/airflow-helm/charts/blob/732a0f0bdbeb420159807d3d5f858e614c6656ec/charts/airflow/templates/rbac/airflow-role.yaml#L40
[13:53:09] good find! so, I'll have a look at what happened to the scheduler and will add the patch rbac rule
[13:54:25] ok, things are starting to make sense I think
[13:54:53] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1111230
[13:55:01] the scheduler crashes due to $reason, when it re-spawns, it fails to re-acquire running tasks due to missing RBAC, and thus creates new tasks
[13:55:34] btullis: approved, thanks
[13:55:55] reason is could not translate host name "postgresql-airflow-search-pooler-rw.airflow-search" to address: Name or service not known
[13:56:18] Thanks. I wonder if it could be pod memory for the scheduler? That might cause a DNS lookup to fail. I'll deploy the RBAC change now.
[13:56:49] but to reproduce we might just try to roll restart the scheduler and see if it re-acquires its running tasks
[13:58:58] going to kill the cirrus-import-indices spark jobs, others might be small enough that I'll let them run
[13:59:48] yep, I found the same root cause as well: psycopg2.OperationalError: could not translate host name "postgresql-airflow-search-pooler-rw.airflow-search" to address: Name or service not known
[14:00:41] and it seems that the issue came from the scheduler process itself, not the healthchecks. This is why the earlier fix did not work
[14:00:49] this is an unhandled exception in the main PID
[14:01:12] btullis: are you deploying the rbac fix or should i?
[14:01:47] Applying now. I was just waiting for the rsync.
[14:01:53] Done
[14:06:17] There was a bump in CPU usage around the time of the restart of the scheduler, but it doesn't look like memory was exhausted or anything before this.
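The RBAC fix being discussed amounts to adding the patch verb to the pods rule of the namespace Role bound to the airflow service account, roughly as sketched below. The rule layout and Role name here are assumptions for illustration; the real change is in the linked deployment-charts patch.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: airflow            # hypothetical name
      namespace: airflow-search
    rules:
      - apiGroups: [""]
        resources: ["pods"]
        # "patch" lets a freshly restarted scheduler adopt the worker pods
        # it spawned before crashing, avoiding the 403 above and the
        # resulting duplicate task attempts
        verbs: ["get", "list", "watch", "create", "delete", "patch"]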
https://grafana.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-dse&var-namespace=airflow-search&var-pod=airflow-scheduler-df7986b65-65ddf&var-container=airflow-production
[14:06:41] stopped the spark jobs and paused the cirrus-import-indexes
[14:06:48] I've looked for "psycopg2.OperationalError: could not translate" in logstash, and while I see events occurring in all airflow namespaces, events do not line up across namespaces, so this might not be coredns related
[14:07:08] let me know if you want to test that again (possibly restarting the scheduler pod manually)
[14:07:21] something seems to be occurring in all namespaces independently, at different times
[14:08:37] Here is something interesting at the cloudnativepg level: https://grafana.wikimedia.org/d/cloudnative-pg/cloudnativepg?orgId=1&refresh=30s&var-DS_PROMETHEUS=eqiad%20prometheus%2Fk8s-dse&var-operatorNamespace=cloudnative-pg-operator&var-namespace=airflow-search&var-cluster=postgresql-airflow-search&var-instances=All&from=now-1h&to=now&viewPanel=4
[14:08:54] Longest transaction, 2.5 hours.
[14:09:50] looking in thanos, the pgbouncer pods are steadily in a ready state
[14:10:01] Ack, thanks.
[14:10:29] the scheduler is not in a good state looking at the logs...
[14:10:30] Should we try roll-restarting the scheduler pod, to see if it adopts the running processes now?
[14:10:49] struggling with the k8s api
[14:10:59] urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.67.32.1', port=443): Max retries exceeded with url: /api/v1/namespaces/airflow-search/pods?labelSelector=airflow-worker%3D2745709%2Cairflow_executor_done%21%3DTrue&resourceVersion=0&timeoutSeconds=3600&watch=True (Caused by ConnectTimeoutError(, 'Connection to
[14:11:01] 10.67.32.1 timed out. (connect timeout=30)'))
[14:12:04] Oh, that looks like it is unable to talk to the k8s API. I wonder if I broke it with a different patch.
[14:12:44] o/
[14:12:48] I just merged this, but perhaps I broke it. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1111206
[14:14:19] let me try to revert the networkpolicy selector to something simpler
[14:14:44] Ack, thanks.
[14:15:42] done
[14:19:38] I've changed the selector to `selector: app == 'airflow' && component in { 'scheduler', 'webserver' }` but somehow that does not resolve
[14:23:16] seems to work now
[14:23:30] {◕ ◡ ◕}
[14:23:49] I mean the scheduler can talk to the k8s api
[14:24:52] ok good
[14:25:09] btullis: let's revert that patch and make sure it works in test-k8s first?
[14:26:18] Yep. We will still have to bump the chart number upwards though, right? Will you do it, or shall I?
[14:27:54] can you? 🙏 and yes, we should bump the version. I'm looking at something else atm
[14:28:35] OK, on it.
[14:31:23] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1111237
[14:33:37] it's failing again
[14:34:14] but dunno if related to k8s tho
[14:34:38] 10.64.134.5 - - [14/Jan/2025 14:33:01] "GET /health HTTP/1.1" 503 -
[14:36:29] [2025-01-14T14:26:13.121+0000] {kubernetes_executor.py:760} INFO - Shutting down Kubernetes executor
[14:37:27] perhaps the livenessProbe might be better with http
[14:37:35] because here it seems dead
[14:38:05] btullis, brouberol ^
[14:40:45] Looking at https://github.com/apache/airflow/blob/454e63fd6684d31e5a662aa48d82e2bba8896cac/airflow/utils/scheduler_health.py#L36-L54, it might indicate that either no scheduler job is found, or that it's not alive
[14:40:52] let me check what the API responds
[14:41:15] Ack.
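For context, the simplified selector quoted at 14:19 sits in a Calico NetworkPolicy; a minimal sketch is below. The policy name, the egress rule, and even the apiVersion (which depends on how the Calico objects are applied) are assumptions, not the actual deployment-charts template.

    apiVersion: crd.projectcalico.org/v1
    kind: NetworkPolicy
    metadata:
      name: airflow-scheduler-egress   # hypothetical name
      namespace: airflow-search
    spec:
      # match only the scheduler and webserver pods
      selector: "app == 'airflow' && component in { 'scheduler', 'webserver' }"
      types:
        - Egress
      egress:
        # let those pods reach the Kubernetes API server
        - action: Allow
          protocol: TCP
          destination:
            ports: [443]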
I'm about to deploy the networkpolicy change now.
[14:42:00] it died with the same postgresql-airflow-search-pooler-rw.airflow-search issue but this time it did not stop the container
[14:42:14] it just said [2025-01-14T14:26:13.121+0000] {kubernetes_executor.py:760} INFO - Shutting down Kubernetes executor
[14:43:35] ok, so it seems that the http liveness check was better, in the sense that it would kill the scheduler when needed
[14:44:08] although killing the pod is what triggered the error
[14:44:35] what's happening here is that the scheduler process crashed due to the DNS error, but it left the healthcheck API server up
[14:44:37] But at least with the rbac change, it should now be more resilient to untimely death
[14:44:37] gehel I added some doc to the MLR investigation phab https://phabricator.wikimedia.org/T383048. Hope it makes sense & happy to discuss further! (cc / dcausse Trey314159)
[14:44:56] so the liveness check still passes even if the scheduler job is down
[14:45:06] because it's only checking a tcp port at the moment
[14:45:17] Oh, right.
[14:45:42] so I think we need to revert to liveness using http + the new improved rbac
[14:46:01] like btullis said, the scheduler would still crash, but it'd be able to retrieve its ducklings
[14:47:08] brouberol: Do you want me to revert https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1111162 or will you?
[14:47:58] on it!
[14:48:53] gmodena: thanks!
[14:49:19] Maybe we should relax the check a bit as per: https://github.com/airflow-helm/charts/blob/main/charts/airflow/values.yaml#L694-L699C5
[14:50:59] dcausse I plan to attempt to reproduce mjolnir training on stat boxes between tonight and tomorrow morning. I'll label the spark jobs as "test" or something to avoid confusion (though I don't think we want to restart the dag till the scheduler is in a happier place).
[14:51:20] gmodena: sounds good
[14:51:50] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1111245
[14:53:34] btullis: done
[14:54:43] I'm still seeing k8s API error messages
[14:57:35] Oh, my previous helmfile apply is still running.
[14:59:09] I quit out of it. It's ready for you to try deploying 0.0.75 brouberol.
[15:00:08] gotcha, thanks
[15:05:57] the deployment seems to have had an effect, as all `Completed` containers were re-adopted by the scheduler and ultimately deleted
[15:06:08] s/containers/pods
[15:06:08] {◕ ◡ ◕}
[15:06:56] jayme had a good suggestion over in #k8s-sig. We could be using a DNS fqdn, ending with .svc.cluster.local, to bypass the DNS search domains
[15:08:16] nice
[15:10:01] I'll test something over in airflow-test-k8s
[15:18:19] ok, I have something working. I suggest we hold off on some more rash updates and monitor the state of airflow-test-k8s over the next few hours/days
[15:18:38] btullis: let's now fix the calico selector ?
[15:19:10] OK.
[15:19:12] dcausse: I believe we're now in a state where the airflow-search scheduler will still be killed from time to time, _but_ it should be able to recover
[15:19:59] btullis: actually, before we do, let me see whether I can tweak the `inspect_network_policy` script to handle this more complicated selector grammar
[15:20:00] I am still intrigued by these long transactions, but not 100% sure if they are indicative of a problem. https://grafana.wikimedia.org/d/cloudnative-pg/cloudnativepg?orgId=1&refresh=30s&viewPanel=4
[15:20:18] this way, we'd be able to validate that the pods are matched
[15:20:24] brouberol: Ack. Thanks.
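The direction agreed on above (revert to the HTTP liveness check, now that the rbac fix lets a restarted scheduler re-adopt its pods, and possibly relax it the way the upstream chart does) could look roughly like the sketch below; the path, port and timings are assumptions rather than the values actually deployed.

    livenessProbe:
      httpGet:
        path: /health
        port: 8974           # assumed scheduler health-check port
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 20
      # tolerate a few consecutive failures (e.g. a transient DNS or DB
      # hiccup behind /health) before restarting the scheduler; upstream
      # uses 5
      failureThreshold: 5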
[15:21:13] Did you get the fqdn based pooler address working?
[15:23:11] I did in airflow-test-k8s yes
[15:24:11] airflow@airflow-scheduler-86b74f8f8b-kpn4p:/opt/airflow$ ./usr/bin/pg_pooler_uri
[15:24:11] postgresql://app:XXXX@postgresql-airflow-test-k8s-pooler-rw.airflow-test-k8s.svc.cluster.local:5432/app
[15:25:02] re network policy script: actually that won't work, as the || (or) operator isn't supported in vanilla label selector
[15:25:02] brouberol: thanks for the update, please let me know when I can resume some operations
[15:25:51] dcausse: I think you should be able to resume them now, with the knowledge that the scheduler might get killed, which I believe/hope should recover nicely on its own and shouldn't impact your task
[15:27:01] Is it not worth trying to replicate the DNS change in airflow-search before we restart DAGs? I know that you said "hold off on some more rash updates" but this seems more likely to show if it's working than monitoring airflow-test-k8s
[15:27:18] gmodena: the new info on T383048 looks good!
[15:27:19] T383048: Investigate current MLR models for Search and identify improvements - https://phabricator.wikimedia.org/T383048
[15:32:03] if dcausse is ok with that, I am too
[15:33:26] brouberol, btullis sure please test at will and let me know once you're done
[15:36:24] thanks. btullis I've opened https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1111259
[15:47:17] all deployed
[15:49:38] If I recall correctly, we had to use `sql_alchemy_conn_cmd` instead of `sql_alchemy_conn` in order to support the partially migrated case, where some instance databases are on bare-metal and others are on k8s. Is that right, or is it just the best means of finding the pooler address?
[15:53:03] the issue was that cloudnative-pg-operator creates a secret containing everything required to connect to the PG directly, but not the pooler itself
[15:53:37] Ah yes, I remember. Thanks.
[15:53:52] so we have the env variable PG_URI assigned to .data.uri, and we then "massage" it to convert it into an uri to PGBouncer
[15:54:07] it's an open issue on the operator side IIRC
[15:54:09] So are we good for dcausse to unpause DAGs, or is there anything else we would like to try first?
[15:54:18] I think we're good to unpause
[15:54:28] afk for a bit
[15:54:34] brouberol, btullis ack unpausing, thanks for the fixes!
[16:06:34] well, tentative fixes at least :D
[16:11:17] the output of "kubectl get pods" looks much cleaner now :)
[16:15:00] yep, my OCDs got a little endorphin boost
[16:36:37] dcausse: (for when you have time) do you know what kind of dependencies the Search Platform has on Dumps? We need Wikidata dumps for reloading WDQS, but do we have something else?
[16:36:54] Either on dumps themselves, or on dumps loaded into hadoop?
[16:40:13] gehel: we use RDF dumps loaded into hdfs to build a triple dataset that's then used to: split the graph and prepare n3 files for the data-reload cookbook and the subgraph analysis (done by Aisha)
[16:40:44] and prepare the bootstrap dataset for the wdqs updater flink job (rarely used)
[16:42:00] indirectly the SD team's image suggestion dag is relying on datasets dependent on wikidata json IIRC
[16:42:28] and we still have a sensor on their dataset
[16:47:51] workout, back in ~40
[18:52:02] dinner
[19:25:25] back
[19:47:34] As far as the graph split goes, we are expecting the majority of traffic to go to the main graph, as opposed to the scholarly? If there's any docs on this LMK