[06:58:02] good morning folks
[06:58:49] to proceed with my quest for PKI and kafka, I'd move kafka-main1001's broker to PKI if you are ok (https://gerrit.wikimedia.org/r/c/operations/puppet/+/904667)
[06:59:29] the idea is that any client with 1001 in their connection string (basically all) should try to connect, and fail periodically in case something is not right (so we'll see it via tshark etc.)
[06:59:45] the clients should automatically fall back to 1002 etc.
[07:00:04] lemme know if it is ok to proceed
[08:21:54] (proceeding)
[08:23:08] ack
[12:59:40] FYI I'm moving some k8s alerts to their specific instances, there's no functional change
[12:59:50] change is https://gerrit.wikimedia.org/r/c/operations/alerts/+/905216 and context is https://phabricator.wikimedia.org/T309182
[13:11:12] I'm reviewing https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/899611 and wondering if there's something to be done besides the removal of the files from production-images
[13:11:34] Since it's now published through kokkuri, I imagine not
[13:15:40] On second thought, I should probably ask that question directly in -releng
[15:55:20] elukey: btullis etc. anyone know why not all namespaces from dse-k8s-eqiad are available in prometheus?
[15:55:39] simplest dashboard example here: https://grafana.wikimedia.org/goto/PAb1YVYVz?orgId=1
[15:56:43] https://www.irccloud.com/pastebin/2pxrbAPv/
[15:57:26] interesting, not sure
[15:57:29] e.g. missing flink-operator, stream-enrichment-poc
[15:57:38] rdf-streaming-updater
[15:57:51] we have the #wikimedia-k8s-sig channel, broader k8s audience if you want to post in there
[15:58:02] oh okay
[15:58:09] is this channel more for just wikikube?
[15:58:12] Probably the service doesn't expose metrics ?
[15:58:23] i'm looking for basic k8s pod metrics
[16:00:32] ottomata: https://w.wiki/6XjL doesn't return anything though
[16:00:39] are there pods running in other namespaces ?
[16:00:55] yes
[16:01:14] https://www.irccloud.com/pastebin/BKr7G7ie/
[16:01:33] https://www.irccloud.com/pastebin/xq6LKopy/
[16:03:52] cc also dcausse
[16:08:24] the kubelet does expose it apparently: container_start_time_seconds{container="POD",id="/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podd6e060ee_86c8_4184_a955_30df6c531834.slice/docker-087ccf19f191f8c43c43a01471a4101faeef01b2a1614a5bcd2eb8627f725b1f.scope",image="docker-registry.discovery.wmnet/pause:3.6-1",name="k8s_POD_flink-kubernetes-operator-869f9f954b-7xmdg_flink-operator_d6e060ee-86c8-4184-a955-30df6c531834_0",namespace="flink-operator",pod="flink-kubernetes-operator-869f9f954b-7xmdg"}
[16:09:10] akosiaris: how did you get that? I was trying to find that last week. nsenter + local curl on k8s worker node?
[16:09:41] akosiaris@prometheus1006:~$ curl dse-k8s-worker1005.eqiad.wmnet:10255/metrics/cadvisor
[16:09:49] oh so simple
[16:09:53] is there something special (in puppet?) we have to do to get namespaces ingested by prometheus?
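An aside on the curl above: a minimal sketch of how one could compare the namespaces present in the kubelet's cAdvisor metrics against what the Grafana dashboard shows. The worker hostname and port are the ones quoted in the log; everything else (grep -P, the pipeline itself) is just one illustrative way to do it, not the check that was actually run.

  # List the namespaces that appear in this kubelet's cAdvisor metrics,
  # with a count of series per namespace (sketch only).
  curl -s dse-k8s-worker1005.eqiad.wmnet:10255/metrics/cadvisor \
    | grep -oP 'namespace="\K[^"]+' \
    | sort | uniq -c | sort -rn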
[16:10:11] prometheus has no concept of namespaces
[16:10:23] all it does is talk to the k8s api, discover nodes and pods
[16:10:26] and scrape them
[16:10:43] hm kay
[16:10:58] pods are labeled with the namespace and that's what you see in that dashboard
[16:12:15] right
[16:13:46] found your issue
[16:14:09] for some reason, I don't see the worker having that pod being a valid target in prometheus
[16:14:22] so, not scraped at all
[16:14:46] I see 6 nodes scraped though
[16:14:50] godog: ^
[16:15:07] somehow prometheus isn't scraping some dse-k8s nodes
[16:16:13] (meeting)
[16:17:44] yeah, dse-k8s-worker100{5,6,7,8} aren't discovered for some reason
[16:18:03] oh it's the actual worker nodes?
[16:18:46] interesting, all the new ones
[16:19:21] how does that work compared to the prometheus.io/scrape: "true" label?
[16:19:40] is that for app/pod specific stuff?
[16:22:22] yes, it's unrelated to nodes.
[16:26:28] i ask because i'm also surprised i don't have app metrics, but those could be unrelated problems
[16:29:38] ottomata: IIRC the annotations for the flink pod say port X and that is not exposed by the pod
[16:29:59] so this could be why you don't have app metrics as well
[16:35:35] I have to go shortly but what I'm looking at is this: https://prometheus-eqiad.wikimedia.org/k8s-dse/classic/targets
[16:35:50] and for namespace=flink-operator there's indeed an error
[16:35:52] Get "http://10.67.28.8:9999/metrics": dial tcp 10.67.28.8:9999: connect: connection refused
[16:36:19] that is app metrics, okay looking into that
[16:36:29] would the error keep pod metrics from being scraped too?
[16:36:47] I don't know tbh
[16:37:01] there's also the "k8s-pods-tls" job failing since earlier today ~13:00
[16:37:13] bottom of that page, I don't know what that is about though
[16:37:34] https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&from=1680524944637&to=1680539621561&var-datasource=thanos&var-Filters=job%7C%3D%7Ck8s-pods-tls
[16:37:46] Apr 03 16:34:00 prometheus1006 prometheus@k8s-dse[2488157]: level=error ts=2023-04-03T16:34:00.953Z caller=klog.go:96 component=k8s_client_runtime func=ErrorDepth msg="runtime/asm_amd64.s:1374: Failed to watch *v1.Node: failed to list *v1.Node: nodes is forbidden: User \"prometheus\" cannot list resource \"nodes\" in API group \"\" at the cluster scope"
[16:37:49] there we go ^
[16:38:23] nice find
[16:38:29] elukey: this is probably token/pki related ? I would ping jay.me but he is on PTO
[16:38:32] I have to go afk, will read the scrollback later
[16:38:54] thanks godog, laters
[16:39:21] akosiaris: ah that rings a bell! Lemme check
[16:39:29] ottomata: you probably want to also look at the connection refused thing for "http://10.67.28.8:9999/metrics"
[16:39:32] elukey: are you talking about the flink pod in stream-enrichment-poc or the flink-k8s-operator pod in the flink-operator namespace?
[16:39:42] it's unrelated to this one, but still, look at it
[16:39:50] we actually do have app metrics for the flink app, just not for flink-k8s-operator
[16:39:52] ottomata: no idea, the one that you asked to check on Friday IIRC
[16:39:56] okay
[16:40:18] I recall that the annotations specified port 9999, but with nsenter and netstat I didn't see it
[16:40:27] that would explain the connection refused
[16:40:33] right
[16:41:36] elukey: so that is because the app is not opening/listening on 9999, right? not because k8s doesn't expose it?
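A sketch of the kind of checks behind the findings above: listing what the instance actually discovered via the Prometheus HTTP targets API, and grepping its journal for the service-discovery error. PROM_URL is a placeholder for the k8s-dse instance's base URL (not the real address), and jq is assumed to be available; the journalctl unit name is the one visible in the quoted log line.

  # Which targets did this Prometheus instance discover, and are they healthy?
  curl -s "${PROM_URL}/api/v1/targets" \
    | jq -r '.data.activeTargets[] | [.labels.job, .labels.instance, .health] | @tsv' \
    | sort

  # Service-discovery failures (like the "nodes is forbidden" RBAC error) land in the journal:
  sudo journalctl -u prometheus@k8s-dse -S today | grep -i 'failed to list'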
[16:41:42] the operator container should have
[16:41:45] https://www.irccloud.com/pastebin/8h5q8tJ6/
[16:42:55] yeah correct, if you tell me the pod I can check
[16:43:03] flink-kubernetes-operator-869f9f954b-7xmdg
[16:43:52] it runs on dse-k8s-worker1005.eqiad.wmnet
[16:44:34] elukey@dse-k8s-worker1005:~$ sudo nsenter -t 3548838 -n netstat -nlpt
[16:44:37] Active Internet connections (only servers)
[16:44:39] Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
[16:44:43] tcp6 0 0 :::8085 :::* LISTEN 3548838/java
[16:44:46] ottomata: --^
[16:44:50] so there is no port 9999 that the java process exposes
[16:44:57] only 8085
[16:45:23] okay great. that narrows down my problem then for that one
[16:45:25] i will figure that out.
[16:45:27] thank you.
[16:45:31] np!
[16:45:37] maybe that scrape error then is causing k8s metrics to fail too?
[16:46:45] seems more an issue with k8s perms :(
[16:47:21] ok so out of my hands? :p i'm gonna get some lunch then figure out app metrics. thank yOUuuuU!
[16:47:40] akosiaris: no ok I didn't find anything, it is weird though that we only see the issue in dse
[16:48:35] elukey: maybe we need to restart something ?
[16:49:17] restarting prometheus@k8s-dse.service didn't fix it
[16:49:58] I can try with the dse kube-api
[16:50:08] and ofc... now no nodes scraped at all
[16:50:17] which at least makes things more consistent
[16:50:32] ah lovely :D
[16:50:52] it can at least discover pods, so the user has some perms
[16:50:56] ok then if we restart others we may get into trouble as well
[16:50:57] and only lacks nodes
[16:51:43] should we restart another prometheus instance to see if we have the same issue? That would pinpoint it to 1.23 for sure
[16:52:16] we can, it will happen anyway
[16:52:34] say prometheus@k8s-staging.service
[16:53:05] be bold
[16:53:11] I already did the wikikube one
[16:53:20] ahahahah
[16:53:56] well, I don't see problems
[16:54:18] ah no, scrap that
[16:54:25] it was taking forever to replay the WAL
[16:54:30] and now I see the same exact error
[16:55:04] ## Used by prometheus in k8s 1.16, with 1.23 we use the default system:monitoring
[16:55:05] aha
[16:55:44] akosiaris: I recall this https://gerrit.wikimedia.org/r/c/labs/private/+/883130
[16:56:02] but now I don't have a good idea where to check it
[16:56:24] the new token/tls-certs part is a little fuzzy for me
[16:58:50] akosiaris: it is weird though, from https://phabricator.wikimedia.org/T325268 I think that prometheus was not migrated yet
[16:58:56] or am I too tired now?
[16:59:12] same here, not sure yet
[16:59:14] just digging
[16:59:33] also in the description I see "Prometheus 2.24.1 (bullseye) does not support client cert auth for kubernetes_sd, we would need to have 2.33.5 from bullseye-backports"
[16:59:54] but we should be using the tokens in this case
[17:03:59] wondering if it has anything to do with https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/helmfile_rbac.yaml#41
[17:08:38] ah no ok https://phabricator.wikimedia.org/rLPRIb4bf985c254e052c09364f66f646e445a16c459f
[17:09:57] akosiaris: it seems to me that we'd need to add an RBAC rule for prometheus no?
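To make the RBAC point concrete: the "nodes is forbidden" error says the prometheus user can already read pods but lacks list/watch on nodes at cluster scope. Below is a minimal, upstream-style sketch of the kind of rule being discussed; the names are hypothetical, and the actual fix went through helmfile_rbac.yaml in deployment-charts (the changes linked above), not a raw kubectl apply.

  # Confirm the missing permission (requires impersonation rights on the cluster):
  kubectl auth can-i list nodes --as=prometheus

  # Hypothetical ClusterRole/ClusterRoleBinding of the kind Prometheus kubernetes_sd needs:
  kubectl apply -f - <<'EOF'
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: prometheus-k8s-sd            # hypothetical name
  rules:
    - apiGroups: [""]
      resources: ["nodes", "nodes/metrics", "pods", "services", "endpoints"]
      verbs: ["get", "list", "watch"]
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    name: prometheus-k8s-sd            # hypothetical name
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: prometheus-k8s-sd
  subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: User
      name: prometheus                  # the user named in the "nodes is forbidden" error
  EOF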
[17:10:25] in a 1:1 right now, but I think so, yes
[17:10:58] okok
[17:25:48] something like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/905260/
[17:26:06] and then we add it in puppet private
[17:26:09] not sure about naming though
[17:28:37] going afk, let's restart tomorrow :)
[17:32:09] yup
[17:44:37] <3
[21:16:35] serviceops, ChangeProp, Page Content Service, Product-Infrastructure-Team-Backlog-Deprecated: Mobile-html not purged after action=purge - https://phabricator.wikimedia.org/T333887 (Jgiannelos)
[23:10:34] serviceops, Infrastructure-Foundations, Mail, MediaWiki-extensions-TranslationNotifications: Investigate if TranslationNotification's DigestEmailer.php is really sending emails and what happens to them - https://phabricator.wikimedia.org/T333899 (MarcoAurelio)
[23:46:06] serviceops, Data-Persistence, SRE, Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (RLazarus) I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/892570 would have smoothed this out, at least in part -- we just didn't get it...
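Following up on the RBAC thread above (before the bot notices): one way to verify a change like the linked one, once it lands, would be to restart the instance and re-check the node targets. PROM_URL is the same placeholder as before, and the worker names are the ones mentioned earlier in the log; this is a sketch, not the procedure that was actually agreed on.

  # Restart the dse instance on the Prometheus host and count how many of the
  # previously-missing workers are discovered again.
  sudo systemctl restart prometheus@k8s-dse.service
  curl -s "${PROM_URL}/api/v1/targets" \
    | jq -r '.data.activeTargets[].labels.instance' \
    | grep -c 'dse-k8s-worker100[5-8]'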