[10:19:44] Ok so the underlying problem I have is that the number of metrics (and their cardinality) means that the result set is so big that Grafana seems to run into a Thanos timeout.
[10:20:02] E.g. this dashboard https://grafana.wikimedia.org/dashboard/snapshot/lcr8lU3tTWEX1ew9nQWwODCFJ4kUGXKW?orgId=1
[10:20:25] most of the time, one of the widgets won't load.
[10:21:37] I'm unsure what to do about this, since the services/metrics won't get smaller in the future. Maybe we need a recording rule that makes aggregates per k8s namespace? I haven't done that in a production env before, so suggestions appreciated.
[10:22:28] Also maybe only querying one site would help, but in those snapshots, the dropdowns at the top were never active. Not sure if that is a grr-preview thing
[10:30:44] klausman: yes a recording rule seems best to me here, check out ./modules/profile/files/prometheus/rules_*.yml in puppet, depending on which prometheus instance you want the recording rule in, or maybe grizzly already supports shipping recording rules? I'm not sure
[10:31:01] I'll have a look
[10:31:17] ok! I'm going to lunch shortly, will read later
[10:32:02] same :)
[11:10:14] Hmm something may be broken with the prometheus k8s cluster in eqiad
[11:10:27] Maybe I broke something putting the new nodes in production?
[11:10:36] https://grafana.wikimedia.org/goto/JR4Grie4z?orgId=1
[11:10:49] node count and api requests are broken
[11:12:51] Looks like the datasource query isn't returning k8s for eqiad anymore
[11:40:07] I changed the query variable to another one, and it works. Weird that the metric disappeared though.
[11:41:46] Ah no it doesn't, it just brings the value back, but all the apiserver metrics are gone
[12:07:55] mmhh interesting claime, I'll take a look too
[12:08:02] godog: Thanks
[12:08:43] For now I've ended up trying to query the kubernetes metrics api from the prometheus server with the certificate and had no luck
[12:09:44] Ah, no, that works
[12:22:53] so the prometheus1005 (k8s) instance is reporting a 401 when talking to https://10.64.0.117:6443/metrics and https://10.64.32.116:6443/metrics
[12:23:58] the last apiserver_request_total datapoint I'm seeing is on Aug 2nd at 8:41
[12:26:13] Seeing nothing in puppet CR merges
[12:27:36] yeah I can't find a smoking gun, I'm attempting a reload
[12:27:47] ack
[12:27:50] not the best approach heh
[12:27:54] Seeing nothing in the puppet logs of the kubemaster
[12:28:02] godog: the Windows admin approach
[12:28:21] "Have you tried turning it off and on again?"
[12:28:29] heheh indeed
[12:28:58] What's very weird to me is that the equivalent curl to what prometheus is supposed to be doing works
[12:31:26] ok restart it is, I think prometheus hasn't reloaded the certs
[12:31:53] so far that's the only reason I can think of
[12:32:41] and systemctl reload prometheus@k8s did nothing earlier
[12:33:21] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-08-04T12:29:26Z is after 2023-08-02T08:43:00Z, v
[12:33:36] >_>
[12:33:36] that's from kubernetes-apiserver on kubemaster1001, smells like a smoking gun
[12:33:42] It does
[12:34:05] Now why does curl say nothing, that's a question for another day
[12:34:24] For my education, where do you see that log?
[12:34:27] because curl is loading the new non-expired cert, prometheus has never picked it up
[12:34:31] Aaaaah
[12:34:35] root@kubemaster1001:/var/log# journalctl -u kube-apiserver.service
[12:34:52] I thought you were seeing that on the prometheus side
[12:35:30] ok I'll restart on 1006 too
[12:35:52] I didn't think to check the kube-api logs smh
[12:36:06] * claime gives his SRE badge back
[12:36:47] heheh that's okay, here's your badge back
[12:36:53] :D
[12:38:21] I have to run an errand now, would you mind filing a task for this issue, claime? definitely need to at least have alerts on the prometheus side
[12:38:56] godog: Sure, I'll do that :) Thanks for the help
[12:39:09] cheers! thanks for reaching out
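
On the recording-rule idea from the start of the log: in puppet these live in the ./modules/profile/files/prometheus/rules_*.yml files godog mentioned, one file per prometheus instance. Below is a minimal sketch of a per-namespace pre-aggregation; the group name, metric names, and label set are illustrative assumptions, not the actual series behind the snapshot dashboard.

```yaml
# Sketch of a Prometheus recording-rule group (hypothetical names):
# pre-aggregate high-cardinality per-pod counters into one series per
# k8s namespace, so dashboard queries hit the small aggregate instead
# of the full result set.
groups:
  - name: k8s_namespace_aggregates
    rules:
      - record: namespace:istio_requests_total:rate5m
        # Collapse pod/container labels; keep only the namespace dimension.
        expr: sum by (namespace) (rate(istio_requests_total[5m]))
      - record: namespace:container_cpu_usage_seconds_total:rate5m
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
```

The dashboard panels would then query the namespace:* series, so the result size scales with the number of namespaces rather than the number of pods/containers, which should keep the queries under the Thanos timeout.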
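
On godog's closing point about alerting from the prometheus side: in this incident the expired client certificate showed up as the apiserver targets answering 401 and apiserver_request_total going stale, i.e. a plain scrape failure. A minimal alerting-rule sketch for that failure mode; the job label, duration, and severity are assumptions, not the real target names in the k8s prometheus instance.

```yaml
# Sketch of a Prometheus alerting rule (hypothetical job label):
# fire when the apiserver metrics endpoints stop being scrapable,
# e.g. because the client cert Prometheus presents has expired.
groups:
  - name: k8s_apiserver_scrape
    rules:
      - alert: KubeApiserverScrapeFailing
        expr: up{job="k8s-api"} == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "apiserver metrics scrape failing on {{ $labels.instance }}"
          description: "Prometheus has not scraped {{ $labels.instance }} for 30 minutes; check client/server certificate validity and the kube-apiserver logs."
```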