[10:19:44] Ok so the underlying problem I have is that the number of metrics (and their cardinality) means that the result set is so big that Grafana seems to run into a Thanos timeout.
[10:20:02] E.g. this dashboard https://grafana.wikimedia.org/dashboard/snapshot/lcr8lU3tTWEX1ew9nQWwODCFJ4kUGXKW?orgId=1
[10:20:25] most of the time, one of the widgets won't load.
[10:21:37] I'm unsure what to do about this, since the services/metrics won't get smaller in the future. Maybe we need a recording rule that makes aggregates per k8s namespace? I haven't done that in a production env before, so suggestions appreciated.
[10:22:28] Also maybe only querying one site would help, but in those snapshots, the dropdowns at the top were never active. Not sure if that is a grr-preview thing
[10:30:44] klausman: yes a recording rule seems best to me here, check out ./modules/profile/files/prometheus/rules_*.yml in puppet, depending on which prometheus instance you want the recording rule in, or maybe grizzly already supports shipping recording rules? I'm not sure
[10:31:01] I'll have a look
[10:31:17] ok! I'm going to lunch shortly, will read later
[10:32:02] same :)
[11:10:14] Hmm something may be broken with the prometheus k8s cluster in eqiad
[11:10:27] Maybe I broke something putting the new nodes in production?
[11:10:36] https://grafana.wikimedia.org/goto/JR4Grie4z?orgId=1
[11:10:49] node count and api requests are broken
[11:12:51] Looks like the datasource query isn't returning k8s for eqiad anymore
[11:40:07] I changed the query variable to another one, and it works. Weird that the metric disappeared though.
[11:41:46] Ah no it doesn't, it just brings the value back, but all the apiserver metrics are gone
[12:07:55] mmhh interesting claime, I'll take a look too
[12:08:02] godog: Thanks
[12:08:43] For now I've ended up trying to query the kubernetes metrics api from the prometheus server with the certificate and had no luck
[12:09:44] Ah, no, that works
[12:22:53] so the prometheus1005 (k8s) instance is reporting a 401 when talking to https://10.64.0.117:6443/metrics and https://10.64.32.116:6443/metrics
[12:23:58] the last apiserver_request_total datapoint I'm seeing is on Aug 2nd at 8:41
[12:26:13] Seeing nothing in puppet CR merges
[12:27:36] yeah I can't find a smoking gun, I'm attempting a reload
[12:27:47] ack
[12:27:50] not the best approach heh
[12:27:54] Seeing nothing in the puppet logs of the kubemaster
[12:28:02] godog: the Windows admin approach
[12:28:21] "Have you tried turning it off and on again?"
[12:28:29] heheh indeed
[12:28:58] What's very weird to me is that the equivalent curl to what prometheus is supposed to be doing works
[12:31:26] ok restart it is, I think prometheus hasn't reloaded the certs
[12:31:53] so far that's the only reason I can think of
[12:32:41] and systemctl reload prometheus@k8s did nothing earlier
[12:33:21] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-08-04T12:29:26Z is after 2023-08-02T08:43:00Z, v
[12:33:36] >_>
[12:33:36] that's from kubernetes-apiserver on kubemaster1001, smells like a smoking gun
[12:33:42] It does
[12:34:05] Now why does curl say nothing, that's a question for another day
[12:34:24] For my education, where do you see that log?
[12:34:27] because curl is loading the new non-expired cert, prometheus has never picked it up
[12:34:31] Aaaaah
[12:34:35] root@kubemaster1001:/var/log# journalctl -u kube-apiserver.service
[12:34:52] I thought you were seeing that on the prometheus side
[12:35:30] ok I'll restart on 1006 too
[12:35:52] I didn't think to check the kube-api logs smh
[12:36:06] * claime gives his SRE badge back
[12:36:47] heheh that's okay, here's your badge back
[12:36:53] :D
[12:38:21] I have to run an errand now, would you mind filing a task for this issue, claime? definitely need to at least have alerts on the prometheus side
[12:38:56] godog: Sure, I'll do that :) Thanks for the help
[12:39:09] cheers! thanks for reaching out
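
On the recording-rule idea from the start of the log: in puppet these live in the ./modules/profile/files/prometheus/rules_*.yml files godog mentioned, one file per prometheus instance. Below is a minimal sketch of a per-namespace pre-aggregation; the group name, metric names, and label set are illustrative assumptions, not the actual series behind the snapshot dashboard.

```yaml
# Sketch of a Prometheus recording-rule group (hypothetical names):
# pre-aggregate high-cardinality per-pod counters into one series per
# k8s namespace, so dashboard queries hit the small aggregate instead
# of the full result set.
groups:
  - name: k8s_namespace_aggregates
    rules:
      - record: namespace:istio_requests_total:rate5m
        # Collapse pod/container labels; keep only the namespace dimension.
        expr: sum by (namespace) (rate(istio_requests_total[5m]))
      - record: namespace:container_cpu_usage_seconds_total:rate5m
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
```

The dashboard panels would then query the namespace:* series, so the result size scales with the number of namespaces rather than the number of pods/containers, which should keep the queries under the Thanos timeout.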
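
On godog's closing point about alerting from the prometheus side: in this incident the expired client certificate showed up as the apiserver targets answering 401 and apiserver_request_total going stale, i.e. a plain scrape failure. A minimal alerting-rule sketch for that failure mode; the job label, duration, and severity are assumptions, not the real target names in the k8s prometheus instance.

```yaml
# Sketch of a Prometheus alerting rule (hypothetical job label):
# fire when the apiserver metrics endpoints stop being scrapable,
# e.g. because the client cert Prometheus presents has expired.
groups:
  - name: k8s_apiserver_scrape
    rules:
      - alert: KubeApiserverScrapeFailing
        expr: up{job="k8s-api"} == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "apiserver metrics scrape failing on {{ $labels.instance }}"
          description: "Prometheus has not scraped {{ $labels.instance }} for 30 minutes; check client/server certificate validity and the kube-apiserver logs."
```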