[01:37:09] * bd808 off
[07:00:52] <blancadesal>	 dcaro: this is what I'm (still) confused about: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/94/diffs
[07:00:52] <blancadesal>	 where did these changes go? it's the same patch where you removed the stale use_envvar_for_harbor_setup branch, that we now removed again
[08:05:54] <dcaro>	 Ohhh, I see, I think it might have gotten lost in a rebase as it was merged into the other branch, not main 🤦‍♂️
[08:06:51] <dcaro>	 I don't like multi-branches xd
[08:07:08] <dcaro>	 I'll stick to multi-commit in a branch if needed
[08:07:10] <dcaro>	 https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/98
[08:24:55] <blancadesal>	 xd
[11:25:00] <dcaro>	 quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/58
[11:55:50] <blancadesal>	 dcaro: approved! didn't see your message until now, sorry. if you ping me when you need a quick review, I will get to it quicker :)
[11:56:34] <dcaro>	 ack, no problem, it's not urgent
[12:03:16] <dhinus>	 I was also looking at it but got lost in a rabbit hole of how we send the command args to k8s :P
[13:42:51] <dcaro>	 got tired of navigating through gitlab job logs :) https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/99
[14:20:30] <blancadesal>	 dcaro: haven't tested it yet, bit I'm very, very excited about this :)))
[14:21:12] <dcaro>	 I just started using it and it's quite a time saver :) (I'm sure there will be a bunch of bugs and edge-cases, but so far so good)
[15:27:03] <Rook>	 I've been looking at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/toolforge/prometheus.pp where I was previously directed, but I'm not clear on how to include a new k8s cluster that has kube state metrics running on it. How does one direct prometheus to scrape a new k8s cluster and can it be tested in codfw?
[15:56:29] <dcaro>	 Rook: that one is for toolforge prometheus, you'll need to configure something similar for paws prometheus (if this is for paws and paws has a prometheus), or for metricsinfra (the "default" prometheus for cloudvps projects), it shows what entries to add to the prometheus config to do the scraping (the 
[15:56:29] <dcaro>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/toolforge/prometheus.pp#263 part)
[15:58:13] <Rook>	 So if I was hoping to get this scraped (1 p in scraped, english is awful) for monitoring and graphing I would be able to get that into metricsinfra?
[15:59:21] <dcaro>	 I think so, though metricsifra uses 'prometheus-configurator' that generates the prometheus config :/, looking on how to add stuff there (there's a DB also with alerts, but not sure if it has config)
[15:59:38] <dcaro>	 is for a project without it's own prometheus then?
[16:00:00] <Rook>	 Correct, I was under the impression, apparently incorrectly, that we only had one prometheus
[16:00:10] <Rook>	 Is the idea that each project runs its own prometheus/monitoring?
[16:00:57] <dcaro>	 only the ones that need something more than simple stats (ex. toolforge, and paws I think also has it's own)
[16:01:27] <dcaro>	 then we have metricsinfra, that is the default for new projects, and projects that don't have one
[16:02:10] <dcaro>	 there's some info here https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS
[16:02:39] <dcaro>	 it's an unfinished project, as the idea is to allow cloudvps users to define alerts and such (from ui or cli or something), but currently we just have a database that we update manually
[16:33:42] <andrewbogott>	 dhinus: I'll make the task for the bobcat upgrade since I think there's a parent task to attach it to which I want to dig up
[16:33:56] <dhinus>	 ok!
[16:34:37] <Rook>	 dcaro: this is indeed all for paws, which does have prometheus nodes in the project. I was hoping to replicate their function inside of the paws k8s cluster, thus removing the nodes. In my mind those were the nodes the existed to be scraped by metricsinfra. Perhaps they are up to something additional? The only thing that paws does with prometheus is alert when it is down (which if I understand correctly is done through metrics 
[16:34:37] <Rook>	 infra)
[16:35:35] <dcaro>	 let me check the config, for tools at least the tools-prometheus sends the alerts to metricsinfra alertmanager
[16:36:01] <andrewbogott>	 hm, or maybe there isn't a parent task *shrug*
[16:41:09] <dcaro>	 Rook: I think that paws-prometheus might be already scraping k8s, from the config: `- job_name: k8s-apiserver`
[16:41:13] <andrewbogott>	 dhinus: done (such as it is), I made a ticket for trove backups too
[16:41:34] <dhinus>	 thanks!
[16:41:38] <dcaro>	 it also has cadvisor and kube-metrics there
[16:41:45] <dcaro>	 *kube-state-metrics
[16:43:46] <Rook>	 There's a `prometheus-paws.wmcloud.org` directing to one of the nodes, I don't see how that job_name translates to the same, but maybe it does?
[16:44:02] <dcaro>	 it's just failing to connect though ` Failed to watch *v1.Pod: failed to list *v1.Pod:` using the wrong name I guess
[16:45:17] <dcaro>	 Rook: not really, but that should give you access to the prometheus UI if there, where you can query stats, including the ones gathered by that job (that get the label `job="k8s-apiserver"`), but it's failing to scrape it though
[16:45:32] <dcaro>	 et \"https://k8s.svc.paws.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/metrics/pods?limit=500&resourceVersion=0\": dial tcp 172.16.1.171:6443: connect: no route to host"
[16:45:52] <dcaro>	 I'm guessing k8s.svc.paws.eqiad1.wikimedia.cloud is not pointing to the k8s apiserver anymore
[16:46:24] <Rook>	 Could be, how are svc addresses configured?
[16:46:46] <dcaro>	 manually I think, let me check, somewhere under dns
[16:47:22] <dcaro>	 it's under dns->zones
[16:47:42] <dcaro>	 then 'record sets' tab
[16:49:30] <Rook>	 That appears blank under paws...Though I thought some were still active
[16:49:56] <dcaro>	 I see stuff, did you choose the zone?
[16:50:18] <Rook>	 Oh there they are
[16:50:31] <dcaro>	 👍
[16:51:00] <dcaro>	 you can also query the data from the paws prometheus from the grafana ui, selecting the prometheus-paws datasource
[16:51:02] <dcaro>	 https://usercontent.irccloud-cdn.com/file/L1vqKM4F/image.png
[16:51:22] <Rook>	 So ` k8s.svc.paws.eqiad1.wikimedia.cloud. ` could be updated to something that exists, but I believe it is already working. Which if true begs the question of: how?
[16:52:17] <Rook>	 Though I suppose that is less of my goal than figuring out how to remove the prometheus VMs and get them inside of the k8s cluster (With a secondary desire of getting information about the cluster itself)
[16:52:19] <dcaro>	 prometheus is failing to connect with 'no route to host', how do you see it working?
[16:52:44] <Rook>	 It will alert if paws goes down
[16:53:02] <dcaro>	 ah, that does not use the k8s data I think
[16:53:11] <Rook>	 Though if I'm understanding correctly, nothing is really pulling from paws, rather the paws prometheus instances are probably pushing to metrics infra?
[16:53:31] <dcaro>	 I think so yes, let me verify
[16:58:39] <dcaro>	 Or no, I think the alerts are on the metricsinfra database, as they are not using the stats from paws prometheus at all
[16:58:54] <dcaro>	 https://www.irccloud.com/pastebin/xq3WYG1m/
[16:59:17] <dcaro>	 it's just checking that it can scrape jupyterhub
[16:59:49] <dcaro>	 looking for that job def
[17:01:26] <dcaro>	 here it is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/wmcs/paws/prometheus.pp#113
[17:01:53] <Rook>	 Where do the alerts from the db get their data?
[17:03:03] <andrewbogott>	 dcaro: is it disruptive if I cause a few restarts/few minute downtimes for harbordb?
[17:04:11] <dcaro>	 Rook: manually set
[17:04:34] <Rook>	 I mean, where do they get the data to decide to alert or not?
[17:04:36] <dcaro>	 andrewbogott: it blocks anyone from pushing/pulling images while the db is down
[17:04:53] <andrewbogott>	 ok, I'll start with a different victim then
[17:05:23] <dcaro>	 Rook: that metric is generated by the metricsinfra prometheus, that has this in it's config
[17:05:30] <dcaro>	 https://www.irccloud.com/pastebin/cWjYOrTv/
[17:08:07] <Rook>	 So it's just querying a URL, that I guess is partially defined somewhere else?
[17:09:08] <Rook>	 Or scraping jupyterhub for metrics that zero to jupyterhub has developed for being scraped by prometheus? (and metrics infra is prometheus?)
[17:10:51] <dcaro>	 so metricsinfra prometheus, is scraping hub-paws.wmcloud.org:443/hub/metrics directly, and alerting on the fact that the get request to that url returned 200 OK
[17:11:18] <Rook>	 So I could remove those two vms and have a net loss of 0?
[17:12:10] <dcaro>	 as it is I think so yes, they lost their utility when the k8s ip they were connecting to stopped working
[17:13:38] <dcaro>	 you can try turning them off (or stopping prometheus there or something)
[17:13:55] <Rook>	 So if I wanted to get cluster usage data, it would be inappropriate to use metrics infra to aggregate that? That project is only meant for alerting?
[17:15:49] <dcaro>	 not really, it also gathers basic VM level data. To fetch the k8s one would mean adding project specific config to it, that for what I'm seeing, the prometheus-configurator  might be able to do
[17:16:17] <dcaro>	 it might require authentication and such, that means entangling a bit both projects though
[17:16:39] <Rook>	 So metrics infra is not a place to send data to make graphs out of?
[17:17:58] <dcaro>	 I'd say not right now? I guess it can be whatever we want it to be, let me check if it does support custom scraping configs already
[17:19:09] <dcaro>	 it should be possible to add a custom scrape target, this means that if you have an endpoint (say k8s.svc.paws.eqiad1.wikimedia.cloud/metrics ), it can scrape it
[17:19:38] <Rook>	 And it is keeping that data around, which is what allows it to do a duration value for the alerts?
[17:20:16] <dcaro>	 yes, not sure what's the current retention, but should be in the order of months
[17:20:52] <Rook>	 Though our current assumption is that if you want more than alerting data you should be setting up your own prometheus/graphana stack in the project that is being observed?
[17:21:56] <dcaro>	 it does not seem to have a way to add authentication though, or configure more specific endpoints (like k8s internal stuff and discovery like https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/toolforge/prometheus.pp#423)
[17:22:14] <taavi>	 dcaro: uh, no, that's not how any of this works
[17:22:21] <dcaro>	 nice :)
[17:22:29] <dcaro>	 taavi: please correct me!
[17:23:21] <taavi>	 so in promethes stacks, prometheus itself is the thing that collects data from "endpoints" (individual things being monitored), stores it, and exposes an API to query it
[17:24:14] <dcaro>	 ok, we agree on that
[17:24:20] <taavi>	 in addition, you have related components like alertmanager (which, uh, sends out alerts a prometheus server tells it about), or grafana (which plots data from a prometheus server on a nice dashboard)
[17:24:34] <dcaro>	 ack
[17:25:10] <taavi>	 you can have multiple different prometheus servers using a single alertmanager, or grafana, or vice versa
[17:25:24] <dcaro>	 yep
[17:26:02] <taavi>	 we have one prometheus server pair in the metricsinfra project, which monitors all cloud vps instances and some individual services it's configured to monitor
[17:26:33] <dcaro>	 yes, and does that using the data from the prometheus-configurator right? that ends up a manually maintained db
[17:27:00] <taavi>	 yes, that is how the config for that one specific prometheus instance is generated
[17:27:11] <taavi>	 in addition, the toolforge project has a prometheus server that monitors the toolforge k8s cluster and related services
[17:27:25] <dcaro>	 ack, I follow
[17:27:35] <taavi>	 paws had a similar setup to toolforge, but it was broken when paws was moved from kubeadm to magnum
[17:27:58] <dcaro>	 ok, that's why the k8s.svc.paws.* entry does not reply anymore
[17:28:06] <taavi>	 all of these prometheus instances are independent, they don't scrape/"pull data from" each other
[17:28:17] <dcaro>	 ack, yes
[17:29:17] <taavi>	 however, there is a common set of alertmanager and grafana instances, also in the metricsinfra project, that both the metricsinfra prometheus server and the toolforge prometheus server use to draw dashboards and send alerts
[17:29:39] <dcaro>	 yes, those are in metricsinfra right?
[17:29:53] <dcaro>	 (grafana + alertmanager)
[17:30:07] <taavi>	 yes, those are also in that project
[17:30:16] <dcaro>	 and paws-prometheus pushed alerts to that metricsinfra-alertmanager
[17:30:37] <dcaro>	 (still does, though the alert is only for config refresh, so not triggered often)
[17:30:46] <taavi>	 toolforge at least does, I don't recall if paws moving to magnum predates that setup
[17:30:59] <dcaro>	 just checked the config before
[17:31:30] <dcaro>	 did not paste it though, but yes, it's configured to send alerts to metricsinfra (might not work though)
[17:32:02] <Rook>	 So if I wanted to make a graph of the number of pods running on paws over the last month, I would install kube state metrics, prometheus and grafana into the paws cluster and access it through paws? 
[17:32:43] <dcaro>	 that's an option yes (given that we fix the prometheus in paws to connect to k8s)
[17:33:00] <dcaro>	 another option would be to make metricsinfra prometheus scrape paws-k8s data
[17:33:12] <dcaro>	 taavi: is that correct?
[17:33:18] <Rook>	 in my mind prometheus would be inside the k8s cluster, so hopefully would be able to talk to k8s
[17:33:33] <taavi>	 the tl;dr is that you should indeed install kube-state-metrics to the paws cluster, and then either fix the current puppetized paws-prometheus instances to scrape them, or indeed install prometheus inside paws
[17:34:22] <Rook>	 So in putting kube-state-metrics and prometheus inside of the paws cluster, does that let me make a graph of the number of running pods somewhere?
[17:34:26] <dcaro>	 Rook: if you want to setup prometheus inside the cluster, all that needs changing is the grafana config to use paws-prometheus-in-k8s as a datastore (publicly expose that k8s prometheus url)
[17:34:37] <taavi>	 then the metricsinfra grafana instance (grafana.wmcloud.org) can be configured to include that prometheus instance as a data source, and we can make a dashboard on grafana.wmcloud.org that will graph that data
[17:35:11] <dcaro>	 we just said the same thing it two different ways xd
[17:35:32] <taavi>	 prometheus will load ("scrape") data from endpoints like kube-state-metrics, store it, and expose an API endpoint that grafana can use to load that data and draw a chart based on it
[17:35:47] <Rook>	 Ok, I've got part of that. I'll see about installing the other bit then ask about how to include that in grafana. 
[17:35:54] <Rook>	 Can this be tested in codfw1dev?
[17:37:15] <taavi>	 at least the prometheus and kube-state-metrics parts can
[17:37:56] <dcaro>	 summarizing, all the options need kube-state-metrics:
[17:37:56] <dcaro>	 * prometheus inside paws k8s + metricsinfra grafana using it as datastore
[17:37:56] <dcaro>	 * prometheus as VMs in paws (as they are) + metricsinfra grafana using them as datastore (already configured)
[17:37:56] <dcaro>	 * prometheus from metricsinfra scraping kube-state-metrcs from paws directly + metricsinfra grafana using prometheus-metricsinfra as datastore (already configured)
[17:38:00] <taavi>	 I don't have a metricsinfra test setup with grafana there at the moment, but it's something that I've been meaning to set up anyway and is fairly well automated so we should be able to set one up fairly easily
[17:38:29] <taavi>	 I would not consider metricsinfra-prometheus scraping inside the paws-k8s cluster an option here, for two reasons
[17:38:59] <taavi>	 first, it's not something the current config database supports, and figuring out certificates etc. for authentication is non-trivial
[17:39:27] <dcaro>	 yep, that was my comment before
[17:39:32] <Rook>	 Ok, I'll see where I can get prometheus inside the paws cluster in codfw1dev to offer an endpoint to be scraped. thank yinz
[17:39:42] <taavi>	 second, kubernetes data tends to be fairly large from a storage/processing resources pov, and so I'd rather have it in a separate instance at least at the moment
[17:39:53] <dcaro>	 fair enough
[17:40:02] <taavi>	 Rook: yw, and just ping me if I can be any more helpful with that
[17:40:07] <dcaro>	 Rook: sounds good, that would allow you to get rid of the VMs
[17:40:19] <Rook>	 👍
[17:41:11] <dcaro>	 btw. taavi I'm guessing that you are testing something on toolsbeta-mail* ? there's puppet failing there
[17:41:59] <taavi>	 uh indeed, I did an OS upgrade there and apparently forgot to shutdown the old VM. will fix, thanks
[17:43:06] <dcaro>	 👍 np
[17:47:18] <taavi>	 btw, still looking for a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/994167 before friday
[17:47:39] <taavi>	 aiui that's the last patch needed to not get blocked by google
[17:49:56] <dcaro>	 taavi: can you rerun pcc after rebasing? I think that the current one includes the changes from the previous patches
[17:50:30] <taavi>	 sure, one moment
[17:52:46] <taavi>	 dcaro: https://gerrit.wikimedia.org/r/c/operations/puppet/+/994167
[17:52:52] <taavi>	 sorry, https://puppet-compiler.wmflabs.org/output/994167/1254/tools-mail-4.tools.eqiad1.wikimedia.cloud/fulldiff.html
[17:52:56] <dcaro>	 👍
[17:54:47] <dcaro>	 taavi: looks way nicer xd, on the other pcc though, tools-mail-03 did not get the arc-sign line (https://puppet-compiler.wmflabs.org/output/994167/1245/tools-mail-03.tools.eqiad1.wikimedia.cloud/corediff.html), is that expected?
[17:56:14] <taavi>	 yes, tools-mail-03 was running buster. the first version of the PS had a condition to only setup ARC on bookworm where it's supported, but tools-mail-03 and the buster mail server in toolsbeta are now offline so I removed it
[17:56:44] <dcaro>	 ack
[18:01:30] <taavi>	 `arc=pass`
[18:01:38] <taavi>	 it seems to work :-P
[18:02:30] <dcaro>	 \o/
[18:08:33] * dcaro off
[19:10:07] * bd808 lunch