[00:03:37] in the 'gitlab-runners' Cloud VPS project, which has its own puppetmaster: instances up to runner-1019 exist and happily use their own puppetmaster, gitlab-runners-puppetmaster-01.gitlab-runners.eqiad1.wikimedia.cloud. but when I just created a new instance, runner-1020, with the same role and same Hiera values, it gets:
[00:03:42] certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain
[00:03:51] for /CN=Puppet CA: gitlab-runners-puppetmaster-01.gitlab-runners.eqiad1.wikimedia.cloud]
[00:04:02] am I missing something?
[00:04:41] or maybe something changed about cert handling on puppetmasters in the meantime, and you just don't notice on existing instances that have been running for a couple of months
[00:05:13] https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster#Step_2:_Setup_a_puppet_client
[00:06:08] that error is 100% expected on new instances, I think, until the previous certs are cleared
[00:06:25] thanks! trying
[00:09:44] bd808: ACK! rm -rf /var/lib/puppet/ssl on the client, make a new cert request, manually sign it on the puppetmaster (no autosigning), and I am past this issue and on to other issues :) so there is always a 'previous' master, makes sense
[00:10:06] even when applying the puppetmaster setting asap
[00:11:06] yeah, the cloud-wide puppetmaster is used to initialize the instance. A.ndrew tried to figure out how to get rid of that once and ended up giving up. :)
[00:12:58] yep, something needs to bootstrap it, *nod*
[10:44:08] !log tools disabled debug mode on the k8s jobs-emailer component
[10:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:45:44] !log toolsbeta disabled debug mode on the k8s jobs-emailer component
[10:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[15:31:48] I'm trying to find the right grafana-labs dashboard for single cloud instance stats
[15:31:59] e.g. the replacement for the 404 at https://grafana-labs.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=deployment-xhgui01:9100&var-datasource=Beta%20Prometheus
[15:32:12] I tried https://grafana-labs.wikimedia.org/d/000000590/instance-details but this seems to give errors for anything I type
[15:32:35] also tried while logged in, but no difference
[16:04:34] Krinkle: maybe https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board?orgId=1&var-project=deployment-prep&var-server=deployment-xhgui03 ?
[16:06:11] bd808: ack, that works
[16:06:38] should I delete these three?
[16:06:47] - https://grafana-labs.wikimedia.org/d/000000019/prometheus-machine-stats
[16:06:49] - https://grafana-labs.wikimedia.org/d/000000590/instance-details
[16:06:56] - https://grafana-labs.wikimedia.org/d/000000023/prometheus-overview-per-project
[16:07:48] instance-details is cross-linked a fair bit, it seems. it also seems useful to have it work directly, since depending on the query you can't always link the project at the same time, I think
[16:13:26] Krinkle: :shrug: all of this is a bit of a mess, really. We still don't have a full Prometheus replacement for the basic metrics collection that we did (do?) with diamond. I think several different people have built dashboards on grafana-labs over time, but to my knowledge none of them collaborated with each other or documented what and why.
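A minimal sketch of the cert re-bootstrap described at [00:09:44] above, assuming a standard Puppet agent/master setup; the signing commands depend on the Puppet version (`puppet cert` on 5 and older, `puppetserver ca` on 6+), and the runner-1020 FQDN is inferred from the project domain shown in the error, not quoted from the log:

```
# On the new client instance: wipe the certs issued by the cloud-wide
# puppetmaster during initial bootstrap, then request a cert from the
# project-local puppetmaster.
sudo rm -rf /var/lib/puppet/ssl
sudo puppet agent --test   # this run fails, but submits a fresh CSR

# On gitlab-runners-puppetmaster-01 (no autosigning, so sign by hand):
sudo puppet cert list
sudo puppet cert sign runner-1020.gitlab-runners.eqiad1.wikimedia.cloud
# Puppet 6+ equivalent:
#   sudo puppetserver ca list
#   sudo puppetserver ca sign --certname runner-1020.gitlab-runners.eqiad1.wikimedia.cloud

# Back on the client: a normal agent run should now succeed.
sudo puppet agent --test
```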
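The `:9100` in the dashboard URLs above is the conventional node_exporter port, which is presumably what the labs Prometheus scrapes for these instance stats. A quick sanity check that an instance is actually exporting metrics, assuming node_exporter is running and the instance is reachable from inside the same Cloud VPS network:

```
# Fetch raw node_exporter metrics straight from the instance (9100
# matches the var-server=...:9100 in the dashboard URLs above).
# Only works from within the Cloud VPS network.
curl -s http://deployment-xhgui03:9100/metrics | grep '^node_load1 '
# prints something like: node_load1 0.12
```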
[16:14:37] every time I think the WMCS team is going to get to put someone on metrics, some other issue comes up or we lose staff, and it goes back to "important but not staffed" again
[19:14:18] when looking at other repos in gitlab, the structure follows WMF teams: repos/sre, repos/releng, repos/security, repos/research. I guess that means repos/wmcs would be the most fitting place for cloud infra stuff, while repos/cloud _could_ be used by users of cloud services
[19:28:20] I did move my repo out of the cloud namespace, though, as requested.
[20:59:03] !log gitlab-runners - pausing runner-1008 from accepting new jobs; hoping it will finish all jobs already queued, and once that is down to 0 I can replace it with a new runner on bullseye (T297659)
[20:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Gitlab-runners/SAL
[20:59:05] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659
[22:01:30] !log gitlab-runners - deleting instance runner-1008 in Horizon and also deleting it in the gitlab admin UI at about the same time T297659
[22:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Gitlab-runners/SAL
[22:01:34] T297659: upgrade gitlab-runners to bullseye - https://phabricator.wikimedia.org/T297659
[22:03:06] !log gitlab-runners - deleting instance runner-1020 and recreating it with the same name but with flavor g3.cores8.ram24.disk20 T297659
[22:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Gitlab-runners/SAL
[22:12:59] I am creating a new instance and it has been in "Task: Scheduling" and "Power State: No State" for 8 minutes. I don't think I have seen it stay in "Scheduling" like that before when creating new instances. Maybe this is that bug where you reuse the name of an instance you deleted shortly before.
[22:18:06] yea, it doesn't happen when I give it a different name; it goes right to "Build" / "Spawning" after a second
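Pausing a runner, as in the [20:59:03] entry, can be done in the GitLab admin UI; for completeness, a sketch of the same drain via the GitLab runners REST API (the runner id 42 and the token are placeholders; `active=false` is the older name for what newer GitLab releases call `paused`):

```
# Pause the runner so it accepts no new jobs but finishes queued ones.
# Needs an admin-scope token; the runner id is shown in the admin UI.
curl --request PUT \
     --header "PRIVATE-TOKEN: <admin-token>" \
     --form "active=false" \
     "https://gitlab.wikimedia.org/api/v4/runners/42"

# Poll until no jobs are still running; then the runner can be deleted.
curl --header "PRIVATE-TOKEN: <admin-token>" \
     "https://gitlab.wikimedia.org/api/v4/runners/42/jobs?status=running"
```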
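One way to sidestep the name-reuse hiccup from [22:12:59], assuming openstack CLI access to the project (on Cloud VPS this is usually done through Horizon instead): confirm the old server is fully gone before creating the replacement with the same name.

```
# Delete the old runner and wait until it no longer shows up; a server
# can linger in a deleting state after the API call returns.
openstack server delete runner-1008
watch -n 10 'openstack server list --name runner-1008'

# If a new build sits in "Scheduling", inspect its task state directly
# rather than guessing from Horizon:
openstack server show runner-1020 -c status -c OS-EXT-STS:task_state
```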