[07:10:16] I'll be rebooting bast1003/bast3005 in 20 minutes
[07:36:02] both bastions are back up now
[08:19:47] heads-up; I'll be rebooting deploy1002 in ~5 minutes
[08:32:08] and back up again
[09:17:19] is there a way to downtime a single icinga check across multiple hosts?
[09:19:47] I don't think so, programmatically, but what I do is search the check name on the web and usually it is quite easy
[09:19:51] (in this case, i want to downtime 'Check systemd state' for all prometheus hosts)
[09:23:23] jynus: ah. i didn't realise that you could search by check name. the 1860 results are a bit much, but still, this is useful. thanks!
[09:23:45] yeah, it works better when the check is only on a few hosts :-(
[09:36:27] kormat: yes
[09:36:30] the downtime cookbook
[09:37:05] volans: i couldn't see any flags on it for specifying a check
[09:37:11] (at least spicerack has the capability, checking if it was exposed there or not)
[09:38:40] kormat: so, it's not exposed to the cookbook, but it can be easily used from a REPL if needed, patches are welcome :)
[09:38:56] https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.IcingaHosts.downtime_services
[10:53:50] vgutierrez: Am I OK to puppet-merge "Log emergency messages to disk" (887dc7cb96)?
[10:53:55] yup
[10:54:01] go ahead please
[10:54:13] done, thanks :)
[10:54:17] thx
[11:03:41] what again is the point of "predictable" network interface names if they keep changing with every major OS release? for a random Ganeti server:
[11:03:43] eno1 (stretch) -> ens3f0np0 (buster) -> enp175s0f0np0 (bullseye)
[11:04:26] looking forward to bookworm, maybe enX8493s0044m888dfm4fffe4 or so?
[11:04:29] not gonna defend the way predictable interface names are implemented, but in this case this may be more on ganeti and/or qemu moving the card to a different PCI slot
[11:05:27] sure, but if the PCI assignments are seemingly random that makes the whole assumption of predictability kinda moot...
[14:37:05] predictable as long as "HW" doesn't change
[14:37:11] with VMs that's kinda volatile as well
[14:38:05] but I'm not the one that's going to defend predictable interface names here :)
[14:49:30] how do operational metrics like QPS get from individual services to grafana these days? Where are they aggregated?
[14:54:49] ori: basically graphite is around for mediawiki, prometheus for ~everything else, and there's some global aggregation done by thanos
[14:55:22] * bd808 guesses about the same as godog has confirmed
[14:57:05] bd808: o/ great guess
[14:57:35] I was making assumptions based on what I've learned about the k8s cluster with Toolhub :)
[14:59:24] heheh, I hope that means you didn't have to deal with the graphite bits
[15:00:31] heh. no, just learned more about prometheus, which I have somehow been avoiding
[15:02:18] *nod* only tangentially related, but web access to the prometheus interface per-site is coming soon
[15:04:46] https://thanos.wikimedia.org/ got me what I needed for exploring.
[15:05:29] inevitable!
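As a rough illustration of the exploring ori mentions: Thanos Querier speaks the standard Prometheus HTTP query API, so a QPS-style query can be run against it directly. This is a sketch only; the metric and label names below are made-up placeholders, and thanos.wikimedia.org may sit behind access controls.

    import requests

    THANOS = "https://thanos.wikimedia.org/api/v1/query"

    # Hypothetical PromQL: per-second request rate over the last 5 minutes.
    # Substitute a real metric/label set taken from the Grafana dashboards.
    promql = 'sum(rate(http_requests_total{service="example"}[5m]))'

    resp = requests.get(THANOS, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        timestamp, value = series["value"]
        print(series["metric"], value)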
[15:05:50] * bd808 looks for the 1 timeline where he survives
[15:07:54] lolz
[15:13:53] <_joe_> ori: more in detail, services that use the node template all have prometheus metrics export baked in
[15:14:28] <_joe_> and they're standardized, so once your service is deployed to k8s, it takes a few clicks to have a grafana dashboard like
[15:14:53] <_joe_> https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?orgId=1
[15:15:27] ak.osiaris documented the scraper config magic for pods recently too -- https://wikitech.wikimedia.org/wiki/Kubernetes/Metrics#Workload/Pod_metrics
[15:15:30] <_joe_> also the envoy that acts as a tls terminator/service proxy exports more metrics, also collected
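For context on "metrics export baked in": the template _joe_ refers to is the Node.js service template, which wires this up for you. Purely as an illustration of the pattern, here is a minimal Python sketch using the official prometheus_client library; the metric names and port are arbitrary, and the scraper side is configured as described on the wikitech page linked above.

    from prometheus_client import Counter, Histogram, start_http_server
    import random
    import time

    # Arbitrary example metrics; real services define their own.
    REQUESTS = Counter("app_requests_total", "Requests handled", ["method", "status"])
    LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

    def handle_request():
        """Stand-in for real request handling."""
        with LATENCY.time():
            time.sleep(random.uniform(0.01, 0.05))
        REQUESTS.labels(method="GET", status="200").inc()

    if __name__ == "__main__":
        start_http_server(9100)  # serves /metrics on :9100 for Prometheus to scrape
        while True:
            handle_request()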