[00:21:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:21:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:21:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[09:45:06] netops, Infrastructure-Foundations, SRE: Routinator 0.14 causing tempfs file system to fill up - https://phabricator.wikimedia.org/T383116#10487931 (MoritzMuehlenhoff) a:MoritzMuehlenhoff 0.14.1 is out, I'll import and upgrade
[10:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:51:26] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:28:21] netops, Infrastructure-Foundations, SRE: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488267 (fgiunchedi) I'm assuming you meant "this won't be too hard", anyways the simplest solution off the top of my head would be to have a map network...
[12:26:13] netops, Infrastructure-Foundations, SRE: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488524 (cmooney) >>! In T384345#10488267, @fgiunchedi wrote: > I'm assuming you meant "this won't be too hard", anyways the simplest solution off the top...
[12:28:39] netops, Infrastructure-Foundations, SRE: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10488529 (cmooney)
[12:28:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488530 (cmooney)
[12:51:28] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488586 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a6b392ba-8b36-4fa0-8d3d-10c8b2d2eb48) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th...
[13:29:39] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488748 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f0f61f83-b1f7-48c8-9e4a-2e436917a7d3) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[13:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:42:23] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10488927 (cmooney) So I rolled-back the patch to collect the BGP metrics. The config puppet produced worked fine in magru and esams, but for some reason in eqiad stats...
[14:44:08] topranks: we could try adding more vCPU again I suppose
[14:44:23] yeah it's an option
[14:44:39] though glancing at htop it seems two of them were pegged at 100%, with the other two fairly idle
[14:44:52] hm we might indeed need separate instances then
[14:46:14] do we already have a docker image?
[14:46:22] no
[14:46:39] there may also be a config that works better with our setup
[14:47:07] i.e. if the prometheus output is built in advance of scraping events etc
[14:47:33] I was looking at their K8s example, they use consul - do we have an equivalent?
[14:47:35] https://gnmic.openconfig.net/deployments/clusters/kubernetes/cluster_with_prometheus_output/
[14:49:41] Arzhel was chatting to a guy who had a huge deployment using it with 15s metrics and working well, on the Nokia discord I think. When he returns we might be able to tax that contact for some advice too.
[14:50:09] The BGP stuff isn't urgent - I started looking as there is a LibreNMS bug in it at the moment
[14:51:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:56:29] locker:
[14:56:31] # type of locker, only consul is supported currently
[14:56:33] sigh
[14:56:39] it's not like k8s includes native support for leader election or anything
[14:57:04] this is kind of a silly design on their part IMO
[14:57:11] that sucks
[14:57:23] well, we could just not use the automatic clustering mode, and split things up manually
[14:58:53] yep I'm sure in theory possible, just more effort on our part
[14:59:52] hm they also imply you want to be running NATS as well
[15:01:39] I don't know what that is but I don't like the name :P
[15:01:58] (tbh NAT's not all bad just sticking to form as a neteng)
[15:02:24] topranks: it's golang kafka, basically
[15:02:26] https://github.com/openconfig/gnmic/issues/344
[15:02:28] hmmm
[15:02:31] the docs are wrong?? there are other lockers?
[15:02:56] entirely possible
[15:03:30] this says only consul too, but perhaps out of date: https://gnmic.openconfig.net/user_guide/HA/
[15:03:37] yeah it's out of date since last year
[15:03:52] there are example configs for redis and for k8s-api Lease objects
[15:03:57] the latter is what I would try first
[15:04:05] https://kubernetes.io/docs/concepts/architecture/leases/
[15:04:24] ok that's hopeful anyway
[15:04:57] do we already use that elsewhere?
[15:05:19] well, yes -- it's how nodes report their liveness to the k8s control plane 😅
[15:05:24] I don't know if we have any apps using it though
[15:06:01] oh ofc, cert-manager also uses it
[15:06:14] and flink and kserve and a few other pieces
[15:06:16] so yeah :)
[15:08:07] lol, a juniper horror story for you topranks https://github.com/openconfig/gnmic/issues/261
[15:08:44] > Also opened a case with juniper and they said its by design and to "just use grep" 🤦
[15:09:26] hahahaha
[15:09:34] I literally returned to this tab to paste that in :D
[15:11:24] Tbh the Juniper side of things has mostly been ok, gnmic has given us more issues on the prometheus output side. The subscriptions seem to work and are quick.
[15:12:11] hmm
[15:12:13] https://github.com/workfloworchestrator/gnmic-cluster-chart
[15:12:35] I wonder if we really need kafka or some other intermediary
[15:12:44] I'm not sure why the proms couldn't just scrape each gnmic
[15:12:51] (or each gnmic do remote write)
[15:14:07] ok, the gnmic example k8s objects don't do that: https://github.com/openconfig/gnmic/tree/main/examples/deployments/2.clusters/2.prometheus-output/kubernetes/gnmic-app
[15:14:47] topranks: how close is the configmap.yaml file there to our current configuration?
[15:15:47] the one on that link is very basic
[15:15:54] we don't have any "clustering" configured on ours
[15:16:14] and we have a bunch of "targets" and "subscriptions" defined, as well as a single output (prometheus, same as the example)
[15:16:39] so overall it's fairly similar, I guess in the cluster setup you don't statically configure the targets and subscriptions in the config file, hence them being empty there
[15:18:35] topranks: no I think you do and it's a placeholder
[15:18:59] I think they all get the config file, with stuff filled in, they elect a leader, and then the leader makes assignments
[15:19:01] ok yep, well we have a bunch of things in there but no reason to think they won't slot in
[15:19:05] cool
[15:19:18] what binaries have we been using?
[15:19:29] we've a few other bits too, like for auth (rancid user/pass + the tls cert stuff)
[15:19:31] do we have our own deb or are we just using the github releases?
[15:19:40] that's a good question
[15:20:06] installed deb anyway
[15:20:10] cmooney@netflow1002:~$ sudo dpkg -l | grep gnmic
[15:20:10] ii gnmic 0.39.0 amd64 gNMI CLI client and collector
[15:20:51] "ensure_packages(['gnmic'])" in puppet but I'll need to dig into it more, Arzhel set it up originally
[15:22:53] it comes from our bookworm-wikimedia APT
[15:23:12] so it was either built or imported
[15:23:40] I think based on the original task we just imported it from here
[15:23:41] https://github.com/openconfig/gnmic/releases/
[15:24:15] there is a point update perhaps we should do that again
[15:29:44] how would I go about that?
[15:30:10] download to an apt host and add with reprepro ?
[15:30:37] yeah for golang binaries we often just import the debs
[15:32:47] ok
[15:33:02] fwiw I'd be reasonably confident about the maintainers on this one
[15:33:45] topranks: yeah also it's github-actions generating the binaries
[15:33:52] I suspect that what Arzhel did was this procedure https://wikitech.wikimedia.org/wiki/Reprepro#Copying_between_distributions
[15:34:18] the `includedeb` variant discussed second
[15:35:12] https://sal.toolforge.org/production?p=0&q=includedeb&d=
[15:39:13] ok thanks, yeah spying the bash history that's exactly what he did
[15:39:48] so I could do the same for the updated one? the repo will clock the later version and supply that if the package is requested?
[15:40:08] yeah as long as you add it to the same component etc
[15:40:20] gotcha yeah makes sense
[15:40:44] might be worth a try, the point fix is just a small thing, and there is no related comment/issue to explain the PR
[15:41:16] but it does touch the same output processors we are using, so maybe there is a performance benefit
[15:43:36] yeah it's plausible
[15:59:45] cdanis: so that seemed to work ok
[15:59:57] how would I roll it out to the netflow VMs?
[16:00:09] you could use debdeploy, but for this, i would just do it by hand :)
[16:00:47] ok yep. puppet run did not do it but that makes sense
[16:01:24] there's debdeploy docs at https://wikitech.wikimedia.org/wiki/Software_deployment but really it's overkill for this
[16:01:30] so just apt-get update and such :)
[16:03:57] yep thanks, all done :)
[17:00:18] netops, Hiddenparma, Infrastructure-Foundations, Prod-Kubernetes, Kubernetes: Allow reaching services on the aux k8s cluster bypassing the CDN - https://phabricator.wikimedia.org/T382269#10489758 (CDanis) >>! In T382269#10458292, @CDanis wrote: > I am wondering if we really need the ability t...
[17:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[17:54:26] CAS-SSO, Infrastructure-Foundations, Phabricator: Phabricator should use IDP for developer account logins - https://phabricator.wikimedia.org/T377061#10489965 (Aklapper) I find [a lot of tasks](https://phabricator.wikimedia.org/maniphest/?ids=256628,305874,256958,267186,159584) linking to changes in...
[18:04:49] RESOLVED: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:51:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:04:09] CAS-SSO, netbox, Infrastructure-Foundations, Patch-For-Review: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702#10490265 (taavi) After being blocked on this and a few questions and pings on IRC with no response I've decided to fix this myself. What I did in the database was...
[19:41:37] netops, Infrastructure-Foundations, SRE: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345#10490388 (cmooney) Open→Resolved a:cmooney This is working now {F58260515 width=700}
[21:12:00] netops, Infrastructure-Foundations, SRE: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10490568 (cmooney) Open→Resolved Gonna close this one for now, the balance is better with the changes we added and we can review as time goes on.
[21:14:22] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490586 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7b39f587-684b-42ab-a96c-cf552c03a29d) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[21:33:09] netops, Infrastructure-Foundations, SRE: Manage fundraising network elements from Netbox - https://phabricator.wikimedia.org/T377996#10490638 (cmooney)
[21:38:24] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490654 (cmooney) Fwiw I thought I saw a potential optimisation to allow us to go back to the "on change" style subscription. gNMIc has a parameter that can be configu...
[21:59:21] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490691 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3f0feb1a-6c73-4906-bb5a-2df62eb7e156) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th...
[22:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:51:26] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:06:02] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490768 (cmooney) The current configuration we have requires us to enable [[ https://gnmic.openconfig.net/user_guide/caching/ | gnmic caching ]], as we group certain me...
[23:11:14] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10490804 (cmooney) FWIW I used the config from P72314 in the most recent tests. I'd tried to use some of the advice from [[ https://github.com/openconfig/gnmic/issues/4...
[23:16:26] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed