[02:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[03:16:26] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:16:26] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:59] 10SRE-tools, 06Infrastructure-Foundations, 06SRE, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10491430 (10MoritzMuehlenhoff)
[09:43:10] 10CAS-SSO, 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Unable to log in to Netbox - https://phabricator.wikimedia.org/T373702#10491451 (10Volans) >>! In T373702#10490265, @taavi wrote: > After being blocked on this and a few questions and pings on IRC with no response I've decided to fix th...
[09:43:48] 10netops, 06Infrastructure-Foundations, 06SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10491452 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=43ff15dd-e256-46b3-aea6-882240b9fe64) set by cmooney@cumin1002 for 1:00:00 on 1 host(s) and th...
[10:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:48:23] 10CAS-SSO, 06Infrastructure-Foundations, 10Phabricator: Phabricator should use IDP for developer account logins - https://phabricator.wikimedia.org/T377061#10491654 (10Aklapper)
[11:18:37] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:37:55] FIRING: MaxConntrack: Max conntrack at 81.65% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[11:42:56] RESOLVED: MaxConntrack: Max conntrack at 81.65% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[11:43:37] FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:01:38] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: Check link from msw1-eqiad et-0/1/0 to msw2-eqiad et-0/1/0 - https://phabricator.wikimedia.org/T384708 (10cmooney) 03NEW p:05Triage→03Low
[13:55:32] slyngs, elukey: I don't suppose either of you know how to get a new script to show in netbox "custom scripts"?
[13:56:02] I just merged a patch to the netbox-extras repo with a new one, and ran the cookbook to update the netbox hosts
[13:56:05] but it's not showing
[13:56:39] I know this changed with the version 4 upgrade, and patches to existing scripts work ok with the process I've used, but there must be something else needed to make it aware of the new file
[13:57:17] you can add one with the "+" in the GUI and upload it, which is how we test them; I'm not sure how we do the equivalent programmatically
[14:08:37] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:12:19] and now somehow netbox-next is broken cos I restarted the service, even after a VM reboot
[14:12:22] this week hates me
[14:13:56] topranks: o/
[14:14:33] so IIRC there was a manual sync to do, but I don't recall 100%.. do we have anything written in wikitech?
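(Aside: in NetBox 4, custom scripts live in a data source, and a newly added file only shows up after that data source is re-synced; the thread below resolves this with a manage.py syncdatasource call. A minimal sketch of the REST-API equivalent, assuming a NetBox 4 instance and an API token with data-source sync permission - the instance URL and token are placeholders:)

  import requests

  NETBOX = "https://netbox-next.wikimedia.org"  # placeholder instance URL
  HEADERS = {"Authorization": "Token REDACTED", "Accept": "application/json"}  # placeholder token

  # Look up the data source that holds the custom scripts ("Netbox extras" in this thread).
  resp = requests.get(f"{NETBOX}/api/core/data-sources/", headers=HEADERS,
                      params={"name": "Netbox extras"}, timeout=30)
  resp.raise_for_status()
  ds = resp.json()["results"][0]

  # Ask NetBox to re-sync it, which picks up newly added script files.
  resp = requests.post(f"{NETBOX}/api/core/data-sources/{ds['id']}/sync/",
                       headers=HEADERS, timeout=30)
  resp.raise_for_status()
  print("sync requested for data source", ds["name"])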
[14:14:57] I couldn't find anything, no, nor in Phabricator
[14:16:01] there is a systemd timer for "netbox_housekeeping.service" which runs every 24h, not sure if maybe that does something
[14:17:34] ^^ nah this is just cleaning up expired auth sessions and stuff
[14:17:54] topranks: PermissionError: [Errno 13] Permission denied: '/srv/deployment/netbox/current/src/netbox/netbox/configuration.py'
[14:18:07] on netbox-next ?
[14:18:11] that's my fault damn
[14:18:14] this is on netbox-dev2003:/srv/log/netbox/main.log
[14:18:22] ok cool, been trying to find the logs
[14:18:25] thanks <3
[14:20:07] woohoo at least netbox-next is working again :)
[14:20:11] once you have fixed it we can check how to refresh the extras, though I have a memory of reading something on wikitech
[14:20:25] or maybe it was Arzhel telling me how to do it :D
[14:20:46] yeah Arzhel told me too, or at least the basics; there were some tricks to make it work
[14:20:57] but I can't remember, and then he abandoned us :(
[14:23:30] topranks: https://wikitech.wikimedia.org/wiki/Netbox#More_details
[14:23:37] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:24:10] elukey: wow how did I miss that
[14:24:13] thank you :)
[14:24:19] sounds easy let's see :P
[14:24:46] I can't find the sync though :D
[14:25:23] but now that I see https://netbox-next.wikimedia.org/extras/scripts/add/ I seem to recall Arzhel telling me to just select Netbox extras as the data source, and the new file right below
[14:25:59] ok
[14:26:06] I ran this manually already
[14:26:07] sudo runuser -u www-data /srv/deployment/netbox/venv/bin/python3 /srv/deployment/netbox/deploy/src/netbox/manage.py syncdatasource "Netbox extras"
[14:26:48] is https://phabricator.wikimedia.org/T379072 still broken?
[14:27:07] cool, done!
[14:27:30] elukey: yeah that's still broken, that's why I was messing with configuration.py and got the error
[14:27:51] should be easy to fix I think, just need to remove the server_time var, or replace it with something else
[14:28:18] okok, I'll try to check later if I can think of any trick to make it work; as you mentioned, it's weird that it happens only now
[14:30:52] (meeting, I'll read with some lag)
[14:32:29] cool, it's no problem... I was able to verify everything we need from the cookbook actually runs successfully, we just get that error
[15:37:20] topranks: do you happen to know if there are any plans/tasks on making those pesky BGP session errors silenceable per session/endpoint?
[15:38:40] It's probably way better now given we could downtime the checks for a ToR in some cases ... but it still does not feel right to silence all of them while expecting one to go away
[15:39:19] these guys?
[15:39:21] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4
[15:39:39] yeah we should probably do something alright
[15:40:17] yeah, those
[15:40:25] The proper way to do it would probably be for the cookbook to unset the 'bgp' flag for the server in Netbox, and then run homer against the switch/router, which would remove the session config
[15:40:26] happens all the time for k8s maintenance
[15:40:30] then re-add it on the other side
[15:40:35] uuh...
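(Aside: to make the 15:40:25 proposal concrete, a rough sketch of how a cookbook might flip the flag and push the config, assuming "bgp" is a boolean custom field on the Netbox device and using the pynetbox client; the hostnames, homer target, and commit message are illustrative, not the exact production workflow:)

  import subprocess

  import pynetbox

  nb = pynetbox.api("https://netbox.wikimedia.org", token="REDACTED")  # placeholder token

  def set_bgp_flag(hostname: str, enabled: bool) -> None:
      # Assumes the per-server BGP session config is driven by a "bgp" custom field.
      device = nb.dcim.devices.get(name=hostname)
      device.custom_fields["bgp"] = enabled
      device.save()

  def push_router_config(target: str, message: str) -> None:
      # Regenerate and commit the device config so the session is removed/re-added.
      subprocess.run(["homer", target, "commit", message], check=True)

  set_bgp_flag("kubernetes1005", False)  # hypothetical worker name
  push_router_config("cr*eqiad*", "remove BGP session for reimage")  # illustrative target
  # ... reimage / maintenance happens here ...
  set_bgp_flag("kubernetes1005", True)
  push_router_config("cr*eqiad*", "restore BGP session after reimage")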
[15:40:42] that sounds pretty invasive
[15:40:55] or have something that just reaches out to the router and deactivates the peering session for the duration
[15:41:25] but isn't it okay for a session to be down for a bit?
[15:41:40] the logic of alerting on them being down says no
[15:42:01] I think I was asking about changing the logic :D
[15:42:11] but perhaps it is, maybe we shouldn't alert on the K8s sessions being down. Or maybe we should only do so if they are down for more than some length of time
[15:42:57] ...or be able to downtime them together with the host
[15:43:06] that's what I wanted to suggest
[15:43:08] the issue there is that it's hard to filter an alert about a specific peer separately from the others on a router
[15:43:32] like I say the CRs are too critical to the overall health of things for us to downtime them regularly cos one k8s host is being reimaged
[15:43:37] FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:44:44] I agree. But given the alerts are still based on icinga I thought there might already be plans to move them to prometheus and make them 'better' in the process
[15:45:19] better in terms of more fine-grained, so we can downtime particular sessions
[15:45:53] it might be possible
[15:46:34] topranks: re: netbox-next - you mentioned that you ran "manage.py syncdatasource" manually
[15:46:40] I've been having something of a nightmare all week trying to build a different metrics path for them away from Icinga, so it probably won't be too soon
[15:46:42] did it fail like in the cookbook?
[15:46:42] https://phabricator.wikimedia.org/T369384#10483855
[15:47:06] elukey: yes, and then I changed "configuration.py" manually to test, and removing "server_time" worked
[15:47:14] ahhh okok got it
[15:47:39] uuh, wall of text 👀📖
[15:47:56] yeah the TL;DR is I can't waste more weeks on it right now ;P
[15:48:04] hrhr, okay
[15:48:26] The simplest way forward would be to not treat K8s BGP sessions as "high priority"
[15:48:37] i.e. remove the ASN here
[15:48:38] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/nagios_common/files/check_commands/check_bgp.cfg#3
[15:49:01] but that is probably not good as it leaves us without alerting when there are issues (we could check from the host if that were possible)
[15:49:27] alternatively, the best approach is going to be to disable the sessions router-side as part of the workflow
[15:49:40] which shouldn't be too hard; we just need to work out the right way to do it
[15:50:19] we can maybe talk to o11y also to see if they've any ideas
[15:50:42] right now the Icinga check doesn't report the IP/hostname of the down sessions, which makes it hard to compare to any list of "downtimed" hosts. perhaps we could work on that
[15:54:04] jayme: sorry I don't have good/simple answers here. I'll open a task on it so we can weigh up the options, and maybe get some input from o11y
[15:54:44] if that's easier to do than moving away from icinga I would prefer that. Fiddling with netbox and homer is pretty time-consuming (even in cookbooks), and it's not uncommon for k8s workers to get depooled or worked on - and there are many of them
[15:55:02] (extending the icinga report)
[15:55:51] absolutely fine! I was not expecting a solution or simple answer (if there were one, we would probably have implemented it already :)).
[15:57:01] We just had so many of them during the mass-rename-reimage-movevlan sessions that I thought they'd wear off and people would start to not take them seriously
[15:57:44] yeah that's a fair point. Tbh I am suffering from that too
[15:57:51] monitoring them from the k8s side would also be an option I suppose
[15:58:00] yeah, same here
[15:58:07] and conversely if the maintenance were less common we could maybe downtime the CRs, but they happen too often for us to do that I think
[15:58:47] there is an outside chance I'll get the gnmi -> prometheus stats export working in the next few weeks, which would give us a lot more options
[16:01:29] you'll have two 12-hour flights of time soon :p
[16:04:57] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:19:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:20:29] 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware needs work - https://phabricator.wikimedia.org/T384722#10492747 (10Andrew)
[16:21:53] 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware needs work - https://phabricator.wikimedia.org/T384722#10492754 (10Andrew) ` andrew@cumin1002:~$ sudo cookbook sre.hardware.upgrade-firmware --new --c nic 'cloudcephosd1013.eqiad.wmnet' Acquired lock for key /spicerack/locks/cookbooks/sr...
[16:34:08] 10SRE-tools, 06Infrastructure-Foundations: sre.hardware.upgrade-firmware needs work - https://phabricator.wikimedia.org/T384722#10492840 (10Andrew) 05Open→03Invalid papaul just tried and it worked for him, so maybe I was doing something silly? The usage statement still needs work but I can probably fi...
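(Aside: returning to the per-session silencing question above - once the alerts carry per-peer labels, e.g. via the gnmi -> prometheus export topranks mentions, a per-session downtime becomes a single Alertmanager silence. A minimal sketch against the standard Alertmanager v2 API; the alert name and label (BGPSessionDown, bgp_peer) are assumptions, not current production labels:)

  from datetime import datetime, timedelta, timezone

  import requests

  ALERTMANAGER = "https://alerts.wikimedia.org"  # placeholder endpoint

  def silence_bgp_peer(peer: str, hours: float, author: str, comment: str) -> str:
      # Build an Alertmanager v2 silence matching a single BGP peer.
      now = datetime.now(timezone.utc)
      silence = {
          "matchers": [
              {"name": "alertname", "value": "BGPSessionDown", "isRegex": False, "isEqual": True},
              {"name": "bgp_peer", "value": peer, "isRegex": False, "isEqual": True},
          ],
          "startsAt": now.isoformat(),
          "endsAt": (now + timedelta(hours=hours)).isoformat(),
          "createdBy": author,
          "comment": comment,
      }
      resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=silence, timeout=30)
      resp.raise_for_status()
      return resp.json()["silenceID"]

  # e.g. silence_bgp_peer("kubernetes1005", 2, "jayme", "reimage, T384731")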
[17:06:55] FIRING: MaxConntrack: Max conntrack at 81.41% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:16:56] RESOLVED: MaxConntrack: Max conntrack at 83.21% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[17:56:03] 10netops, 06Infrastructure-Foundations, 10observability, 06SRE: Prevent BGP alerts triggering when K8s host maintenance is being done - https://phabricator.wikimedia.org/T384731 (10cmooney) 03NEW p:05Triage→03Low
[18:08:37] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:23:37] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[19:43:37] FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:13] FIRING: [3x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:08:37] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:23:37] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange