[01:24:15] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:28:34] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:48] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) @papaul looks good to me. I can do them any day this week except today (Tuesday), so whenever...
[08:38:00] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) >>! In T327919#8732605, @cmooney wrote: > > @aborrero are we ok to proceed with theis second...
[08:55:33] netbox, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) >>! In T296832#8729881, @Volans wrote: > Looks ok to me too, I'm no sure about all the details involved if w...
[08:58:07] netops, Analytics-Radar, Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (aborrero) Resolved→Open This happened to me today in a couple of hardware servers, see {T333281} and {T333282}.
[09:03:17] netbox, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) >>! In T296832#8729881, @Volans wrote: > Looks ok to me too, I'm no sure about all the details involved if w...
[09:09:01] netops, Analytics-Radar, Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (cmooney) @aborrero do you have more details on what happened with those? I'm not sure the symptoms are the same. In the Ganeti case the hyperviso...
[09:11:44] netops, Analytics-Radar, Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (aborrero) >>! In T273026#8732900, @cmooney wrote: > @aborrero do you have more details on what happened with those? > > I'm not sure the symptoms...
[09:26:38] netops, Infrastructure-Foundations, SRE: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (cmooney) p:Triage→Medium
[09:26:51] netops, Infrastructure-Foundations, SRE: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (cmooney)
[09:26:58] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[09:28:34] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:35:02] netops, Infrastructure-Foundations, SRE: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (cmooney) Logs from switch at during operation: `lines=20 Mar 28 09:28:50 cloudsw1-b1-codfw sshd[11342]: WARNING: could not open /etc/ssh/moduli...
[09:36:17] netops, Infrastructure-Foundations, SRE: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (cmooney)
[09:56:35] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:12:32] i have sent a CR so that warnings will only repeat once per week https://gerrit.wikimedia.org/r/c/operations/puppet/+/903615 but open to other tuning options e.g. shorter/longer time, don't send warnings here, something else ...
[10:32:34] netops, Analytics-Radar, Infrastructure-Foundations: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (cmooney) >>! In T273026#8732916, @aborrero wrote: > I don't know exactly what happened. > > My hunch is that the systemd service has been in faile...
[10:42:12] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (cmooney) @ayounsi thanks for the response. Overall I've no objection so let's proceed. I agree in terms of addin...
[10:46:12] netops, Infrastructure-Foundations, SRE, observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (cmooney) Thanks for the task, does indeed look like a useful tool that could simplify adding additional monitoring without having to modify the LibreNMS codeb...
[10:46:29] netops, Infrastructure-Foundations, SRE, observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (cmooney) a:cmooney
[11:23:44] (SystemdUnitFailed) firing: (10) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:56:24] jbond: yeah it depends on how noisy it will be
[11:56:44] those are all -test so hopefully they will never alert
[12:41:00] netops, Infrastructure-Foundations, SRE, observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (fgiunchedi) I took a quick look at the exporter and looks good to me too! Also +1 on the general testing/deployment plan re: SSH from a quick read through th...
[14:13:34] (SystemdUnitFailed) firing: (10) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:35] (SystemdUnitFailed) firing: (12) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:18:35] (SystemdUnitFailed) firing: (16) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:23:59] (SystemdUnitFailed) firing: (18) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:25:26] (SystemdUnitFailed) firing: (18) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:28:34] (SystemdUnitFailed) firing: (18) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:42] (SystemdUnitFailed) firing: (18) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:36:50] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) @cmooney can we do this on Thursday ? Can we also do the other batches(3-4) on the same day?
[14:38:44] (SystemdUnitFailed) firing: (18) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:39:38] godog: any idea why the above alert is being so spammy. looks like exactly the same message fired 5 times in the last 15 mins?
[14:40:38] jbond: it is a group of 18 alerts, though only the text from the first is displayed
[14:41:41] i.e. these guys https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed&q=team%3Dinfrastructure-foundations&q=%40state%3Dactive
[14:41:48] godog: i thought that the fact that this is grouped meant it only fires once for the whole group. looking at the output there is no way to determine the difference between e.g. the message from 14:29 and the one from 14:38
[14:43:29] (SystemdUnitFailed) firing: (17) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:43:31] there's another case for re-fire, namely if another alert comes in for the same group, which I suspect is what's happening here
[14:44:42] (SystemdUnitFailed) firing: (16) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:44:54] godog: all the five between 14:23 and 14:38 have "firing: (18)", so it doesn't seem like anything changed unless one recovered and one failed for each of those triggers, which seems a little unlikely (but possible)
[14:45:38] mmhh you are right, checking
[14:46:43] * jbond wonders if we get a refire for every state change. i.e. check fails at 12:00 so we have firing (1) with a refire at 16:00. second check fails at 12:10 so we have firing 2 with a re-alert at 16:10. if nothing changes we get two refires at 16:00 and 16:10 that look exactly the same
[14:46:48] godog: ^^
[14:48:44] (SystemdUnitFailed) firing: (12) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:48:51] heh that's a good theory too, I'm looking at this https://logstash.wikimedia.org/goto/34770ebf18ab0b7ea78fb27b89225d3c
[14:49:57] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[14:50:15] * jbond also looking at the puppet agent issues
[14:50:40] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[14:50:56] ack
[14:52:17] thanks jbond!
[14:53:12] godog: i think we can state that the widespread puppet alert is firing correctly ;)
[14:54:32] heh, I mean it is working as intended but since we're grouping per cluster and team the results are spammy for sure
[14:55:19] I'm taking a look btw
[14:55:25] yes very true
[14:55:27] thanks
[14:59:36] I think it is fair to say we'd rather alert on widespread puppet failures on a site instead
[15:00:29] godog: ideally we would have one for when we have puppet failure fleet wide and one per cluster, with the former automatically shadowing the latter
[15:00:55] the cluster one is useful as sometimes there is an issue in e.g. just the cassandra class
[15:02:38] yeah that's fair, ok my current thinking is: the site-wide can be critical while cluster-wide is a warning and a bit more lax in terms of how long a failure has to persist
[15:03:41] godog: osgtm
[15:03:47] godog: *sgtm
[15:22:42] need to go afk now but patch is https://gerrit.wikimedia.org/r/c/operations/alerts/+/903687
[15:23:28] netops, Infrastructure-Foundations, Observability-Alerting: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (herron) p:Triage→Medium
[15:55:22] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[15:55:38] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[16:02:24] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (ayounsi) See guidelines on https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines#Servers_uplinks but it's usually not worth it. We only...
[17:10:37] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (herron) Open→Declined Thanks, fwiw I added a talk topic on wiki in hopes that link redundancy can be explored the next time switch upgrades/...
[18:38:54] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (cmooney) Yeah I tend to agree, with one top-of-rack switch two connections only protects against link failure (as they both land on the same switch)...
[18:48:29] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:13:49] netops, Infrastructure-Foundations, Observability-Alerting, SRE: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (herron) >>! In T333371#8736041, @cmooney wrote: > In the case of a server failure do the alert hosts fail over? Not automatically at the present...
[19:29:25] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! In T327919#8735024, @Papaul wrote: > @cmooney can we do this on Thursday ? Can we also do...
[20:39:42] Mail, Infrastructure-Foundations, Observability-Logging, SRE-Sprint-Week-Sustainability-March2023, and 2 others: Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171 (lmata)
[20:41:16] Mail, Infrastructure-Foundations, Observability-Alerting, SRE-Sprint-Week-Sustainability-March2023, and 2 others: Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172 (lmata)
[22:48:29] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed