[01:30:12] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:12] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:25:12] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:21] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:15:32] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:23:49] moritzm: is that one for you ^ ?
[05:30:12] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:39:43] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (ayounsi) Hijacking this thread, @wiki_willy what do you think about making the LibreNMS port speed issue check to open a DC...
[05:49:02] netops, Infrastructure-Foundations, SRE, observability, Patch-For-Review: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi) Read only, there are already some "prometheus-*-expoter" images in https://docker-registry.wikimedia.org/ so it might just be a matt...
[06:09:18] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (ayounsi) @jbond an-tool1005 is reporting at 100M but https://gerrit.wikimedia.org/r/c/operations/alerts/+/908556/ didn't se...
[06:12:51] XioNoX: yeah, this is part of the "Debian stretch removal on Debian mirrors" cleanup work, should resolve later this week
[06:13:19] moritzm: cool, is it possible to silence the alert?
[06:56:54] I've silenced it now for six hours, but it's not really great process to silence these, this warning needs to be squashed in two places
[06:57:42] and silencing the systemd generic alert also means that for the next six hours it won't alert on other, unrelated systemd failures on build2001
[07:05:12] (SystemdUnitFailed) firing: (5) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:05:27] (SystemdUnitFailed) firing: (5) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:40:14] moritzm: is that build2001 trolling us? ^
[07:41:01] I think there were some discussions to split the alert per systemd service instead for better granularity
[07:41:35] hmmh, docker-reporter-k8s-images should be fixed by John's patch from yesterday, having a look
[07:45:21] ugh, there's a lot of images which are based on stretch, but don't have stretch in the image name
[07:46:02] we'll need some policy/consistent naming scheme, but in the interim that means some whack-a-mole of random images excluded from the reporter, I'll make a patch
[09:15:12] (SystemdUnitFailed) firing: (2) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:23:19] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (jbond) >>! In T333007#8803629, @ayounsi wrote: > @jbond an-tool1005 is reporting at 100M but https://gerrit.wikimedia.org/r...
[10:38:01] netbox, DC-Ops, Infrastructure-Foundations, Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (jbond) >>! In T333007#8804112, @jbond wrote: >>>! In T333007#8803629, @ayounsi wrote: >> @jbond an-tool1005 is reporting at...
[11:30:12] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:09] netops, Infrastructure-Foundations, SRE, observability: Alertmanager rule for network interface errors? - https://phabricator.wikimedia.org/T335350 (cmooney) p:Triage→Low
[11:53:06] netops, Infrastructure-Foundations, SRE, observability: Alertmanager rule for network interface errors? - https://phabricator.wikimedia.org/T335350 (ayounsi) FYI we do alert on those on the network side, see "Inbound interface errors" and "Outbound interface errors" on https://librenms.wikimedia....
[12:23:56] Puppet, Infrastructure-Foundations, SRE, User-jbond: document puppet/netbox/hiera interaction - https://phabricator.wikimedia.org/T311304 (jbond) Open→Resolved updated, please re-open if anything needs adding/clarifying
[12:24:04] Puppet, Infrastructure-Foundations, SRE, observability, and 2 others: Puppet: get data (row, rack, site, and other information) from Netbox - https://phabricator.wikimedia.org/T229397 (jbond)
[13:00:12] (SystemdUnitFailed) firing: docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:12] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:12] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:20:12] (SystemdUnitFailed) firing: docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:42:06] netops, Infrastructure-Foundations: Adjust routing policy to increase SSH session speed from East Asia to toolforge - https://phabricator.wikimedia.org/T334530 (BCornwall)
[22:20:12] (SystemdUnitFailed) firing: docker-reporter-base-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed