[00:34:17] (SystemdUnitFailed) firing: (5) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:34:17] (SystemdUnitFailed) firing: (4) krb5-admin-server.service Failed on krb2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:27:49] (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service Failed on sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:29:17] (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service Failed on sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:01:30] netops, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui)
[07:02:03] netops, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Marostegui) @jcrespo kindly check what is needed for backup involved hosts, thanks!
[07:29:36] netops, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (jcrespo)
[07:30:43] netops, DBA, Data-Engineering, Discovery-Search, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (jcrespo) >>! In T335042#8795210, @Marostegui wrote: > @jcrespo kindly check what is needed for backup involved hosts, thanks! Done.
[08:32:49] netops, Infrastructure-Foundations, SRE, observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi) Thanks for the quick reply! This now works: ` prometheus1006:~$ curl lsw1-e8-eqiad.mgmt.eqiad.wmnet:9100/metrics | wc -l 3412 ` I guess next step is to s...
[09:14:17] (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service Failed on sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:17:49] (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service Failed on sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:39:25] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations, Release-Engineering-Team, Jenkins: PCC runs failing with complaints about disk space - https://phabricator.wikimedia.org/T335111 (jbond)
[10:42:11] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations, Release-Engineering-Team, Jenkins: PCC runs failing with complaints about disk space - https://phabricator.wikimedia.org/T335111 (jbond) p:Triage→Medium
[11:39:59] SRE-tools, Infrastructure-Foundations, SRE, Traffic: cookbooks.sre.hosts.reimage should not fail if the first Puppet run failed and if the user was prompted - https://phabricator.wikimedia.org/T334880 (Volans) p:Low→Medium a:Volans
[13:19:17] (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service Failed on sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:35:23] Puppet, Infrastructure-Foundations, Wikidata, Wikidata-Query-Service, wdwb-tech: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (bking) a:bking
[17:19:17] (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service Failed on sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:20:13] Puppet, Wikidata, Wikidata-Query-Service, wdwb-tech: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (bking)
[18:34:17] (SystemdUnitFailed) firing: (4) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:47:28] Puppet, Wikidata, Wikidata-Query-Service, wdwb-tech, and 2 others: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (Gehel)
[19:34:17] (SystemdUnitFailed) firing: (4) httpbb_hourly_appserver.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:12:49] (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service Failed on sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:14:17] (SystemdUnitFailed) firing: (3) confd_prometheus_metrics.service Failed on sretest1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:37:49] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:34:17] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-web_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed