[01:14:25] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:40] RESOLVED: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:41:10] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:52:00] FIRING: [2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:10:56] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10212965 (Papaul)
[06:09:25] RESOLVED: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:55] FIRING: SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:52:00] FIRING: [2x] CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:54:55] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:46:10] FIRING: [2x] SystemdUnitFailed: routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:48:07] for the RPKI one, `/var/lib/routinator/repository` was full on 2003, but only at 67% on 1001, I cleared the directory and restarted the daemon, seems healthy now
[07:55:50] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10213162 (elukey) Found a new interesting issue when running the provision cookbook for mc-misc2001: ` "Message":...
[09:57:02] hey, people, it seems certmanager is having some issues, maybe due to recent k8s issues? https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=cert-manager&var-deployment=cfssl-issuer&orgId=1&from=now-12h&to=now
[10:13:09] moritzm: around? Do you know who I can ask about certmanager?
[10:13:33] there seems to be some ongoing availability/latency issues
[10:16:42] https://grafana.wikimedia.org/goto/-kssMXkNg?orgId=1
[10:18:13] no idea about this, best to ask the serviceops folks
[10:18:52] ah, my fault
[10:19:01] I was confusing this with the TLS service
[10:19:03] apologies
[10:19:15] this is k8s
[10:52:00] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:49:55] FIRING: SystemdUnitFailed: wmf_auto_restart_routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:54:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_routinator.service on rpki2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:45:34] Mail, Infrastructure-Foundations, SRE: Lisa@wikipedia.org is receiving a large number of donor responses - https://phabricator.wikimedia.org/T375643#10213939 (Aklapper) (Per T376798 I removed an image from this task.)
[12:54:42] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10213960 (ayounsi) Phase 2 lgtm, one point though : you need to trunk the management vlan between the old and new switch for fasw to be re...
[14:52:00] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[14:55:20] SRE-tools, DC-Ops, Infrastructure-Foundations, Spicerack, Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10214641 (elukey) Last issue worth to report is T371416#10214548. The backup1012 host seems to have a very old fir...
[15:06:47] slyngs: anything going on with IDP? people can't login to LibreNMS after the IDP screen (Error about being unauthorized)
[15:07:21] tried a fresh login as well
[15:16:44] httpd reports MOD_AUTH_CAS: INVALID_TICKET, referer: https://librenms.wikimedia.org/ on netmon1003
[15:17:18] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: codfw:frack:servers migration task - https://phabricator.wikimedia.org/T375151#10214744 (Papaul) @ayounsi thanks for the feedback
[15:19:05] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10214754 (Papaul) @cmooney thanks for the feedback for the migration let us work with the way it is setup for know and we...
[15:19:57] or moritzm? ^
[15:24:36] or slyngs --^
[15:24:53] looks like it's discussed in -private
[15:27:18] should be resolved
[15:28:39] thanks!
[16:30:30] netops, cloud-services-team, Cloud-VPS, Infrastructure-Foundations, and 2 others: dns: integrate PTR support for 2a02:ec80:a100::/48 - https://phabricator.wikimedia.org/T376462#10214998 (cmooney) The delegations for the 4 subnets used so far on the infra-side are working also: ` cmooney@cumin1002...
[18:52:30] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:26:01] FIRING: NTPNoSynced: NTP not synced on dns7001:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced
[19:31:01] RESOLVED: NTPNoSynced: NTP not synced on dns7001:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced
[19:44:01] FIRING: NTPNoSynced: NTP not synced on kafka-stretch2001:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced
[19:54:01] RESOLVED: NTPNoSynced: NTP not synced on kafka-stretch2001:9100 - https://wikitech.wikimedia.org/wiki/NTP - TODO - https://alerts.wikimedia.org/?q=alertname%3DNTPNoSynced
[22:52:30] FIRING: CertAlmostExpired: Certificate for service cloudidm2001-dev:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#cloudidm2001-dev:443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired