[00:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:22:56] FIRING: MaxConntrack: Max conntrack at 84.6% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[00:27:55] RESOLVED: MaxConntrack: Max conntrack at 84.6% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[01:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[05:42:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:47:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:30:12] netops, Infrastructure-Foundations, SRE: Manage VRRP priority from Netbox - https://phabricator.wikimedia.org/T381873#10478784 (cmooney) Open→Resolved a:cmooney This is all complete and I've set priorities in Netbox to balance traffic from the 4 legacy rows in eqiad across the CRs there.
[10:44:56] netops, Infrastructure-Foundations, SRE: Improve Eqiad outbound traffic balance - https://phabricator.wikimedia.org/T384253#10478825 (cmooney) FWIW I have made the same change in codfw for routes learnt from eqord (Chicago). Locally-learnt routes will now be preferred unless the AS-Path from Chicago...
[11:01:23] SRE-tools, Infrastructure-Foundations, SRE, IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10478911 (MoritzMuehlenhoff)
[11:15:20] netops, Infrastructure-Foundations, SRE: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288 (cmooney) NEW p:Triage→Medium
[11:15:36] netops, Infrastructure-Foundations, SRE: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10478967 (cmooney)
[11:16:42] netops, Infrastructure-Foundations, SRE: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10478971 (cmooney)
[12:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:47] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10479262 (cmooney) So looking at a specific peer - 2620:0:863:1:198:35:26:6 on cr4-ulsfo - I can see the SNMP 'index...
[13:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:39:23] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10479515 (cmooney) >>! In T384258#10477783, @ssingh wrote: > Might be a red herring: The only thing I see that might...
[14:44:34] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10479889 (Volans) If I understand the db structure correctly that should convert into this query: ` select * from b...
[14:45:36] netops, Infrastructure-Foundations, SRE: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10479894 (RobH) @cmooney, I'm updating the order task, but this was delivered in December so I can open a remote hands to get it fixed. Do we need to schedule th...
[15:23:37] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10480120 (cmooney) Thanks @volans you have helped me a lot with this and given me confidence to look at the DB. I s...
[15:35:19] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10480208 (cmooney) It also appears we are getting values populated for AcceptedPrefixes for IPv6 peers for some devi...
[15:58:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:03:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:14:08] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10480398 (cmooney) Running the poller manually on netmon1003 I can also see it's getting the right value back, but i...
[16:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:09:35] netops, Infrastructure-Foundations, SRE: Configure gnmic to collect data from routers at network pops - https://phabricator.wikimedia.org/T384345 (cmooney) NEW p:Triage→Medium
[17:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[17:37:53] netops, Infrastructure-Foundations, SRE: Dec 2024: cr3-ulsfo errors on et-0/0/0 link from cr4 - https://phabricator.wikimedia.org/T384288#10480844 (cmooney) >>! In T384288#10479894, @RobH wrote: > I'm assuming we need to schedule it, and we should give them a couple days notice if we want a set sched...
[18:35:48] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10481199 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d0f01fc7-5a29-49c5-8292-aebad021ff73) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[19:43:52] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10481570 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=26b7dbb9-1906-4b10-a433-cc2ffb6bdb61) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[20:16:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:33:55] FIRING: MaxConntrack: Max conntrack at 83.04% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:39:52] RESOLVED: MaxConntrack: Max conntrack at 83.04% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:34:51] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[21:46:25] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:51:25] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed