[00:02:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 5.31% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:03:36] (SystemdUnitFailed) firing: docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:03:36] (SystemdUnitFailed) firing: docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:36] (SystemdUnitFailed) firing: (2) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:37] (SystemdUnitFailed) firing: (2) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:37:04] netops, Infrastructure-Foundations, SRE: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (ayounsi) Isn't OSPF required there to benefit from the end to end link cost calculations (eg. draining a transport link)?
[07:46:03] FYI I'll be running some cookbook against sretest hosts to test spicerack 8.0.0
[08:04:56] why all 3 sretest are on bookworm?
[08:12:41] netops, Infrastructure-Foundations, SRE: Bring Juniper switches in eqiad racks E5-7 and F5-7 online and ready for servers - https://phabricator.wikimedia.org/T334230 (ayounsi)
[08:12:49] netops, Infrastructure-Foundations, SRE: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[08:12:57] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (ayounsi) Resolved→Open a:cmooney→Jclark-ctr I can't get the links to the Dell switches up, only looking at lsw1-e8 for now it seems li...
[08:16:05] netops, Infrastructure-Foundations, SRE: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi)
[09:36:52] i have created T349176 which relates to some misrouted alerts e.g. httpbb_*, please add any other instances or context i may have missed (cc volans)
[09:36:53] T349176: Route systemd unit alerts to the correct team - https://phabricator.wikimedia.org/T349176
[09:37:14] thanks
[09:37:22] <3
[09:38:19] volans: the bookworm thing is a mistake
[09:38:32] i'll reimage sretest1001 unless it's useful for you to test
[09:38:48] I could use a reimage for testing, so I can do it
[09:39:14] ack
[09:49:39] jbond: so sretest1001, reimage to bullseye? puppet 5 or 7?
[09:51:16] volans: bullseye puppet7
[09:51:22] ack thx
[09:51:35] volans: i think it also means i may not have tested that pathway
[10:18:37] (SystemdUnitFailed) firing: docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:29:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:48:09] SRE-tools, DNS, Infrastructure-Foundations, SRE, and 2 others: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (ayounsi)
[12:57:10] SRE-tools, DNS, Infrastructure-Foundations, SRE, and 2 others: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (ayounsi) Let's move all the A/AAAA SVC records to Netbox. And keep the CNAMEs in the DNS repo if we can't get rid of them. Then have follow up tasks t...
[13:30:02] Puppet, SRE, Patch-For-Review, User-jbond: Extend Puppet CA Expiry date - https://phabricator.wikimedia.org/T236277 (fgiunchedi)
[13:30:08] Puppet, Observability-Alerting, SRE, Patch-For-Review, User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (fgiunchedi) Open→Declined
[13:39:00] netbox, Infrastructure-Foundations, IPv6, User-jbond: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK) - https://phabricator.wikimedia.org/T253173 (fgiunchedi)
[13:40:16] netbox, Infrastructure-Foundations, IPv6, User-jbond: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK) - https://phabricator.wikimedia.org/T253173 (fgiunchedi)
[13:59:49] volans: fyi sretest1001 is now back as bullseye and puppet7
[14:00:41] yay, thx
[14:18:37] (SystemdUnitFailed) firing: docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:29:35] jbond: I did some styling for the PCC html output https://phabricator.wikimedia.org/F38608194 :)
[14:31:30] hashar: nice, i'll take a look and try to do a release tomorrow
[14:41:26] \o/
[14:44:36] jbond: maybe each of the changes can attach an entry to the changelog?
[14:45:10] I can `git rebase -i` to edit each of them and then add an entry, that is quite easy
[14:45:28] this way `git blame` will point to the proper commit which might be handy
[14:45:43] hashar: that would be great yes
[14:45:55] doing :)
[14:45:58] thx <3
[14:46:08] i'll test it out properly tomorrow and do a release
[14:46:12] and I have another independent change I will rebase on top of my series to avoid a conflict in CHANGELOG
[14:46:18] sgtm
[15:45:48] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Cabling for Eqiad racks E5-8 and F5-8 - https://phabricator.wikimedia.org/T334231 (Jclark-ctr) Unsure if port is turned off or if fs dell optics are not compatible. I put loopback on optic in dell switch and link did not come up
[17:25:14] netops, Infrastructure-Foundations, SRE: Automate L3 Switch to Core Router BGP peerings (and remove OSPF on drmrs switches) - https://phabricator.wikimedia.org/T349125 (cmooney) >>! In T349125#9260678, @ayounsi wrote: > Isn't OSPF required there to benefit from the end to end link cost calculations (...
[17:54:26] (SystemdUnitFailed) firing: (2) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:26] (SystemdUnitFailed) firing: (2) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:18:45] (SystemdUnitFailed) firing: (2) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:38] (SystemdUnitFailed) firing: (2) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:47:13] (DiskSpace) firing: Disk space puppetmaster1001:9100:/ 5.937% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[23:34:26] (SystemdUnitFailed) firing: docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed