[06:59:19] (SystemdUnitFailed) firing: update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:38:14] netops, Infrastructure-Foundations, SRE: Firewall filter blocking traceroute in underlay QFX5120 EVPN - https://phabricator.wikimedia.org/T348120 (ayounsi) Thinking a bit more about that, as the loopback is already on a private IP it can't be targeted directly, and packets with TTL=1 being sent to th...
[08:59:19] (SystemdUnitFailed) resolved: update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:00:31] (SystemdUnitFailed) firing: update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:59:19] (SystemdUnitFailed) resolved: update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:33:47] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (Clement_Goubert) Open→Resolved >>! In T352893#9471788, @ayounsi wrote: > Nice !! > > T...
[10:33:55] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (Clement_Goubert)
[10:34:05] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) In progress→Resolved
[10:34:13] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (Clement_Goubert)
[11:29:19] (SystemdUnitFailed) firing: prometheus-ganeti-exporter.service Failed on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:34:19] (SystemdUnitFailed) resolved: prometheus-ganeti-exporter.service Failed on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:32] (SystemdUnitFailed) firing: networking.service Failed on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:33:55] moritzm: not sure what the above alert is about? checking on the host it seems happy, main interface and vlan subs are UP and working?
[13:34:39] actually I see in the logs it's failing to enable the 'private' bridge, I think cos it exists already
[13:35:20] I'm unsure why, /etc/network/interfaces seems ok
[13:37:15] moritzm: I think it may be the lack of indentation on the "up ip addr add 2620:0:861:102:10:64:16:33/64 dev eno12399np0" line
[13:37:48] so it's not bound to the interface definition, that's the only thing I see that's out of place
[13:39:00] I've fixed that now, we should reboot and see if that sorts it
[13:41:51] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Gehel)
[14:03:16] topranks: mysterious! it would be surprising if indentation would matter, having a look in a bit
[14:03:49] I think the indentation does matter - without being under "iface private" the "up xxxx" command isn't tied to that interface
[14:04:13] the *amount* of indentation doesn't matter, but there was none at all before that command prior to my edit
[14:04:36] did you reboot yet? then we can confirm
[14:06:01] no pad for today's meeting?
[14:06:22] moritzm: no I didn't know the status so didn't reboot, I'll do that now if you confirm it's ok ?
[14:07:49] I've created it
[14:08:00] topranks: ack, let's reboot to find out
[14:08:35] ok doing now
[14:10:32] (SystemdUnitFailed) resolved: networking.service Failed on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:15:20] mortizm: I rebooted again - service seemed happy that time but that "ip addr add" command for the v6 address had the physical dev referenced, not 'private' bridge like it should
[14:15:30] moritzm: even
[14:15:32] (SystemdUnitFailed) firing: networking.service Failed on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:19] (SystemdUnitFailed) resolved: networking.service Failed on ganeti1036:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:41] topranks: ah yes, good catch. thanks
[14:20:57] ok yep it's happy now
[14:21:42] really gotta figure out that interface naming and push this from netbox waay too much hassle
[14:22:49] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney) p:Triage→Medium
[14:24:20] yeah, that would be good to simplify eventually
[14:25:05] we're a good part of the way there, but one of those things that doesn't bear fruit until all the bits are in place
[14:46:41] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (cmooney) p:Triage→Medium
[14:49:51] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (cmooney)
[14:49:59] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[15:35:43] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (Jhancock.wm) @cmooney I think that's doable. I'll block out my schedule for it.
[15:58:08] SRE-tools, Infrastructure-Foundations: Cookbook for rack depool - https://phabricator.wikimedia.org/T327300 (joanna_borun) p:Triage→Medium
[16:00:18] SRE-tools, Infrastructure-Foundations: Add warning when provision cookbook is ran without the virtualization flag on hypervisors - https://phabricator.wikimedia.org/T344342 (joanna_borun) p:Triage→Medium
[16:16:21] SRE-tools, Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - https://phabricator.wikimedia.org/T345375 (Volans) Open→Declined Declining because of inactivity and unclear line of action due to the opposed views. Feel free to re-open if you feel li...
[16:19:45] netbox, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: Netbox device location information not available on the first Puppet run of a device - https://phabricator.wikimedia.org/T347375 (joanna_borun) a:cmooney
[16:21:42] Packaging, Infrastructure-Foundations: Build and package gnmic - https://phabricator.wikimedia.org/T347461 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[18:21:23] SRE-tools, Infrastructure-Foundations: Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187 (Volans) Debugging this it seems that this was caused by a race condition in which `run-puppet-agent` check passed and said that puppet was not running but by...
[23:46:40] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (Papaul) @cmooney can we get those 2 hosts back in decom? Thanks