[03:50:18] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[06:00:13] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui)
[06:28:11] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui)
[07:15:18] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:45:18] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:50:18] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:41:51] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) >>! In T352974#9449656, @ABran-WMF wrote: > Maybe it also has something to do with: > >>>! In T352974#9441563, @ABran-WMF wr...
[09:56:00] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (ayounsi) >>! In T352893#9450804, @akosiaris wrote: > I 've been fearing this and started thinki...
[10:32:03] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[10:47:22] netops, Infrastructure-Foundations, SRE, Traffic: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 (cmooney) p:Triage→Medium
[11:01:08] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (cmooney) >>! In T352893#9452446, @ayounsi wrote: >>>! In T352893#9450804, @akosiaris wrote: >>...
[11:02:19] netops, Infrastructure-Foundations, SRE, Traffic, ops-codfw: Migrate lvs2013 and lvs2014 codfw row A-B connections to new switches - https://phabricator.wikimedia.org/T348218 (cmooney)
[11:03:41] netops, Infrastructure-Foundations, SRE, Traffic, and 2 others: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney) Open→Resolved All work completed on this, lvs2014 made active for several hours and no issues.
[11:08:37] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate mr1-codfw from asw-a1-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T348164 (cmooney) Traffic has now been re-routed over the new link. Old interfaces from mr1-codfw to asw-a1-codfw has been disabled, as have the sub-interf...
[11:54:11] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[12:40:13] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (ayounsi) > The problem remains that the switch name is not going to be enough to know what to...
[12:42:26] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (cmooney) >>! In T352893#9452969, @ayounsi wrote: > Yep, I mentioned it in the loooong Gerrit CR...
[13:07:56] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui)
[13:18:08] netops, Infrastructure-Foundations, SRE, Traffic: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 (ayounsi) > Once agreed it probably makes sense to remove profile::pybal::override_bgp_med from the puppet class, and replace it with some...
[13:36:18] netops, Infrastructure-Foundations, SRE, Traffic: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839 (cmooney) >>! In T354839#9453034, @ayounsi wrote: > On the implementation I'm wondering if instead of introducing a new BGP community, we...
[14:02:37] netops, Infrastructure-Foundations, SRE: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[14:25:53] SRE-tools, DBA, Infrastructure-Foundations, Puppet-Core, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (ABran-WMF) I've moved a bit further on the testing part. @MoritzMuehlenhoff showed me [[ https://github.com/ikapelyukhin/go-x509-issuer-...
[14:29:15] moritzm: how did you create the debmonitor-client repository? I don't see it mirrored on github
[14:29:26] I don't know what's triggering the mirroring nowadays
[14:38:18] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (cmooney) p:Triage→Medium
[14:38:57] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (cmooney)
[14:39:07] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (cmooney)
[14:39:15] netops, Infrastructure-Foundations, SRE: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[14:39:27] netops, Infrastructure-Foundations, SRE: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[14:39:35] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (cmooney)
[14:39:43] SRE-tools, Data-Persistence, Infrastructure-Foundations, Patch-For-Review: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 (cmooney)
[14:39:55] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (cmooney)
[14:40:03] netops, Infrastructure-Foundations, SRE: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[14:40:13] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 3 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (cmooney)
[14:46:10] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (cmooney)
[15:11:23] volans: via the Gerrit UI, but I have no idea how the github mirror works
[15:18:14] :/
[15:42:11] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (Volans)
[15:48:56] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (cmooney)
[15:54:11] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[15:55:29] netops, Infrastructure-Foundations, SRE: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (cmooney)
[16:06:02] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[16:11:47] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: get rid of WMF Production Patches - https://phabricator.wikimedia.org/T310717 (MatthewVernon)
[19:55:18] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[23:55:18] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk