[09:27:23] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[10:09:25] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[10:10:23] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[10:23:04] Traffic team - in Brandon's absence I was wondering could anyone give me a steer about the redundancy of our DNS infra/servers?
[10:23:28] dns1002 is in eqiad row D - which we are doing maintenance on next Tuesday (https://phabricator.wikimedia.org/T286069)
[10:23:39] dns1001 is in another row and won't be affected by that.
[10:24:12] My guess is, given how DNS works in general, that a short interruption to traffic (20 seconds or less we expect) won't cause a major issue?
[10:24:36] Or should we take some action in advance to de-pool it or otherwise (excuse lack of knowledge on correct terminology!)
[10:35:33] repooling one of the dns1* servers is fairly simple, let's err on the safe side and just depool them
[10:37:15] Sounds good - thanks for the input Moritz
[10:38:00] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[11:06:14] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[11:06:40] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[11:13:59] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[11:14:17] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[11:16:44] netops, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[11:17:16] Traffic, netops, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[11:42:00] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[11:45:59] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[12:14:26] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[12:14:38] netops, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[12:35:47] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[12:36:11] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[12:37:57] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[12:39:18] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[12:43:28] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[12:44:30] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[12:47:55] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[14:24:53] netops, DC-Ops, Infrastructure-Foundations, ops-codfw: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (RobH)
[14:26:23] netops, DC-Ops, ops-codfw: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (RobH)
[14:27:48] netops, DC-Ops, ops-codfw: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (RobH) Please note I'm not putting this request into CyrusOne until after Arzhel confirms they are ready for this step.
[14:33:05] netops, DC-Ops, ops-codfw: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (cmooney) First log of an issue was this sent from the master switch in the virtual-chassis: Jul 16, 2021 @ 13:15:55.000 %-SNMP_TRAP_LINK_DOWN: ifIndex 927, ifAdminStatus up(1), ifOperStatus down(2),...
[14:52:03] netops, DC-Ops, SRE, ops-codfw: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (RobH) I've opened support ticket 2022508 with CyrusOne to have them use remote hands to powercycle this. > The switch is a Juniper QFX5100-48S-6Q, labeled asw-a2-codfw, located in U26 (re...
[14:57:34] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Majavah)
[15:29:44] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (RobH) Remote hands has completed the powercycle of the switch (via removing all power cables). Both before and after power removal, all LEDs are illuminated, which is no...
[15:39:02] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (RobH) a:Papaul I'll attempt to summarize the IRC discussion. @ayounsi, @cmooney, and myself discussed how it is likely safer to let a single switch sit broken over t...
[15:41:49] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (RobH)
[16:11:03] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[16:15:35] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)