[05:15:13] netops, DC-Ops, Infrastructure-Foundations, SRE: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (Aklapper) a:RobH→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee o...
[05:22:44] Mail, Infrastructure-Foundations, SRE, Patch-Needs-Improvement: Disavow emails from wikipedia.com - https://phabricator.wikimedia.org/T184230 (Aklapper) a:herron→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to...
[10:11:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[11:39:18] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[11:40:10] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[11:41:21] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[11:42:08] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[11:42:51] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[11:43:02] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[12:09:40] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[12:13:42] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[12:40:25] netops, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) Thanks for the ping - this needs some thought from the DB side. We have some of our misc db masters on row A - db1159 m1 A6. Affected services: bacula (...
[12:43:04] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) @Bstorm @nskaggs please see above - we might need to depool the affected clouddb* hosts.
[12:44:03] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) dbproxy1013 is the active proxy for m2. I will depool it next week.
[12:47:52] netops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi)
[13:00:33] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney) Hi @Marostegui thanks for the feedback. > Will this stop traffic on all switches at the same time? Or do you plan to d...
[13:05:09] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (fgiunchedi) With my observability and swift maintainer hats on, I think we're ok to tolerate a network blip, specifically: * ms-be...
[13:08:39] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui)
[13:08:59] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) >>! In T286032#7193704, @cmooney wrote: > Hi @Marostegui thanks for the feedback. > >> Will this stop traffic on al...
[13:14:32] netops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (ayounsi) The proper fix is T263277. However there are 2 options to get data quickly and temporarily: The easiest and "cleanest"...
[13:15:24] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) @cmooney do you know when you'll know how long this change can take?
[13:19:44] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (BBlack) Traffic-related bits: * dns1001 will need a manual depool so that it doesn't have knock-on effects on all of the other clus...
[13:20:38] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (ayounsi) @Marostegui "as the standby host is on row A too" that sounds like SPOF to me and should be moved to a different row. Due...
[13:21:33] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) >>! In T286032#7193787, @ayounsi wrote: > @Marostegui "as the standby host is on row A too" that sounds like SPOF to me...
[13:27:55] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney) @Marostegui Our only real option to test is on new switches due to be installed under T277340. We are working with DC-Ops...
[13:29:23] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) Ok, I think what we can do from our side is to get the replacement hosts ready but without failing over things to them,...
[13:29:53] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (jcrespo) Speaking on behalf of: ` dbprov1001 ms-backup1001 db1116 ` That could cause ongoing backup runs to fail, but that is "norm...
[13:49:20] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[14:55:15] netops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi) Agreed option 1 seems easier and safer than option 2, the sampling isn't great but not the end of the world if we're...
[15:01:03] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (fgiunchedi)
[15:15:54] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[15:16:56] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[15:40:11] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[15:41:33] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:02:41] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[16:10:56] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm) @Andrew Just a heads up that cloudcontrol1003 is in the list. It might be fine and will catch up, but it also could crash r...
[16:13:03] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm)
[16:13:45] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm) @Ottomata One of the cloudbs is clouddb1021. FYI. I understand you likely won't be using it that late in the month, but I w...
[16:14:17] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm) lists1001 is a SPOF currently, we'll probably just announce a downtime when we get closer to the actual time
[16:28:02] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[16:29:39] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:35:43] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Ladsgroup) I can draft an announcement for downtime of lists.wikimedia.org, maybe we can use the time to increase its capacity (mor...
[16:36:20] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[16:44:12] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm)
[16:54:23] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (nskaggs) Impacted clouddb's will be clouddb1013, clouddb1014, clouddb1021. I believe interrupting traffic on 2 of 4 of the "web" r...
[16:55:22] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[16:56:27] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:59:22] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) @dcaro and @Andrew on the ceph and cloudvirts, I have concerns. We've seen that a lack of network to enough OSDs for a while will cause problems, and the cluster can...
[17:17:06] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) I'll make a meeting for our team to discuss. There is a ticket for row B as well :)
[17:23:05] Mail, Infrastructure-Foundations, SRE, Patch-For-Review, User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (faidon) >>! In T232343#7058654, @herron wrote: > **Lists:** Lists/mailman has an internet facing exim inst...
[18:02:24] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ayounsi)
[18:08:22] Mail, Infrastructure-Foundations, SRE, Patch-For-Review, User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343 (Legoktm) >>! In T232343#7194626, @faidon wrote: > While some of them could be mitigated (e.g. separate exi...