[05:15:15] netops, DC-Ops, Infrastructure-Foundations, SRE: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (Aklapper) a:RobH→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee o...
[05:17:10] Traffic, DNS, SRE: GSuite Test Domain Verification - https://phabricator.wikimedia.org/T223921 (Aklapper) a:mark→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and `T270544`). Please a...
[05:22:13] Traffic, SRE: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742 (Aklapper) a:MoritzMuehlenhoff→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May26 and Jun17, and...
[05:22:36] Traffic, SRE, Goal, Performance-Team (Radar), Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Aklapper) a:Vgutierrez→None Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee...
[09:34:13] Traffic, SRE, User-MoritzMuehlenhoff: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742 (MoritzMuehlenhoff)
[09:51:32] Traffic, SRE, Goal, Performance-Team (Radar), Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Ladsgroup) This is done, isn't it?
The performance issues are being mitigated by migrating to nginx light I think (someone needs to double check)
[09:54:51] Traffic, SRE, Goal, Performance-Team (Radar), Wikimedia-Incident: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (Vgutierrez) TLSv1.3 is up & running, performance issues are being mitigated by replacing ats-tls with envoy or haproxy in the short term :)
[10:11:28] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[10:39:53] vgutierrez: what is the status of the puppetization/test deployment of envoy?
[10:40:42] could you update betterworks for end-of-Q4 status?
[10:40:53] will do
[10:42:05] i think the idea was to do the same for haproxy at some point, is that right?
[10:42:20] thanks :)
[10:44:30] * question_mark is looking at Q1 planning somewhat
[10:44:30] yes, you're right
[11:39:20] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[11:40:12] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[11:41:23] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[11:42:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[11:42:53] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[11:43:04] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[12:09:42] netops, Infrastructure-Foundations, SRE:
Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[12:13:44] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[12:40:29] netops, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) Thanks for the ping - this needs some thought from the DB side. We have some of our misc db masters on row A - db1159 m1 A6. Affected services: bacula (...
[12:43:06] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) @Bstorm @nskaggs please see above - we might need to depool the affected clouddb* hosts.
[12:44:05] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) dbproxy1013 is the active proxy for m2. I will depool it next week.
[12:47:54] netops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi)
[13:00:35] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney) Hi @Marostegui thanks for the feedback. > Will this stop traffic on all switches at the same time? Or do you plan to d...
[13:05:11] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (fgiunchedi) With my observability and swift maintainer hats on, I think we're ok to tolerate a network blip, specifically: * ms-be...
[13:08:41] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui)
[13:09:01] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) >>! In T286032#7193704, @cmooney wrote: > Hi @Marostegui thanks for the feedback. > >> Will this stop traffic on al...
[13:14:34] netops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (ayounsi) The proper fix is T263277. However there are 2 options to get data quickly and temporarily: The easiest and "cleanest"...
[13:15:26] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) @cmooney do you know when you'll know how long this change can take?
[13:19:45] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (BBlack) Traffic-related bits: * dns1001 will need a manual depool so that it doesn't have knock-on effects on all of the other clus...
[13:20:40] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (ayounsi) @Marostegui "as the standby host is on row A too" that sounds like SPOF to me and should be moved to a different row. Due...
[13:21:36] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) >>!
In T286032#7193787, @ayounsi wrote: > @Marostegui "as the standby host is on row A too" that sounds like SPOF to me...
[13:27:57] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney) @Marostegui Our only real option to test is on new switches due to be installed under T277340. We are working with DC-Ops...
[13:29:25] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui) Ok, I think what we can do from our side is to get the replacement hosts ready but without failing over things to them,...
[13:29:55] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (jcrespo) Speaking on behalf of: ` dbprov1001 ms-backup1001 db1116 ` That could cause ongoing backup runs to fail, but that is "norm...
[13:49:22] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[14:55:17] netops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (fgiunchedi) Agreed option 1 seems easier and safer than option 2, the sampling isn't great but not the end of the world if we're...
[15:01:05] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (fgiunchedi)
[15:15:56] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[15:16:58] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[15:40:13] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[15:41:35] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:02:43] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[16:10:58] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm) @Andrew Just a heads up that cloudcontrol1003 is in the list. It might be fine and will catch up, but it also could crash r...
[16:13:05] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm)
[16:13:47] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Bstorm) @Ottomata One of the cloudbs is clouddb1021. FYI. I understand you likely won't be using it that late in the month, but I w...
[16:14:19] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Legoktm) lists1001 is a SPOF currently, we'll probably just announce a downtime when we get closer to the actual time
[16:28:04] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[16:29:41] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:35:46] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Ladsgroup) I can draft an announcement for downtime of lists.wikimedia.org, maybe we can use the time to increase its capacity (mor...
[16:36:22] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[16:44:14] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm)
[16:54:25] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (nskaggs) Impacted clouddb's will be clouddb1013, clouddb1014, clouddb1021. I believe interrupting traffic on 2 of 4 of the "web" r...
[16:55:24] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[16:56:29] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[16:59:24] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) @dcaro and @Andrew on the ceph and cloudvirts, I have concerns. We've seen that a lack of network to enough OSDs for a while will cause problems, and the cluster can...
[17:17:08] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) I'll make a meeting for our team to discuss. There is a ticket for row B as well :)
[18:02:26] netops, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ayounsi)
[18:02:50] Traffic, DNS, SRE: GSuite Test Domain Verification - https://phabricator.wikimedia.org/T223921 (HMarcus) Open→Resolved a:HMarcus Thanks, this can be closed.
[18:49:01] Traffic, SRE, serviceops, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Legoktm) It would also be nice if the cookbook could check all services, and then fail if at least one didn't verify...
[21:32:59] Traffic, SRE, GitLab (Initialization), Patch-For-Review, User-brennen: open firewall ports on gitlab1001.wikimedia.org (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (brennen)