[02:37:27] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (Papaul) @cmooney I was about to update the table but I can't only you can. So for everything going from A1 to Bx and A8 to Bx should be 12m (x=1,2,3,4,5,6,7,8). I will g...
[07:49:17] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (ayounsi) > @ayounsi interested to hear your thoughts, personally my instinct is to stick with the Spine1->CR1 and Spine2->CR2 setup, keeping things the same as Eqiad. A...
[07:50:34] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (ayounsi)
[08:38:01] hi folks, qq - for https://gerrit.wikimedia.org/r/c/operations/puppet/+/893008 I need to restart pybal right?
[08:38:15] * vgutierrez looking
[08:38:24] elukey: indeed
[08:38:42] lovely I completely forgot about it
[08:39:12] is it something that we can do now or is it better next week?
[08:39:21] I don't really have any timeline, it is just a clean up
[08:39:57] elukey: I think we should do it, 3 days with a different config loaded than on disk is already bad enough :)
[08:40:06] "bad"
[08:40:50] elukey: I can take care of that if needed
[08:41:56] vgutierrez: nono lemme do it, I'll do some prep work in the next mins and write down a plan
[08:42:03] ack
[08:42:04] if you +1 it I'll do it, does it sound ok?
[08:42:06] perfect
[08:46:39] vgutierrez: ok so it should be simple - restart on lvs2010 + check logs + restart on 2009, same thing for lvs1020 and 1019
[08:46:45] indeed
[08:46:56] maybe in the opposite order?
[08:47:03] eqiad -> codfw
[08:47:08] yes right switchover time :)
[08:47:09] considering we in that time of the year
[08:47:14] good point!
[08:47:16] *we are
[08:47:17] I'll start with eqiad
[08:50:49] eqiad done :)
[08:51:22] and I see the new check on the nodes
[08:56:35] and codfw done
[12:37:40] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (cmooney) a:cmooney
[12:43:30] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney) @ayounsi thanks for updating the desc! @papaul I'll update the table with the info provided and get back to you if any more questions. I'll also put together...
[12:49:10] netops, Infrastructure-Foundations, SRE: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[14:15:08] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 10 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (MoritzMuehlenhoff)
[14:17:44] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (jbond)
[14:18:06] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (ayounsi)
[14:18:32] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (MoritzMuehlenhoff)
[14:19:06] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (MoritzMuehlenhoff)
[14:22:12] lvs1013 is out of production with https://phabricator.wikimedia.org/T301142 but still active in Netbox https://netbox.wikimedia.org/dcim/devices/121/ I'm wondering if it's missing a decom task?
[14:22:30] saw that while looking at https://phabricator.wikimedia.org/T329073
[14:26:36] I'm not sure if the status is wrong or questionable, but in general these are decoms that are being held for future testing (for L4LB project)
[14:26:58] they're currently up and running in role(insetup_noferm)
[14:27:58] so if you're just looking for how to handle downtime/depool: lvs101[3456] can be ignored
[14:28:17] yeah, I was worried a bit to see 2 LVS, in the list
[14:28:27] but as long as it's known, etc
[14:31:00] is there a benefit from leaving it in that state vs. running the decom cookbook then re-provision it? Faster turnaround? I'd expect the host to need to be re-numbered like lvs-test, or even not have lvs in the name anymore
[14:45:00] Traffic, DBA, Data-Engineering-Planning, Infrastructure-Foundations, and 8 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (lbowmaker)
[14:45:33] Traffic, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (lbowmaker)
[14:55:24] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (elukey)
[14:58:02] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (fgiunchedi)
[15:10:35] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (herron)
[15:17:43] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (BTullis)
[15:22:20] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (hnowlan)
[15:36:59] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) thanks for the update! Please let me know if there is something I can do to help with this (...
[16:14:59] Traffic, SRE: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ssingh)
[17:25:50] Traffic, DBA, Data-Engineering-Planning, Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (colewhite)
[18:06:48] Traffic, SRE, Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (BCornwall) Open→Stalled
[19:56:50] Traffic, SRE: purged rdkafka crashes: assert: rkq->rkq_refcnt > 0 - https://phabricator.wikimedia.org/T293605 (BCornwall) Open→Resolved a:BCornwall I'm not seeing evidence that this is an issue any more. Please re-open if this re-occurs!
[20:11:09] Traffic: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (BCornwall) Open→Invalid I went through historical metrics and found no evidence that this is still occurring. Considering the cp nodes have been reimage...
[20:14:08] Traffic, DNS, SRE: Need Assistance adding DNS records to claim domain - https://phabricator.wikimedia.org/T300076 (BCornwall) Stalled→Resolved a:BCornwall Looks like this was forgotten to be resolved.