[06:26:14] 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Cookbook to depool a site in AuthDNS - https://phabricator.wikimedia.org/T334048 (10ayounsi) [06:41:22] 10netbox, 10Infrastructure-Foundations: Improve Netbox "locations" use - https://phabricator.wikimedia.org/T333948 (10ayounsi) Another way to turn the question is: 'what "locations" should match to?' and apply it consistently. To me, 1:1 match to rows is not relevant anymore. Indeed your suggestion is good fo... [07:10:30] 10netops, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 8 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ayounsi) p:05Triage→03Medium [07:13:04] 10netops, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 8 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ayounsi) [07:13:54] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 8 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Marostegui) [07:21:19] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 8 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Marostegui) @ayounsi to confirm, codfw will be depooled before this maintenance right? @akosiaris @Joe ? [07:30:47] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10akosiaris) Yes, we 'll have to depool codfw. [07:31:05] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10ayounsi) >>! In T334049#8757732, @Marostegui wrote: > @ayounsi to confirm, codfw will be depooled before this maintenance right? @akosiaris @Joe ? That's my un... 
[07:31:33] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10MoritzMuehlenhoff) [07:43:57] 10netops, 10Infrastructure-Foundations, 10SRE, 10serviceops, 10Patch-For-Review: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10ayounsi) 05Open→03Resolved a:03ayounsi This has been rolled to all k8s clusters. [08:00:05] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10ayounsi) 05Open→03Stalled Marking it as stalled until the cookbook is reviewed/merged. [08:01:10] 10netops, 10Infrastructure-Foundations, 10SRE: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10ayounsi) a:03cmooney [08:02:05] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Marostegui) [08:03:37] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10Marostegui) @jcrespo kindly check backup servers needs. Thanks [08:19:13] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10jcrespo) [08:32:00] 10netops, 10Infrastructure-Foundations, 10SRE, 10Traffic-Icebox, 10User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) https://www.juniper.net/documentation/us/en/software/junos/system-mgmt-monitoring/topics/ref/statement/enhanced-hash-key-e... [09:46:23] 10netbox, 10Infrastructure-Foundations: Improve Netbox "locations" use - https://phabricator.wikimedia.org/T333948 (10jbond) >>! 
In T333948#8757652, @ayounsi wrote: > Another way to turn the question is: 'what "locations" should match to?' and apply it consistently. To me, 1:1 match to rows is not relevant any... [09:47:32] 10netbox, 10Infrastructure-Foundations: Improve Netbox "locations" use - https://phabricator.wikimedia.org/T333948 (10jbond) p:05Triage→03Medium [11:06:00] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) @wiki_willy that's because all those have `N/A` in the Accounting tab of the spreadsheet in the `Asset tag` column and so they don't match. [11:33:42] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10hnowlan) [11:45:28] jbond / moritzm I'm done testing the updated partman cfg files. Nothing broke :-) [11:46:12] excellent :-) [12:07:06] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Core routers: replace bootp with dhcp-relay - https://phabricator.wikimedia.org/T320508 (10ayounsi) a:03ayounsi [12:09:58] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Allow managing drmrs DHCP settings with Homer - https://phabricator.wikimedia.org/T328737 (10ayounsi) a:03ayounsi Taking that task, even if the current CR does the job, it could be refactored with @cmooney work to remove the duplicated co... [12:11:28] 10netops, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 8 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10ayounsi) [12:28:00] 10netops, 10Infrastructure-Foundations, 10SRE: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845 (10ayounsi) 05Open→03Resolved a:03ayounsi This is completed in drmrs, the same will be applied to the other sites when we bring L3 on the ToR switches as I don't think... 
[12:40:21] slyngs: great thanks [12:55:13] topranks: XioNoX: chatting with volans about where to put the refreshed cumin host and he suggested the rack with the switch that has the uplink to the core router. which racks are they? [12:55:33] so VMs are not an option? :) [12:55:51] also not sure I understand the question [12:56:16] my own view is that we want our main management host to have a shared fate (but will leave the authoritative answer for volans) [12:57:12] basically I was thinking as a wishlist that if I had to pick a random place I'd put them in the racks that are more "reliable" and IIRC in our existing row-model we have 2 of them (X2 and X7?) that are physically more reliable, or am I wrong? [12:57:15] jbond: not 100% sure what the goal is [12:57:26] on the VM option side... I'm not sure [12:57:42] CPU-wise it's not a problem, RAM/disk-wise it depends on how data-persistence uses them [12:57:56] volans: the CRs connect directly to all 4 rows [12:58:05] XioNoX: do all the asws have an uplink to the cr's or is it just one (will leave to volans for motivation) if the latter then we'd prefer to be in the same rack [12:58:22] redundancy wise, it's true that we do have one per DC and so we need 2 ganeti clusters to fail [12:58:25] sure it’s a longer cable run, but I’m not sure any is “more reliable” than another [12:58:42] to make them both unavailable [12:59:01] we should not have preferential placement for servers [12:59:11] unless they are like heavy on bandwidth [12:59:34] ack thanks im convinced :) [13:00:18] volans: we could even have cumin VMs in ulsfo for example [13:00:25] if we want 3 sites [13:04:41] sure sure [13:05:09] do we want to make a quick poll here for the team? could cumin be slowed down in VMs? [13:06:03] what do you mean by "be slowed down"? 
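(Editor's note on the question above: Cumin's "concurrency" refers to fanning a command out over many parallel SSH sessions with a bounded worker pool. The sketch below is purely illustrative, not Cumin's actual code; `run_on_host` is a hypothetical stand-in for a real SSH execution.)

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for running a command over SSH on one host;
# a real orchestrator such as Cumin would open an SSH session here.
def run_on_host(host, command):
    return (host, f"ran {command!r}")

def run_batched(hosts, command, batch_size=5):
    """Fan a command out to many hosts with at most batch_size
    executions in flight, mirroring how an orchestrator bounds
    its SSH concurrency."""
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        futures = [pool.submit(run_on_host, h, command) for h in hosts]
        return [f.result() for f in futures]

# Ten hosts, at most three concurrent "SSH sessions" at a time.
results = run_batched([f"host{i}" for i in range(10)], "uptime", batch_size=3)
```

On a VM the bottleneck would be CPU/network contention during this fan-out, which is what the poll question is probing.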
[13:06:48] the cumin's concurrency, network-wise, the fact that it opens multiple ssh connections and such [13:06:52] I guess not, but just to be sure [13:08:17] well, the primary reason we have cumin on baremetal is to have a server to drive orchestration even if there's a major Ganeti outage, is this about adding additional Cumin VMs in the edges? [13:08:33] well i think im convinced it can probably be a vm, we can at least try it and see (netops 2, jbond 0 :P) [13:09:24] moritzm: the point was made earlier that we would need a failure in both sites for that to cause an issue and we could mitigate that risk by adding cumin in PoPs [13:09:34] moritzm: but having 2 in 2 different DCs means that we need 2 ganeti outages to not have the [13:09:47] to be weighed against the probability of ganeti failing on multiple sites, and cumin being the only/much faster way of solving the issue [13:09:50] and we can always reimage sretest1001 as a cumin host temporarily I guess [13:12:54] for my part I've do doubt cumin would work as a vm [13:13:13] I do see some sense in having "as few layers" that need to work to be able to use it in a fix [13:13:55] topranks: was that first one "i do doubt / i've no doubt" [13:14:08] "I've NO doubt" [13:14:09] sorry [13:14:16] thats what i thought thanks [13:14:17] yeah, I'm not convinced this is a good idea either. We're also treating cumin hosts as rather powerful hosts for DBs as well and not having virtualisation around it also adds an additional layer against a possible cross-VM compromise from a different VM [13:15:51] do we have other VMs that are "powerful hosts for DBs"? orchestrator? [13:16:08] not with the current build out of orchestrator [13:19:29] but if I'm the only one, feel free to proceed regardless. 
But I reserve the right to say "I told you so" when there's an issue with it :-) [13:19:44] * jbond hehehe [13:21:10] I'm on the fence tbh [13:23:17] there's also the issue that some cookbooks might actually exceed what we typically run on VMs, e.g. Search people have cookbooks to shuffle data between WDQS hosts [13:23:21] I'm happy there is an actual conversation, so I'll be happy with any conclusion [13:24:29] FWIW, most of the WDQS stuff is just polling, nothing intensive on the orchestration side [13:30:22] the intensive data transfers are usually P2P and the cumin host is just the orchestrator [13:30:45] the new swift cookbook will be the most I/O-intensive one, as it needs to copy a few ~100 MB files to the cumin host [14:04:23] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10nskaggs) As an update, this is now blocked on {T297596}. The previous implementation discussion led to a finalization of guidelines, w... [14:04:42] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10nskaggs) [14:16:43] 10netbox, 10Infrastructure-Foundations: Netbox: import from PuppetDB script creates VIP also if exists - https://phabricator.wikimedia.org/T278936 (10Volans) p:05High→03Medium a:05Volans→03None De-assigning and de-prioritizing as this didn't happen again. [14:45:05] topranks: hi! morning (afternoon) [14:45:11] maybe late for you so please ignore [14:45:17] but I was wondering when you plan on merging https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/822439/ [14:46:01] sukhe: hey [14:46:06] *strictly* not urgent and doesn't block us on anything, but just curious [14:46:10] still only getting warmed up :) [14:46:11] since it seems like you have already done the hard work [14:46:19] topranks: did you move to the US? 
:P [14:46:25] haha no [14:46:58] just need to get some reviews on it really, will merge once I get them [14:47:20] (people will think I put you up to this I suspect!) [14:47:45] np! [14:47:46] hahaa [14:47:49] you can tell them that :P [14:48:20] but more seriously, this will be immensely helpful for the upcoming LVS reboots [14:48:26] er, reimages [14:48:28] and hence the question [14:48:31] I tried to review it a few times, but that file has grown out of hand [14:48:39] I need to fully focus [14:48:50] <3 [14:48:56] that's also what triggered https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/905570 [14:49:07] again, not urgent and we are not doing the more complicated hosts (with multiple interfaces) for now [14:49:18] XioNoX: interesting [14:50:16] XioNoX: of course, the file is complex (not necessarily because of that patch) and there is a good bit in the CR [14:50:23] so no stress. [14:50:37] also lol my poor code is making us improve our CI :P [14:50:41] yeah, not because of that patch [14:51:07] just organic growth [14:53:04] The patch to add bandit and prospector looks good, idea makes sense [14:53:09] some of the LVSes are simple enough [14:53:14] but then there is stuff like https://netbox.wikimedia.org/dcim/devices/3654/interfaces/ [14:53:27] in which case, a human is probably better out of the picture than doing it manually :) [14:53:34] we have enough to go through first so we will do that [14:53:35] In terms of LVS I think it's the things we can do on the back of having that merged, like report on switch inconsistencies, that will help with them [14:54:15] sukhe: yeah that one is fun to see :) [14:54:32] and partly why I guess a new L4LB that doesn't need to be on every vlan is a good idea longer term. [14:54:48] yep! [14:54:54] https://phabricator.wikimedia.org/T332027 soon! 
[14:54:55] I think my patch - or more to the point representing the interface relations in netbox - is a big part of the puzzle there [14:57:06] we could also improve the "server provisioning" script (https://w.wiki/6YBH) dc-ops run to allow multiple switch ports to be selected [14:57:21] (one tagged primary), allow selection of vlans on others [14:57:46] or even just an "lvs" drop down that knows to add the required vlans on each port [14:58:17] yeah would be nice to reduce the human element of error here [14:58:18] The last bit then is we could move to drive the full network config from Netbox if that info was getting populated there in advance [15:06:00] 10netops, 10DBA, 10Data-Engineering, 10Discovery-Search, and 9 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (10colewhite) [16:34:42] I don't think it's worth spending time to implement multi-link server provisioning when their actual number is going down (WMCS T319184, LVS->L4LB) [16:34:43] T319184: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 [16:36:12] but +1 on driving the server's config from Netbox, with a question on provisioning a Netbox server if netbox is down :) [16:43:53] ok im off for the easter break now, enjoy all and see you after [16:44:05] enjoy the break [16:44:26] enjoy! [16:46:21] thanks you too :) [19:41:08] 10Mail, 10Infrastructure-Foundations, 10SRE, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10MarcoAurelio) Not sure if there's anything actionable here left to do. Lo... [20:30:35] 10netbox, 10DC-Ops, 10Infrastructure-Foundations, 10Observability-Alerting, and 2 others: validate what we need from the check_eth check - https://phabricator.wikimedia.org/T333007 (10cmooney) Above CR has a potential replacement to alert in the cases mentioned, but not require a list of net device names t... 
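(Editor's note on the provisioning discussion earlier: the "lvs drop-down" idea, a provisioning step that already knows which vlans each switch port needs, could be sketched as a pure helper like the one below. The role name, vlan names, and port names are invented placeholders for illustration, not data from the real provisioning script.)

```python
# Hypothetical vlan-assignment helper for a server provisioning script.
# Role and vlan names are illustrative placeholders only.
ROLE_VLANS = {
    "lvs": ["public1-a", "private1-a", "private1-b"],  # made-up vlan names
    "default": ["private1-a"],
}

def vlans_for_ports(role, ports):
    """Return a per-port plan: the first port is the primary (access/
    untagged on the role's first vlan), remaining ports carry the full
    tagged vlan set the role requires."""
    vlans = ROLE_VLANS.get(role, ROLE_VLANS["default"])
    plan = {}
    for i, port in enumerate(ports):
        plan[port] = {
            "mode": "access" if i == 0 else "tagged",
            "vlans": [vlans[0]] if i == 0 else vlans,
        }
    return plan

# e.g. an LVS-style host with two switch ports:
plan = vlans_for_ports("lvs", ["xe-0/0/1", "xe-0/0/2"])
```

Encoding this as data rather than manual port-by-port selection is exactly the "reduce the human element of error" point made above.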
[21:26:50] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Hi @volans - thanks for the details on S/N #7S5LMH3, 7S5MMH3, 7S5NMH3, 7S5PMH3, and 5BF90C3. The first four were deleted in error, which @RobH just fixed...and...
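(Editor's note on the earlier cumin-on-VM redundancy argument, that two Ganeti clusters in different DCs would both have to fail at once: under an independence assumption the joint unavailability multiplies, which is why adding a third VM in a PoP was floated. The per-cluster figure below is invented purely for illustration.)

```python
# Back-of-the-envelope joint-failure arithmetic; the per-cluster
# unavailability is an invented illustrative figure, not a measurement.
p_single_down = 0.001  # assumed chance a given Ganeti cluster is down

# Assuming independent failures, both clusters must be down together:
p_both_down = p_single_down ** 2

# A third cumin VM in a PoP lowers the joint probability further:
p_three_down = p_single_down ** 3
```

This is the trade-off stated in the chat: the remaining risk has to be weighed against how much faster cumin makes recovery when it is the tool needed to fix the outage.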