[05:22:20] Traffic, DC-Ops, SRE, ops-ulsfo: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[06:18:13] Traffic, DC-Ops, SRE, ops-ulsfo: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[06:18:28] Traffic, DC-Ops, SRE, ops-ulsfo: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[06:26:52] Traffic, DC-Ops, SRE, ops-ulsfo: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[06:38:59] Traffic, DC-Ops, SRE, ops-ulsfo: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[08:22:06] netops, Infrastructure-Foundations, ops-eqsin: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (ayounsi) p:Triage→High
[08:26:33] vgutierrez: there are active icinga alerts for ulsfo LVS BGP sessions, any ideas what's up? maybe it needs to be cleaned up from the recent LVS move?
[08:28:37] netops, Infrastructure-Foundations, ops-eqsin: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (ayounsi) @RobH I couldn't find any open task so I opened this one, in the future please make sure a task is opened as soon as any issue happens. Can you also let us know the im...
[08:39:35] XioNoX: probably. Lvs4005 got decommissioned yesterday
[08:47:07] netops, Infrastructure-Foundations, SRE, ops-eqsin: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (ayounsi) JTAC case 2022-1115-586910 opened.
[08:48:08] vgutierrez: is there a task? I can't find anything relevant with "Lvs4005"
[08:54:31] https://phabricator.wikimedia.org/T317247
[08:55:06] dunno if sukhe created anything specific for decommissioning lvs4005
[09:07:07] thx, I commented
[09:07:07] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ayounsi) @ssingh We have those 2 active alerts: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr3-ulsfo&service=BGP+status https:...
[09:11:55] netops, Infrastructure-Foundations, SRE, ops-eqsin: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (Volans) I would also like to understand why eqsin was not immediately depooled when it happened. The patch was already ready on Gerrit. There was [[ https://grafana.wi...
[10:00:12] Traffic, Data Pipelines, Data-Engineering-Planning, Foundational Technology Requests, and 2 others: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (elukey) >>! In T314981#8391260, @elukey wrote: > * Meeting between me Joseph Andrew Filipp...
[13:24:21] netops, Infrastructure-Foundations, SRE, ops-eqsin: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (ayounsi) Note that we will need to move the mastership back to `asw-0604-eqsin` to keep everything standardized. For that, better depool the site.
[13:51:53] Traffic, netops, Infrastructure-Foundations, SRE: Consider lowering IPv6 TCP MSS - https://phabricator.wikimedia.org/T283058 (ayounsi) Open→Declined We haven't seen any issue related to this so closing the task.
[14:53:28] Traffic, DC-Ops, SRE, ops-ulsfo: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (Papaul)
[15:16:55] Traffic, DC-Ops, SRE, ops-ulsfo: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (Papaul) CP5032 firmware info. This server is ready for OS install. ` System BIOS Version = 1.7.5 Firmware Version = 6.00.30.00 ` Note it looks like PSU1 is not plugged `...
[16:13:39] netops, Infrastructure-Foundations, SRE, ops-eqsin: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (RobH) >>! In T323094#8395125, @Volans wrote: > I would also like to understand why eqsin was not immediately depooled when it happened. > The patch was already ready o...
[16:17:48] netops, Infrastructure-Foundations, SRE, ops-eqsin: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (RobH) Order of operations, all of this was at or just before 5AM UTC. * Jin racks new hosts, and plugs in cp5032 - the port was NOT setup and the server was not in ne...
[16:41:22] netops, Infrastructure-Foundations, SRE, ops-eqsin, Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (jcrespo)
[17:20:37] netops, Infrastructure-Foundations, SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (cmooney)
[17:21:32] Traffic folks, if you had a minute I'd appreciate it if you could give me some advice
[17:22:06] I'm working on bringing the Spine switches in row E/F live, which ultimately will mean moving the cables from our LVSes in eqiad from where they currently land to those devices
[17:22:15] https://phabricator.wikimedia.org/T322937
[17:23:01] Relevant diagram of how it'll be cabled (matches current, just moving device):
[17:23:01] https://phab.wmfusercontent.org/file/data/daw74l5a4ac7lc2dkxkz/PHID-FILE-ktmrdudoqf77vz7dd7vk/LVS_Direct_Extension_NEW_RACKS.png
[17:23:42] The difference between the Leaf where they currently land and those Spines is the Spines only have QSFP ports
[17:24:01] Which means we need to use a 4x10G QSFP module if we want to land a 10G link from the servers
[17:24:03] No problems with that
[17:24:40] It does, however, mean we could potentially, say, terminate the links from lvs1017 and lvs1018 on the same optic (as they can each do up to 4 x 10G)
[17:24:51] This would save money and ports
[17:25:10] But obviously if that module fails then neither LVS can reach devices in those 2 rows.
[17:25:43] I expect that means we would be better off landing each on a dedicated port, but it'd be interesting to get your thoughts
[17:43:07] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS buster
[17:46:10] topranks: "ideal" would be landing each LVS as independently as we can (e.g. different cables, optics, switches), obviously. But it's not necessarily always worth it.
[17:46:29] but we can sort out the options based on the scenarios
[17:47:03] yeah, understood
[17:47:08] so: we have 4x active LVSes in eqiad, right.
17, 18, 19 are each the primary choice for 3 different sets of traffic
[17:47:18] 20 is the backup we fail over to for all 3 sets of traffic
[17:47:33] (in the case of an lvs/pybal failure/depool/whatever)
[17:48:01] yeah that's as I understood it from the last time.
[17:48:16] So there is no "shared fate" between 17 and 18 then, for instance?
[17:48:53] not really. I mean technically there could be in a very edge case where we're already deep in the weeds (e.g. if we had two different LVSes offlined for some reason, we might reconfigure one of the remaining to do more)
[17:49:09] but that's way out there in the probabilities. we'd probably depool public traffic from the site before we got there
[17:49:32] ok gotcha
[17:49:50] But 19 and 20 probably should not land on the same optic
[17:49:57] so, having the connection to the row for all of 17/18/19 share fate is kinda ok. If that optic/cable/whatever fails, even if it had only taken out one of those 3 sets of traffic, we'd still have a problem.
[17:50:35] But obviously that puts a lot more pressure on 20 than if it only had to take over and deal with the traffic volume from just 1
[17:50:37] it just means more stuff gets depooled from that row all at once
[17:50:50] yeah 20 is the tricky one
[17:51:10] oh ofc, the others would still be working for other rows, so it's the back-ends that are depooled
[17:51:15] but if the scenario is an optics failure rather than an lvs server failure, I'm not sure it matters much at that level either.
[17:51:23] seems to me it'd be a bad idea to put 19/20 on the same optic
[17:51:45] and thus, keeping things somewhat symmetrical, we should probably use 2 modules on the other switch, terminating 17/18 separately too
[17:51:47] basically if you did 17+18+19+20 on a single 4x10G to a given row, if the optic fails, we have to depool all lvs backends of all kinds, if they live in that row
[17:52:32] we could do 17+18+19+20 to a single 4x10G, and then do the exact same again to another 4x10G on a different switch, configured as a LAG/bond
[17:52:57] but I'm somewhat reluctant to go down that path as we don't do multi-chassis LAG elsewhere and it's relatively complex to set up
[17:53:04] yeah, I don't think we have any active bonding config puppetized for the LVS case
[17:53:15] I think we used to years ago, maybe, but my memory is fuzzy
[17:53:37] yeah, so let's not go down that road now, but remember it is an option in future possibly
[17:54:25] My gut says 19+20 should be separate. So that means using 3 modules. In which case we should probably just get 1 more and do them all separate
[17:54:27] yeah, or do it in pairs, I guess
[17:55:00] 19 does get more traffic than 17/18, and 17/18's services are easier to depool, so that makes sense
[17:55:24] it's relatively trivial to depool the public text/upload on 17/18, but much more impactful if we lose all the internal services routed on 19 (or 20, as backup to them all)
[17:55:47] hmm ok
[17:55:49] so you could do e.g. 17+19 to switchA and 18+20 to switchB or something
[17:56:17] either way, it's our hope this LVS architecture doesn't last too much longer into the future
[17:56:31] (the replacement will do tunneling and won't have these complications)
[17:56:36] thanks for the background info. sounds like it does make sense to do it that way (it's not right now, 19+20 both go to lsw1-f1, but we'll change it in the move)
[17:56:42] in terms of traffic, do we plan to aggregate the secondary with the highest bandwidth usage one?
[17:56:52] to "equalize" a bit
[17:57:08] yeah, which is why it's not worth putting too much effort in now, just want to make sure we don't decrease the redundancy
[17:57:11] I don't think any of them have enough traffic to warrant worrying about it
[17:57:27] better to work on the new LB than expend effort making the existing setup more robust
[17:57:28] LVS only does the inbound packets, not responses
[17:58:14] volans: the spine switches have a lot of bandwidth currently, so even if these 10G links are unbalanced it won't make much difference, they've got 200G each downstream to the top-of-racks
[17:58:21] [checking the bandwidth now]
[17:59:07] yeah, 17/18 peak around ~400Mbps, 19 more like ~1.6Gbps
[17:59:14] ack, negligible
[17:59:20] if we failed them all over to 20 together, even that would peak at ~2.4Gbps then
[18:00:31] it's funnier if you use the "traffic-class" names though:
[18:01:00] in eqiad, "high-traffic1" and "high-traffic2" are ~400Mbps each, while "low-traffic" is ~1.6Gbps
[18:01:41] heh... I first saw those names last week, must dig into what is in each category
[18:01:54] (the names probably made more sense way way way back, since the high-traffic classes are uncached public traffic, and the low-traffic side was the cache misses, basically. That all evolved as we gained more internal services in LVS that talk to each other)
[18:02:54] basically high-traffic1 is mostly the text HTTPS traffic for the wikis, high-traffic2 is mostly the upload.wikimedia.org image requests, and low-traffic is everything internal (mediawiki and some hundreds of other "internal" services)
[18:03:22] cool, thanks for the explainer
[18:03:57] there are some other little things in ht1/2 though, that got tacked on for various functional reasons, esp if public
[18:04:39] cloudelastic, ldap, maps.wm.o are also on ht2
[18:05:05] and wikireplicas
[18:05:13] and ncredir is also on ht1
[18:49:26] Traffic, DC-Ops, SRE, ops-ulsfo: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS buster completed: - cp5032 (**WARN**) - Removed from Puppet a...
[18:58:53] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2042.codfw.wmnet with OS bullseye
[18:59:39] fyi: we are reimaging cp2042 with bullseye, our first since we finished upgrading the cp hosts packages to bullseye: T321309
[18:59:39] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[18:59:59] while we will keep an eye out (and it will be depooled), please let us know if you see something broken
[19:00:02] thank you
[19:51:14] Traffic, SRE, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2042.codfw.wmnet with OS bullseye executed with errors: - cp2042 (**FAIL**) - Downtimed on Ic...
[19:53:28] Traffic, DC-Ops, SRE, ops-ulsfo: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ssingh) >>! In T322048#8396337, @Papaul wrote: > CP5032 firmware info . This server is ready for OS install. > ` > System BIOS Version = 1.7.5 > Firmware Version = 6.00.30....
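
Editor's note: to make the shared-fate discussion in the log above a bit more concrete, here is a minimal Python sketch (illustrative only; the option labels and data layout are mine, not an existing tool) that enumerates which traffic classes lose their path into rows E/F when a single 4x10G breakout optic fails. The host roles are taken from the conversation: lvs1017/1018/1019 are primaries for one class each, lvs1020 is the backup for all three.

```python
# Illustrative only: enumerate what a single failed QSFP breakout optic
# would take out under the landing options discussed in the log above.
PRIMARY = {
    "lvs1017": ["high-traffic1"],
    "lvs1018": ["high-traffic2"],
    "lvs1019": ["low-traffic"],
    "lvs1020": ["high-traffic1", "high-traffic2", "low-traffic"],  # backup for all
}

# Each inner list is one breakout optic and the LVS uplinks landed on it.
OPTIONS = {
    "single 4x10G for all": [["lvs1017", "lvs1018", "lvs1019", "lvs1020"]],
    "pairs (17+19 / 18+20)": [["lvs1017", "lvs1019"], ["lvs1018", "lvs1020"]],
    "one optic per LVS": [["lvs1017"], ["lvs1018"], ["lvs1019"], ["lvs1020"]],
}

for name, optics in OPTIONS.items():
    print(name)
    for optic in optics:
        # traffic classes whose reachability into the row depends on this optic
        hit = sorted({cls for host in optic for cls in PRIMARY[host]})
        shared_19_20 = {"lvs1019", "lvs1020"} <= set(optic)
        note = "  <- low-traffic loses primary AND backup path" if shared_19_20 else ""
        print(f"  optic {'+'.join(optic)}: affects {', '.join(hit)}{note}")
```

Running it shows why the conversation settles on keeping 19 and 20 apart: only the options that put lvs1019 and lvs1020 on the same optic make the low-traffic class lose both its primary and its backup path to the row at once.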
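A companion back-of-the-envelope check of the failover figures quoted in the log (~400 Mbps each for high-traffic1/2, ~1.6 Gbps for low-traffic, one 10G breakout lane per LVS); the function name and structure are assumptions for this sketch only.

```python
# Peak inbound traffic per eqiad LVS traffic class (Gbps), per the log:
PEAK_GBPS = {
    "high-traffic1": 0.4,  # lvs1017 primary; mostly text HTTPS (plus ncredir)
    "high-traffic2": 0.4,  # lvs1018 primary; mostly upload (plus cloudelastic, ldap, maps, wikireplicas)
    "low-traffic": 1.6,    # lvs1019 primary; internal services (mediawiki etc.)
}

LINK_GBPS = 10  # each LVS lands on one 10G lane of a 4x10G QSFP breakout


def backup_headroom(classes=PEAK_GBPS, link=LINK_GBPS):
    """Worst case: lvs1020 absorbs all three classes at once (inbound
    only, since LVS handles the incoming packets, not the responses)."""
    total = sum(classes.values())
    return total, total / link


total, util = backup_headroom()
print(f"worst-case failover peak: ~{total:.1f} Gbps ({util:.0%} of a 10G lane)")
# ~2.4 Gbps, i.e. well under a single 10G lane -- consistent with the
# "negligible" conclusion in the log.
```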