[07:12:17] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[07:42:08] XioNoX, topranks: what would be a good|easy way of finding out which rack every interface of the LVS boxes on eqiad and codfw is connected to? LLDP neighbor shows the row name but not the specific switch
[07:42:41] vgutierrez: LLDP :)
[07:43:08] hmm chassis.name and chassis.mac aren't specific enough
[07:43:09] netbox API is also an option
[07:43:34] let me check, the interface should be there
[07:43:45] yeah, I think it should give you the port name on the switch side? The first digit of which represents the particular member switch?
[07:43:55] oh, let me retrieve that then
[07:44:07] If not I've a bunch of netbox scripts I've been using to prep for today's maintenance, I can run that report off netbox api for you easily.
[07:44:25] right
[07:44:28] PortID
[07:45:17] I think it's also in PuppetDB :)
[07:45:39] or we can pull the data from the switches :)
[07:48:03] BTW, regarding T286069: as b.black mentioned on https://phabricator.wikimedia.org/T279457#7038822, every lvs is going to lose connectivity with row D
[07:48:03] T286069: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069
[07:48:50] vgutierrez: what's the best course of action?
[07:49:14] so that isn't a blocker for text or upload, but I don't know if every service living on the low-traffic LVS will be able to survive losing the servers living in row D
[07:55:41] I guess that should be already handled by the service owners cause you already listed all the servers being affected (basically all living in row D)
[07:56:14] and b.black's mention is about the fact that we should expect icinga alerts on every LVS instance on eqiad, as each of them is going to lose the link on one NIC
[07:57:29] maybe icinga won't have time to alert
[07:57:41] we're talking about seconds of outage
[07:59:22] so I can't see any other problem.. as T286069 already lists every server in row D
[07:59:22] T286069: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069
[07:59:45] hmm are there any multi-NIC/multi-row boxes besides lvs living in row D?
[08:00:27] cause I see that you only listed lvs1016 there and technically every lvs is affected by that
[08:01:11] multi-row no, only the LVS
[08:01:27] cool, then we should be OK :)
[08:01:28] yeah, it's because only lvs1016 is physically located in row D
[08:01:38] (famous last words)
[08:02:14] eh :)
[08:02:29] BTW, https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors isn't accurate anymore right XioNoX?
[08:02:42] I guess we need to stop bird in dns1002 instead of pdns_recursor
[08:03:11] or some kind of systemd unit dependency that I'm missing is going to trigger a bird shutdown as soon as we stop pdns_recursor?
[08:03:36] yep exactly
[08:04:07] also BFD will remove the prefix in less than 1s
[08:04:28] so depending on what's acceptable we could just do nothing
[08:07:25] quoting you.. let's depool it cause it's fairly easy to do so :)
[08:08:52] good :)
[08:10:24] which channel is going to be used to coordinate all of this?
[08:10:39] topranks: ^
[08:11:25] XioNoX: what are your thoughts?
[08:11:26] and when should we depool the servers? 17:00 CEST? before?
[08:11:33] sre-foundations?
[08:11:57] whenever you want. There is no need to be right up to the wire I would have thought.
[08:12:32] topranks: no strong preference, -sre might be better for visibility, without having the -operations spam
[08:13:00] so.. from the ticket I understand that we will lose network connectivity at 17:00 CEST, so I guess that earlier than that would be wise :)
[08:13:04] cool that's the kind of steer I needed.
[08:13:18] vgutierrez: sgtm
[08:13:49] vgutierrez: yep exactly, sorry if I wasn't clear. Our hope is that at 17:00 we are in a position to roll with the change, so in advance of that yes.
[08:14:06] it will probably not happen at 17:00 sharp, but it's good to aim for that so we don't hold everybody
[08:14:13] understood
[08:15:20] mmandere: ^^ it looks like we should handle our side of things around 14:30 UTC / 16:30 CEST / 17:30 EAT, that works for you?
[08:18:19] mmandere: Absolutely that's great.
[08:23:03] hmm am I missing messages?
[08:24:26] vgutierrez: i don't see anything from mmandere either, if that's what you mean
[08:24:37] yup, thanks majavah :)
[08:24:41] sry had not had coffee :)
[08:25:07] topranks: errr I guess that's not acceptable even in your TZ
[08:25:30] vgutierrez: I was replying to your message to m.mandere, like an idiot.
[08:25:34] certainly it is not :)
[08:25:47] let me brew more ☕
[08:26:20] sigh... c.danis completely messed up with my brain.. damn emojis
[08:54:44] 🤣
[08:57:56] Wikimedia-Apache-configuration, Wikidata, Wikimedia-Site-requests, wdwb-tech, and 3 others: wikidata.org/entity/Q12345 should do content negotiation immediately, instead of redirecting to wikidata.org/wiki/Special:EntityData/Q36661 first - https://phabricator.wikimedia.org/T119536 (Addshore)
[09:02:24] Wikimedia-Apache-configuration, Wikidata, Wikimedia-Site-requests, Patch-For-Review, and 3 others: wikidata.org/entity/Q12345 should do content negotiation immediately, instead of redirecting to wikidata.org/wiki/Special:EntityData/Q36661 first - https://phabricator.wikimedia.org/T119536 (Addshore)
[09:02:56] vgutierrez: That's fine with me
[09:32:24] Traffic, SRE, Patch-For-Review: False positives on PyBal IPVS diff check - https://phabricator.wikimedia.org/T286913 (Vgutierrez) Open→Resolved a:Vgutierrez
[09:53:00] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (aborrero)
[09:54:35] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (aborrero)
[11:08:33] Traffic, DC-Ops, SRE, Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (Vgutierrez) eqiad: | Host | Row | Host iface | switch iface| | lvs1013|**A**|enp4s0f0|xe-7/0/34| | lvs1014|A|enp4s0f1|xe-4/0/18| | lvs1015|A|enp5s0f0|xe-2/...
[12:34:09] netops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (ayounsi) Pushing the following (and similar on cr2) should do the trick. As it's only for a few days, and it would not be trivial...
[12:38:50] volans: sre.hosts.downtime can parse a NodeSet definition like cp[1087-1090].eqiad.wmnet?
[12:39:48] yes any cumin query
[12:40:33] cool, thx
[12:48:04] vgutierrez: how dare you ask these questions, with cumin and spicerack you can also brew coffee if needed
[12:48:08] :D
[12:48:39] but would it allow me to keep track of the pressure, temperature and brew time?
[12:49:24] vgutierrez: for that you just need to nerd-snipe Riccardo
[12:49:55] (some comments about how the coffee doesn't taste right will suffice)
[13:15:10] We need to get a Prometheus exporter clearly
[13:15:20] There is always RFC2325 but SNMP is so old hat
[13:15:21] https://datatracker.ietf.org/doc/html/rfc2325
[13:49:01] netops, Infrastructure-Foundations, SRE, Datacenter-Switchover: Record traffic flows in and out of eqiad during switchover - https://phabricator.wikimedia.org/T286038 (cmooney) Looks good to me @ayounsi if you want to commit. I would totally agree btw, Netflow is probably handled in silicon, o...
[13:59:58] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (ArielGlenn)
[14:03:13] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[14:33:17] Traffic, SRE, observability, Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (lmata)
[14:37:29] Traffic, SRE, observability, Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (fgiunchedi) The idea LGTM overall, something to lookout for though is that geo country in metric labels (if that's the implementation) could potent...
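Editor's note on the 12:38 NodeSet question: a host expression such as cp[1087-1090].eqiad.wmnet stands for four hostnames. The snippet below is only a minimal sketch of that expansion for a single bracketed numeric range. It is not how cumin or ClusterShell implement it, and the function name expand_host_range is made up for illustration.

```python
import re

# Matches exactly one bracketed numeric range, e.g. "cp[1087-1090].eqiad.wmnet".
_RANGE_RE = re.compile(
    r"(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)"
)


def expand_host_range(pattern: str) -> list[str]:
    """Expand one "[start-end]" range into individual hostnames.

    Hypothetical helper: handles a single range only. Patterns without
    a range are returned unchanged as a one-element list.
    """
    match = _RANGE_RE.fullmatch(pattern)
    if match is None:
        return [pattern]
    start, end = match["start"], match["end"]
    width = len(start)  # keep zero padding such as "01"
    return [
        f"{match['prefix']}{number:0{width}d}{match['suffix']}"
        for number in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    for host in expand_host_range("cp[1087-1090].eqiad.wmnet"):
        print(host)
```

For real use, the actual query would be passed to cumin as-is; the sketch only shows what the bracket notation means.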
[14:40:58] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 4 host(s) and their services with reason: eqiad row D maintenance ` c...
[14:43:19] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Vgutierrez)
[14:48:42] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance ` d...
[14:49:42] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Vgutierrez)
[14:51:20] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 1:00:00 1 host(s) and their services with reason: eqiad row D maintenance ` l...
[14:55:38] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[15:04:56] Folks just a reminder we're kicking off T286069 now shortly - reconfigure eqiad row D switch buffers.
[15:04:57] T286069: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069
[15:05:10] Will update in #wikimedia-sre, hopefully it'll be uneventful but be advised.
[15:05:12] we're ready for you :)
[15:06:16] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[15:11:44] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[15:12:25] cool thanks :)
[15:52:23] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney) All works complete, no signs of any issues really, I had no ping loss on 16 pings towards 2 hosts connected off each member switch. Ver...
[15:53:21] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[17:08:52] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Andrew)
[17:53:39] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Bstorm)
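Editor's note on the 07:43 LLDP exchange: the suggestion there was that the first digit of the switch-side port name identifies the member switch. Assuming Junos-style names of the form type-fpc/pic/port (for example xe-7/0/34 from the 11:08 audit table), and assuming the first number maps to the virtual-chassis member as suggested in the chat, a minimal sketch could look like this. The function name member_switch is hypothetical.

```python
import re

# Junos-style interface name: <type>-<fpc>/<pic>/<port>, e.g. "xe-7/0/34".
_IFACE_RE = re.compile(r"^(?P<type>[a-z]+)-(?P<fpc>\d+)/(?P<pic>\d+)/(?P<port>\d+)$")


def member_switch(port_id: str) -> int:
    """Return the first number of the port name (assumed to be the member switch).

    Hypothetical helper: raises ValueError for names that do not follow
    the <type>-<fpc>/<pic>/<port> layout.
    """
    match = _IFACE_RE.match(port_id)
    if match is None:
        raise ValueError(f"unexpected interface name: {port_id!r}")
    return int(match["fpc"])


if __name__ == "__main__":
    # Port names taken from the audit table quoted at 11:08.
    for port in ("xe-7/0/34", "xe-4/0/18"):
        print(port, "-> member", member_switch(port))
```

The PortID value reported by LLDP on the host would be the input here; mapping a member number to a physical rack still needs a lookup in netbox or similar.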