[11:14:17] netops, Cloud-VPS, Infrastructure-Foundations: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (Majavah)
[11:14:48] netops, Cloud-VPS, Infrastructure-Foundations: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (valerio.bozzolan) It seems to me that the problematic gateway is maybe `ae2.cr2-esams.wikimedia.org` so maybe #ops-esams is interested.
[11:37:08] netops, Cloud-VPS, Infrastructure-Foundations: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Hi @valerio.bozzolan thank you for the report. For the affected users can you confirm the source IP they are coming from? I want t...
[11:49:41] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Also @valerio.bozzolan you should feel free to email the IPs to noc@wikimedia.org if you wish to avoid putting them here wh...
[11:51:14] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (valerio.bozzolan)
[11:51:28] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (valerio.bozzolan) I've added all the details in a nice private Paste visible to you (P22947) and added it in the Task description. T...
[11:56:06] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (valerio.bozzolan)
[12:20:20] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Thanks for the info @valerio.bozzolan It seems the return traffic to that address was routing out of our network to Telia...
[12:45:24] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Ok I've emailed Seabone/TI NOC now, hopefully they come back with something meaningful. There isn't a whole lot more we ca...
[13:32:02] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) @valerio.bozzolan the affected users are direct Telecom Italia customers is that correct? It certainly wouldn't hurt if th...
[14:27:11] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Hmm ok. I can see in the traceroute it now makes it a few hops further: ` cmooney@re0.cr2-eqiad> traceroute wait 1 no-reso...
[14:29:38] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Hmm ok. I can see in the traceroute it now makes it a few hops further: ` cmooney@re0.cr2-eqiad> traceroute wait 1 no-reso...
[15:01:00] hi team!
I have a quick question about upgrading Wikidough hosts (Ganeti VMs) from buster to bullseye: I am wondering about the best path forward before I undertake this endeavour, and want to make sure I am not missing anything
[15:01:42] is there a cookbook I can make use of? should I manually just update the hosts? should I deprecate the existing hosts and spin up new bullseye instances?
[15:02:00] there is no state on the Wikidough hosts (other than the relation with homer but that is just updating the IP addresses)
[15:02:40] (I have already built the packages for bullseye so that part is already done)
[15:02:43] thanks!
[15:07:12] sukhe: hey, so the reimage cookbook is still only compatible with physical hosts (ENOTIME but should be done at some point)
[15:07:48] usually, to prevent any unwanted behaviour and ensure that the puppetization works fine, creating new VMs and deleting the old ones is the preferred path AFAIK (but I'll let others correct me on this)
[15:07:55] thanks volans!
[15:08:10] for that you can use the decommission cookbook (that works on all hosts)
[15:08:31] and the makevm cookbook to create the new ganeti VM, this is if you don't need to keep the same IP or other weird state
[15:08:50] in-place upgrade is also an option, when the above might not be feasible.
[15:09:12] yeah, I think there are a couple of options but not exactly sure which is the right (or wrong) one
[15:11:01] none of what you said is wrong :)
[15:11:50] new VM makes sure that puppetization still works from a clean host and that it doesn't just work because it was upgraded in-place and had some resource already created with the current OS
[15:12:33] yeah that definitely feels cleaner
[15:13:32] I was secretly hoping there is a cookbook that would work for this that I wasn't aware of, not gonna lie :P
[15:13:37] at the cost of a bit more work, sorry (missing automation there)
[15:14:02] yeah we have 24 hosts (12 Wikidough + 12 durum) that need to be updated
[15:14:03] yeah I should tackle the makevm cookbook at some point to also do the installation
[15:14:10] sukhe: what's the timeline?
[15:14:33] it might help to prioritize that work maybe...
[15:15:22] 24 is a lot of VMs
[15:15:26] hah
[15:15:58] it's 10% of all VMs
[15:16:05] there is no timeline as such and even the Wikidough project itself has not been decided but I wanted to get to it before it is approved and before the community release
[15:16:30] I also wanted CAP_BPF, which is only available in kernels 5.8+ and I guess that's my main intention
[15:16:40] (for eBPF filtering on the doh* hosts)
[15:16:57] so yeah, not urgent in any way but just something that needs to be done at some point
[15:19:10] ack, 10% of all VMs seems a good enough incentive for me tbh, if I would tell you that say in a month or so from now there will be a cookbook... would you wait for that/
[15:19:14] ?
[15:21:21] you are too kind :P
[15:21:44] I think we can wait yeah but I also don't want to put any pressure in any form on you or the rest of the team
[15:25:24] also, slightly related, would you create the new ones and then remove the old ones?
[15:26:25] I was thinking I will depool doh1001 and then recreate it (traffic would still go to doh1002 or some other place)
[15:26:31] haven't thought it out completely
[15:35:40] ack
[15:38:19] SRE-tools, DBA, Infrastructure-Foundations, Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Ladsgroup) The code to do db switchover is https://github.com/wikimedia...
[15:42:11] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Jclark-ctr)
[15:45:22] SRE-tools, DBA, Infrastructure-Foundations, Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (Marostegui) That's the main thing and what {T196366} also needs. The di...
[15:45:45] sukhe: if lmata agrees, I'll try to prioritize that work a bit more and maybe we can get a working version in time for your upgrade. No hard promises though.
[15:47:05] i'll catch up and report back in a bit sukhe
[15:47:20] :-)
[16:23:17] volans: thanks, I think this is most certainly not urgent but might be helpful for the other cases as well. we should do it at some stage though I don't think it requires immediate attention
[16:23:21] lmata: :)
[16:28:23] to be clear, the intention of my question was definitely not to steer you into doing it but more like what I should be doing to do the upgrade :P
[16:30:27] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BBlack) Arzhel and I discussed this a bit, and we're going to add a few more countries manually for now before proceeding with the esams-resiliency...
[16:31:17] netops, Infrastructure-Foundations, SRE, Traffic, Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (BBlack)
[16:40:02] :D
[17:33:50] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (valerio.bozzolan) Maybe totally unrelated, but maybe yes: https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thr...
[17:55:49] netops, Cloud-VPS, Infrastructure-Foundations, SRE: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (RhinosF1) That wasn't sent until way after your issues started nor were fixed.
[20:29:01] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Jclark-ctr) cloudstore1010 B7 U41 port12 cableid #5014 cloudstore1011 C4 U1 port23. cableid #20220273
[20:29:23] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Jclark-ctr)
[20:29:55] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Jclark-ctr) a:Jclark-ctr→Cmjohnson