[08:45:49] I'm tracking down uses of bullseye-backports (which will soon be archived/removed, since backports is only supported for the lifetime of regular Debian releases). The last one to investigate/fix is dynamicproxy::api, which is cloud-only
[08:46:17] IIRC there's a cloud-internal cumin instance where I can check if the class is still used with bullseye, where can I find this?
[08:53:08] moritzm: there's cloudcumin1001/2001 which can access all VMs, but there's no cloud-wide PuppetDB to easily query that
[08:53:45] but that class is indeed still used on bullseye, I've been meaning to combine the bookworm upgrade with T379175 but the IPv6 project is still not ready :/
[08:53:46] T379175: Enable IPv6 for the Cloud VPS web proxy - https://phabricator.wikimedia.org/T379175
[08:56:18] that's good to know, thanks! then I'll simply import python3-flask-sqlalchemy to the "main" component of bullseye-wikimedia and will prep a patch to drop the apt::package_from_bpo in dynamicproxy::api
[13:48:40] I added (well, added back) several networking panels to labtesthorizon. In theory nothing dangerous is exposed unless you have the 'admin' role, but I'd appreciate it if you (or someone) would double-check before I make the change in eqiad1.
[13:49:04] maybe arturo? The new panels are 'network topology', 'networks' and 'routers'
[13:49:51] the fact that for me (as admin) it provides the option to just delete the entire cloud-flat-codfw1dev network makes me nervous, even though of course that same option is available via the CLI
[13:51:36] I will check next week! (today PTO)
[13:52:04] we indeed need to pay attention to these settings now, if we want this migration to move forward
[13:52:20] thanks
[15:39:17] hey cloud folks... hopefully an easy question
[15:39:39] what is the purpose / difference between the "wikimedia.cloud" and "wikimediacloud.org" domains?
[15:40:04] reason being I am adding IPv6 addresses to the cloudsw to enable it for the cloud realm networks in eqiad
[15:40:29] the DNS for this doesn't really matter much, but right now it's a little inconsistent
[15:40:41] I feel we should be using one or the other for the network infra reverse entries
[16:11:29] topranks: the theory is that .org is for infrastructure that is reachable publicly, .cloud is for internal infrastructure
[16:11:33] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS
[16:12:04] taavi: ok thanks, that's a good rule of thumb
[16:12:57] though it might be a little complex where the equivalent IPv4 address on an interface is private space, and we are using a v6 GUA on it for the v6
[16:13:17] still, it's not that important; I can review and keep it relatively consistent, which is the important thing
[16:13:18] thanks!
[16:19:09] I think wikimediacloud.org makes more sense for the cloudsw infra, the wikimedia.cloud ranges are delegated to the OpenStack authdns
[18:42:27] topranks: you doing things? Many of our servers seem to have just now fallen off the internet
[18:42:55] eh... yes... let me reverse what I did, sorry
[18:43:42] Oddly this might not be user-facing...
[18:43:54] but prometheus (and I) can't talk to any virts or osd nodes at the moment
[18:45:59] ok
[18:46:02] it's all been reversed now
[18:46:22] ok, yep, ssh is working again
[18:46:23] fwiw internet reachability was not affected from the cloudgw
[18:46:27] https://www.irccloud.com/pastebin/fL8OgZYl/
[18:46:40] which was sort of my barometer for things being ok, clearly that's not the case
[18:46:51] Yeah, so I guess it was only the private network that was wrong? So probably users didn't notice much, if anything.
[18:47:02] Certainly ceph's internal status (and openstack's) was totally happy
[18:47:35] but alertmanager, very unhappy!
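[Editor's note: the "reverse entries" in question are IPv6 PTR records for cloudsw interface addresses. A small sketch of what the PTR owner name looks like for a v6 GUA; the address here is the RFC 3849 documentation prefix, not a real cloudsw address, and whether the record points into wikimedia.cloud or wikimediacloud.org is exactly the policy question discussed above:]

```python
import ipaddress

# For IPv6, the PTR owner name is the address's nibbles reversed under
# ip6.arpa; Python's ipaddress module computes it directly.
addr = ipaddress.ip_address("2001:db8::1")
print(addr.reverse_pointer)
```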
[18:47:41] seeing some recoveries now
[18:47:48] yeah, I'm not really sure, it'll be tricky to try and work out what happened
[18:51:06] I think the impact was hopefully quite small, I was checking the "headline" things and it seemed ok
[18:51:21] for reference, what I did was enable OSPF on a bunch of interfaces on the cloud switches
[18:51:46] however BGP was still running... and unchanged
[18:52:05] so the only change should have been that some destinations started using a route learnt via OSPF instead of BGP
[18:52:18] For your notes: https://phabricator.wikimedia.org/T389672
[18:52:23] ok thanks
[18:52:25] I'm going to merge the other outage tickets into that one
[18:52:52] ok thanks
[18:53:11] I'll take stock and try to see if I can work out what happened next week
[18:53:32] sounds good. As far as I can tell things are working fine for now.
[18:53:48] And I agree that the impact was small. Just loud :)
[18:54:12] yeah... it's gonna totally wreck my head
[18:54:22] BW usage was steady across the network - I was keeping an eye on that
[18:54:50] anyway, it's instant karma for me deciding this was so risk-free I could just go ahead without making a big deal
[18:54:53] lesson learned
[18:55:54] It's kind of a nice demonstration of how little we use the private network
[19:04:04] this terminology is confusing... cos the network that seems to have had issues was the "wmf production realm" one
[19:04:28] and the one that seems to have been ok, the cloud vrf - has a bunch of networks/vlans called "cloud-private" :P
[19:04:57] but yeah, the network we use for SSH, reimaging, access to apt.wikimedia etc was the one affected, I think
[19:05:17] also used for prometheus scraping etc
[19:06:01] I also didn't lose SSH to the cloudgw, so it is a bit of a head-scratcher
[19:06:04] sorry for the hassle!
[19:22:37] no prob!
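[Editor's note: the mechanism behind "destinations started using a route learnt via OSPF instead of BGP" is protocol preference: when two protocols offer a route to the same prefix, the router keeps the one with the lower preference value, so an OSPF-internal route displaces a BGP one even though BGP is unchanged. A toy illustration; the preference values mirror Junos-style defaults (OSPF internal 10, BGP 170) and the next-hop names are made up:]

```python
# Lower preference wins; values are illustrative Junos-style defaults.
PREFERENCE = {"direct": 0, "static": 5, "ospf": 10, "bgp": 170}

def best_route(candidates):
    """candidates: list of (protocol, next_hop) for the same prefix.
    Return the entry the router would install in the forwarding table."""
    return min(candidates, key=lambda c: PREFERENCE[c[0]])

# Before the change only the BGP route existed; adding an OSPF route
# to the same prefix shifts traffic to a different next hop.
routes = [("bgp", "peer-A"), ("ospf", "peer-B")]
print(best_route(routes))  # ('ospf', 'peer-B')
```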