[00:56:01] * bd808 off
[11:01:16] XioNoX: topranks: can you review https://gerrit.wikimedia.org/r/c/operations/homer/public/+/974501/? this is blocking my homer run to update/fix the cloudcontrol1006 definition
[11:02:28] taavi: https://netbox.wikimedia.org/extras/changelog/146774/
[11:02:43] that's why it's now generated by Netbox
[11:02:59] +1
[11:03:05] thanks
[12:47:15] taavi: hey, just catching up on a few bits
[12:47:28] the labs-filter patch looks fine to add (https://gerrit.wikimedia.org/r/c/operations/homer/public/+/973769)
[12:49:56] out of curiosity I'm trying to understand this traffic flow: the idea is that clients in the cloud-private vlan will make connections to the cloudlb, which will load-balance to the clouddb "realservers" from its 10.x addressing in the prod realm?
[12:50:49] My instinct is that the clouddbs, on normal WMF 10.x space, are best sitting behind the normal WMF LVS load-balancers, but I may be missing the full picture
[13:03:26] topranks: so the current access flow for the wiki replicas is cloud vps vms --(neutron vip)--> clouddb-wikireplicas-proxy-*.clouddb-services.eqiad1.wikimedia.cloud --(haproxy backend connection, via LVS)--> dbproxy1018/9.eqiad.wmnet --(haproxy backend connection)--> clouddb*.eqiad.wmnet
[13:04:06] and we were thinking of making that cloud vps vms --(bgp-announced vip)--> cloudlb --(haproxy backend connection)--> clouddb*.eqiad.wmnet
[13:06:13] it's a service offered from prod hardware exclusively to cloud vps vms, so I think it fits the cloudlb model better than using prod LVS. the vlan the clouddb hosts will end up in is still an open question, and we should talk to data persistence about that at some point
[13:07:31] the reason we want to do these load-balancing changes now is that it makes automating some really annoying maintenance workflows a bit easier than doing it on the already-live dbproxies
[13:20:06] taavi: ok. I can see where you're going with it
[13:20:36] I think it probably makes sense to move clouddb into the cloud racks longer-term, and have the cloudlb -> clouddb comms go over the cloud-private vlan
[13:21:15] and let the clouddb boxes be the ones doing the "cross-realm" traffic, i.e. answering queries from clients (via cloudlb) on the 172.x on one side, and communicating with whatever they need to in WMF over the 10.x on the other
[13:23:40] No objection to this approach for now, but I think in general the cloudlb should not be load-balancing to things outside the cloud realm, so we should probably try to rework it when we can
[13:43:08] topranks: yep, I agree this should not be the end state. same thing applies to cloudelastics
[13:43:13] thanks
[13:43:27] I spent way too much time today before finally figuring out that the reason my ruby test app's migrations weren't working is that --mount defaults to none for buildservice-based jobs nowadays. I have updated the docs on wikitech to hopefully make it more obvious
[13:44:56] blancadesal: can we please have the example use toolsdb instead of telling people to use sqlite on nfs?
[13:48:43] XioNoX: homer shows me a diff for https://gerrit.wikimedia.org/r/c/operations/homer/public/+/974472 on the cr-codfws, ok to deploy that?
[13:48:54] taavi: yep
[13:49:01] thanks
[13:49:01] I'm rolling it but it's taking ages
[13:52:48] taavi: a job run against a buildservice image could be non-db-related too, e.g. processing and saving some files to nfs. I think it's still relevant to be clear about needing to mount the tools dir in those cases. About not using sqlite in sample apps/tutorials – that makes sense if the goal is to discourage nfs use
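A minimal sketch of what taavi asks for above: pointing a sample app at ToolsDB rather than SQLite on NFS. It assumes the standard Toolforge credential file ~/replica.my.cnf, the pymysql client library, and a hypothetical database name; it is an illustration, not the official documentation example.

```python
# Sketch only: connect a Toolforge tool to ToolsDB instead of using SQLite on NFS.
# Assumes pymysql is installed and the tool's credentials are in ~/replica.my.cnf.
import configparser
import os

import pymysql

cfg = configparser.ConfigParser()
cfg.read(os.path.expanduser("~/replica.my.cnf"))
user = cfg["client"]["user"].strip("'\"")
password = cfg["client"]["password"].strip("'\"")

conn = pymysql.connect(
    host="tools.db.svc.wikimedia.cloud",   # ToolsDB service address
    user=user,
    password=password,
    database=f"{user}__mytool",            # hypothetical name: <credential user>__<suffix>
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    conn.close()
```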
[13:53:41] what needs to happen for toolforge tools to have object storage?
[13:56:20] taavi: sorry, everyone is pressuring you, but could you +1 this if you get a moment?
[13:56:21] https://gerrit.wikimedia.org/r/c/operations/dns/+/974534
[13:56:58] blancadesal: yeah, there will still be cases where tools need nfs access, but sqlite has never been one of those given the horrible performance it has :-P I think somewhere in our docs it directly says "do not put SQLite or any other databases on NFS"
[13:57:07] topranks: looking
[13:58:04] and a related question: with the change we discussed, with cloudlb talking to clouddb directly, does that remove the need for the "dbproxy" hosts?
[13:58:41] I notice the _cloud-support1-c-eqiad_ vlan still exists, but only has LVS and dbproxy1018 on it
[13:58:41] https://netbox.wikimedia.org/ipam/prefixes/110/ip-addresses/
[14:00:59] the task number in that patch seems odd, otherwise it seems ok
[14:01:36] taavi: heh, perhaps that answers my second question then :P
[14:01:52] topranks: and yes, moving wiki replicas to cloudlb would mean that those two specific dbproxies would no longer be needed in that setup
[14:01:54] is there an appropriate phab task for removal of cloud-support1-a-eqiad?
[14:02:51] I mean, it'd be an appropriate task for removing the other cloud-support1-c-eqiad, but not this one
[14:02:58] let me see if I can find something
[14:25:35] taavi: I ended up merging that patch as-is cos we needed an emergency depool in esams and it was blocking
[14:33:38] taavi: is there still an issue with cloudcontrol200*-dev? I notice that puppet is still disabled on cloudcontrol200[45]-dev
[14:34:03] andrewbogott: you may also be able to answer (as I just saw you're online)
[14:34:08] jbond: still super broken, at least when I went to bed last night
[14:34:52] do you know if it's ok to enable puppet and migrate to puppet7, or would you prefer I leave it?
[14:35:03] topranks: Possibly of interest, here's me running the decom script on a bunch of hosts on the .private subnet: T351010 <- probably the first time that's happened, you might want to double-check the cleanup
[14:35:05] T351010: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010
[14:35:33] jbond: It's as broken as it's going to get, go ahead and enable and migrate as long as you aren't expecting clean puppet runs
[14:35:46] andrewbogott: ack, thanks
[14:37:50] andrewbogott: thanks, yes, good call
[14:38:04] it's not at all impossible that it won't remove the IP on the cloud-private, I'll double-check
[14:38:47] andrewbogott: I see all the cloudvirts have been reimaged except the "local" ones. I have a few meetings today so it's not the best moment, but maybe we could reimage those tomorrow if you're around?
[14:39:27] dhinus: sure.
[14:39:40] I don't remember, do the wdqs ones also need reimaging?
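On the Netbox checks mentioned above (what is left on cloud-support1-c-eqiad, and double-checking the post-decom cleanup): a hedged sketch of how such an audit could be scripted with the pynetbox client. The prefix and the token environment variable are placeholders, not the actual values in use.

```python
# Hypothetical sketch: list what is still assigned inside a Netbox prefix, e.g. to
# confirm a vlan is nearly empty or that a decom actually removed a host's IPs.
import os

import pynetbox

nb = pynetbox.api("https://netbox.wikimedia.org", token=os.environ["NETBOX_TOKEN"])

PREFIX = "10.64.37.0/24"  # placeholder: whichever prefix you want to audit
for ip in nb.ipam.ip_addresses.filter(parent=PREFIX):
    print(ip.address, ip.dns_name or "(no dns name)")
```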
[14:41:13] hmm, they didn't come up in my cumin search for openstack packages
[14:41:14] but I'm not sure
[14:41:40] * andrewbogott checks
[14:41:41] looks like they're already on bookworm
[14:42:17] I think taavi reimaged those in T346948
[14:42:18] T346948: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948
[14:42:30] yep, they're bookworm already
[14:42:38] so only three left (plus one broken one)
[14:43:21] yep
[14:45:45] I've updated the description at T345811
[14:45:46] T345811: [openstack] Upgrade eqiad hosts to bookworm - https://phabricator.wikimedia.org/T345811
[15:38:24] jbond: will you let me know when you're done with cloudcontrol200x-dev?
[15:40:12] andrewbogott: I'm done
[15:53:31] Great! Now I can get back to failing to fix galera
[16:38:57] bd808: re: soup dumplings, social media keeps trying to sell me these frozen soup-dumpling-by-mail products, have you ever tried any of those? (I assume they involve boxes of dry ice, which might be reason enough to order)
[16:40:05] andrewbogott: no, I haven't tried such things. Once very, very long ago I bought some Omaha Steaks and that cured me of mail-order food. :)
[16:41:20] It's possible that one of my local groceries has them anyway, I should probably start there.
[16:42:51] I just save my cravings up for trips to SF. That used to be a pretty frequent event, but not so much in recent years. I did get salt & pepper crab on my SF trip in September, but no soup dumplings. :(
[16:43:39] The R&G Lounge takes reservations now, which makes that much easier than standing in line for 3 hours on a Friday evening. :)
[16:45:03] oh nice!
[16:45:09] * blancadesal is googling soup dumplings
[16:46:01] are those the Chinese ones?
[16:48:40] blancadesal: yes, https://en.wikipedia.org/wiki/Xiaolongbao
[16:49:59] hm... reading that article I guess Xiaolongbao means different things in different places; I'm thinking of the Shanghai kind I guess
[16:52:09] for some reason different types of Chinese dumplings have become extremely popular in Milan in the past ~10 years, but I'm not sure I ever tried the Shanghai kind
[16:52:16] definitely keen to test the SF ones :)
[16:53:17] tbh they get called 'shanghai dumplings' in English, but I've mostly had them at Din Tai Fung, which is a Taiwanese chain. So don't believe anything I'm saying about regional specifics.
[16:53:43] And the place Bryan and I have gone the most is called Xian, and Xian is pretty dang far from Shanghai, so ???
[16:54:17] https://maps.app.goo.gl/35d6qoNkJoJ2YRev8
[16:54:53] getting hungry now... these are great too: https://en.wikipedia.org/wiki/Jiaozi
[16:55:46] * andrewbogott ate dumplings in actual Xian many years ago, but not soup dumplings; the gimmick was that they brought them one at a time and every dumpling was a different shape and/or color
[16:58:05] Oh, I have pictures of the Xian dumplings! https://bogott.net/unspecified/?p=1315 Wow, phone cameras really were not so good back then
[17:00:01] heh. My digital camera shots from 22 years ago also suck. It was the consumer-grade sensors generally, not just phones.
[17:03:58] ohh, they definitely look very nice even at low resolution :)
[17:14:20] taavi: when you wrote https://gerrit.wikimedia.org/r/c/operations/puppet/+/974175 did you spend any time wondering "how did this ever work?"
[17:14:56] I'm 90% convinced that that patch is unrelated to what I'm seeing now, but I'm a bit worried that galera is secretly still using 3306 for something
[17:15:21] andrewbogott: I just assumed I'd broken something when moving things behind cloudlb. it's not entirely clear to me if that rule is even needed in the first place
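Regarding the "galera is secretly still using 3306" worry above: a quick, hypothetical sketch for checking which of the usual MariaDB/Galera ports actually answer on a host. The hostname is a placeholder and the port list is the standard Galera set, not anything confirmed from the patch under discussion.

```python
# Hypothetical sketch: probe the standard MariaDB/Galera ports on a host to see
# which are reachable (e.g. through the firewall rules being discussed).
# 3306 = mysql client, 4567 = galera replication, 4568 = IST, 4444 = SST.
import socket

HOST = "cloudcontrol2004-dev.codfw.wmnet"  # placeholder host
PORTS = {3306: "mysql client", 4567: "galera replication", 4568: "IST", 4444: "SST"}

for port, label in sorted(PORTS.items()):
    try:
        with socket.create_connection((HOST, port), timeout=2):
            print(f"{port} ({label}): reachable")
    except OSError as err:
        print(f"{port} ({label}): not reachable ({err})")
```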
[17:16:25] when did you move things behind cloudlb? Is that https://gerrit.wikimedia.org/r/c/operations/puppet/+/971241 ?
[17:16:51] I guess that sort of fits with when things stopped working
[17:18:18] no, https://gerrit.wikimedia.org/r/c/operations/puppet/+/971211
[17:18:55] hm, pretty sure things were still working after the 2nd
[17:19:12] the patch you linked was a no-op
[17:19:31] yeah
[18:29:35] jbond: is it fine if I try migrating a role or two in codfw1dev to puppet 7?
[18:37:00] taavi: sure thing, I have already migrated db, control and backups
[18:37:04] see the tracking task
[18:37:12] T349619
[18:37:13] T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619
[18:37:36] taavi: what might be more useful is to do some canary hosts from the other roles
[18:37:47] I'm happy to do the rest of the codfw1dev ones
[18:38:13] once all of them are done, i.e. canaries and codfw1dev, then I'd ask wmcs to test and then we can migrate the remainder
[18:40:45] jbond: sure, I can start from cloudgw for example
[18:41:08] taavi: sgtm, thanks
[19:06:40] taavi: andrewbogott: fyi I have migrated all the codfw1dev roles now; if you could give them a quick check at some point to make sure everything is all good, then I can move on to the eqiad ones
[19:18:35] * bd808 lunch
[20:33:42] jbond: do resource paths work differently in puppet 7? I see for example
[20:33:43] Error: /Stage[main]/Openstack::Cinder::Service::Antelope/File[/etc/cinder/resource_filters.json]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/openstack/antelope/cinder/resource_filters.json
[20:34:14] But modules/openstack/files/antelope/cinder/resource_filters.json exists
[20:35:12] andrewbogott: was it a one-off? if so it's probably related to T350809
[20:35:13] T350809: Sporadic puppet failures - https://phabricator.wikimedia.org/T350809
[20:35:26] jhathaway should have a fix today/tomorrow
[20:36:04] * andrewbogott tries again
[20:37:54] yeah, seems intermittent so I'll ignore for now
[20:38:46] yes, it's probably T350809 then. tl;dr: we get some errors if the puppet agent is running during puppet-merge
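On the resource-path question above: a puppet:///modules/<module>/<path> source URL is served from the module's files/ directory, so the error corresponds to modules/openstack/files/antelope/cinder/resource_filters.json, which is indeed the file that exists; the failure turned out to be the intermittent T350809 issue rather than a path change. A small illustrative helper (not part of any real tooling) showing that mapping:

```python
# Illustrative helper: map a puppet:/// file source URL onto the repository layout,
# i.e. puppet:///modules/<module>/<path>  ->  modules/<module>/files/<path>
def puppet_source_to_repo_path(source: str) -> str:
    prefix = "puppet:///modules/"
    if not source.startswith(prefix):
        raise ValueError(f"unsupported source URL: {source}")
    module, _, rest = source[len(prefix):].partition("/")
    return f"modules/{module}/files/{rest}"


print(puppet_source_to_repo_path(
    "puppet:///modules/openstack/antelope/cinder/resource_filters.json"
))
# prints: modules/openstack/files/antelope/cinder/resource_filters.json
```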