[07:04:21] morning
[07:05:02] morning!
[07:40:20] quick review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/962355/?
[07:43:12] done :)
[07:43:46] thx
[08:00:53] morning
[08:05:02] arturo: I'm trying to install the OS on cloudcontrol1006 but it's not booting into d-i. any clues where to start looking?
[08:05:20] I'm logged into the management console and it just shows a "boot manager" screen
[08:21:46] derp. it seems like I forgot to run the sre.network.configure-switch-interfaces cookbook
[08:22:21] yeah that did it
[08:26:33] :-)
[08:27:13] remember to also allocate the .private address
[08:27:25] I did remember to do that!
[08:30:34] if that step is not in the docs please add it for next time!
[08:32:44] fwiw without that cookbook the switch port is down, so the server is going to fail the DHCP part of PXEboot
[08:34:13] the virtual serial output never shows much either - there is a round-about way to reach the virtual VGA port which gives more of a clue about what's happening
[08:48:12] fyi. I've started draining the ceph nodes for rack D5, it should not have an impact (doing 2 osd daemons at a time), but let me know if you see issues
[08:49:03] dcaro: thanks a lot!!
[08:49:20] Take your time with it and let's keep things stable
[08:49:49] I'm wondering, for when that is done, what our plan is for the switch upgrade?
[08:49:59] We have the following 'cloudvirts' in the rack:
[08:50:03] https://www.irccloud.com/pastebin/cmWNMZOs/
[08:50:21] Should we try to migrate instances running on those to other hosts?
[08:51:47] The other hosts in the rack I believe we should be ok for, they will fail over.
[08:52:09] 1 instance each of cloudbackup, cloudcontrol, cloudgw, cloudnet and cloudservices
[08:55:04] ideally we would drain them, at least any sensitive instances on them
[08:59:26] that's 15 out of 35 hypervisors (3/7 of the fleet), not sure we can drain the whole of it
[08:59:53] we might want to shuffle some of them around first, arturo what was the plan the first time around? just a downtime?
[09:03:41] taavi: can you give a look at https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/24 ? it's blocking a couple other things
[09:03:45] mmm
[09:03:53] dcaro: I don't remember
[09:05:15] maybe we can declare that the hypervisors would survive the switch operation. After all, the data lives in ceph. It's not like they will get corrupted or something
[09:05:20] +1 to relocate sensitive VMs though
[09:05:26] (for example, any tools)
[09:06:38] topranks: the switch reboot was only a few seconds right?
[09:07:09] (best case scenario of course)
[09:07:11] dcaro: is there a specific feature you need? I'm happy with the first three commits in that MR, but have doubts if the last one (as_json support) is the right approach
[09:07:35] dcaro: no it's more like 20-30 minutes
[09:07:39] taavi: the streaming api itself is needed for the `toolforge build logs -f` feature
[09:07:54] topranks: ok, okok, then it's not going to be a short network blip xd
[09:08:22] nah it's a full upgrade, it flashes firmware and stuff on the reboot so it takes a few mins
[09:08:42] ack
[09:17:06] dcaro: ok, let me have a look
[09:18:32] taavi: note my comment about avoiding the overloads and extra code and just having a single stream method
[09:19:18] that seems like a smart approach to me
[09:36:34] dcaro: still need to test this, but wdyt about something like https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/merge_requests/34?
[09:38:45] taavi: looks simple :), nice
[09:40:49] hmm, what about making the api simpler, and in the logs function use get or stream based on the follow parameter? (instead of pushing that logic to the stream function)
[09:41:33] that way, if you want to stream, you use stream, and if you do not, you use a simple get (instead of having the option of using stream too)
[09:58:01] dcaro: hmmm
[09:58:53] I feel like this kind of optional streaming is a fairly common pattern, so IMO it makes sense to provide some helper for it instead of making every caller implement the two options
[10:00:08] it makes no sense to call a function called `stream` and not stream, no?
[10:01:01] `get_lines()`?
[10:02:41] hmm, not sure, what makes me feel uneasy is get returning json, but this one not
[10:03:49] (and calling both `get`)
[10:03:53] or get*
[10:04:15] xd
[10:05:02] get_raw_lines?
[10:05:17] works for me too
[10:05:49] I don't see a use for the non-stream version there though
[10:06:56] `toolforge webservice logs` only does streaming if you pass `--follow`
[10:09:46] hmm, wouldn't it work anyhow?
[10:10:37] (as in, the streaming is for the client side only, to be able to start reading the response before getting the whole body, but on the server side, doing the follow or not is a parameter to the api)
[10:11:45] * dcaro lunch
[10:11:52] brb
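(Editor's note: a minimal sketch of what the single "stream or plain GET" helper discussed above could look like, built on requests. The name get_raw_lines and the follow parameter come from the conversation; the signature, URL handling, and query-string format are illustrative, not the actual toolforge-weld API.)

```python
# Editor's sketch only: one possible shape for the single "stream or plain GET"
# helper discussed above. get_raw_lines and the follow parameter come from the
# conversation; everything else is illustrative, not the real toolforge-weld code.
from typing import Iterator

import requests


def get_raw_lines(session: requests.Session, url: str, follow: bool = False) -> Iterator[str]:
    """Yield the response body line by line, as plain text (not JSON).

    With follow=True the request is opened with stream=True, so lines are
    yielded as the server writes them; otherwise the whole body is fetched
    first and then split into lines.
    """
    response = session.get(url, params={"follow": str(follow).lower()}, stream=follow)
    response.raise_for_status()
    if follow:
        # iter_lines() reads from the still-open connection, so the caller can
        # start printing output before the response is complete.
        for line in response.iter_lines(decode_unicode=True):
            if line:
                yield line
    else:
        yield from response.text.splitlines()


# Hypothetical usage from a "logs --follow"-style command:
#   for line in get_raw_lines(requests.Session(), "https://example.invalid/logs", follow=True):
#       print(line)
```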
[11:32:08] dhinus: what is the status of codfw1dev?
[11:32:52] I get:
[11:32:55] $ ssh bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org
[11:32:55] ssh: Could not resolve hostname bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org: Name or service not known
[11:43:08] I think my ISP may be caching the NXDOMAIN
[11:48:43] arturo: I'm trying to understand it myself :) most hosts have been reimaged, but I bet some things in openstack are not working correctly
[11:49:13] dhinus: this feels like a DNS mishap somewhere, it may not be related to openstack
[11:49:17] I already checked some of the basic bits
[11:49:19] designate is up
[11:49:27] the pdns DB contains data, et
[11:49:29] etc*
[11:52:48] dcaro: updated, is that what you had in mind?
[11:55:05] mmm
[11:55:25] what are these records?
[11:55:27] https://usercontent.irccloud-cdn.com/file/pTc18veO/image.png
[11:57:54] they go to k3s.wikifunctions.eqiad1.wikimedia.cloud.
[11:58:37] they 547 similar records
[11:58:46] they have*
[11:58:58] anyway
[12:02:04] dhinus: I checked the delegation, it should be fine
[12:02:17] https://www.irccloud.com/pastebin/TFtFe2Bj/
[12:03:00] bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org resolves to 185.15.57.2 for me
[12:04:34] it doesn't get resolved on cloudcontrols either
[12:04:50] arturo: those 547 records might be a Catalyst thing, kindrobot mentioned they used k3s
[12:06:23] I think that's how DUCT exposes its deployments
[12:06:50] ack
[12:08:52] ok I think I know where the problem is
[12:08:55] MariaDB [pdns]> select * from domains where name like 'bastion*';
[12:08:55] Empty set (0.001 sec)
[12:09:05] the pdns data wasn't correctly restored after the reimage cc dhinus
[12:09:30] and since both cloudservices2004-dev/2005-dev have been reimaged, I don't think we have a pdns DB now :-) we will need to restore from a backup
[12:09:43] or force openstack to recreate them somehow (unknown procedure for me)
[12:10:49] ouch
[12:11:20] maybe a.ndrew will have ideas when he comes online
[12:11:30] the mariadb wildcard is `%`, not `*`
[12:12:00] good catch, is the result looking better with %?
[12:12:22] yes
[12:12:26] nice
[12:12:57] https://phabricator.wikimedia.org/P52801
[12:13:46] ouch!
[12:13:54] I had created a phab ticket already
[12:14:05] T347856 --- marking as invalid
[12:14:05] T347856: codfw1dev: we lost the PDNS database content - https://phabricator.wikimedia.org/T347856
[12:15:42] ok, what looks different then is the 'master' field
[12:16:18] https://www.irccloud.com/pastebin/GkdkfToP/
[12:16:54] oh, I pasted the same thing 2 times, to make it all less confusing :-(
[12:17:04] in a nutshell: cloudservices2004-dev has 185.15.57.25:5354 185.15.57.26:5354 172.20.5.9:5354 172.20.5.8:5354
[12:17:20] cloudservices2005-dev has 172.20.5.6:5354 185.15.56.162:5354
[12:25:18] shouldn't data be synced automatically between 2004 and 2005?
[12:26:40] in an ideal world, sure! but apparently the setup doesn't support it
[12:31:45] can't get the hardware servers to resolve `bastioninfra-codfw1dev.codfw1dev.wmcloud.org` :-(
[12:31:48] not sure what else to test
[12:35:13] wait, it works, no?
[12:35:26] https://www.irccloud.com/pastebin/SMw8RlYY/
[12:35:33] a bit weird though
[12:37:13] what does dig say? where is it resolving from?
[12:39:26] dig is also giving confusing answers LOL, depending on which server you ask
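(Editor's note: a minimal sketch of one way to compare the answers each DNS server gives for the same name, equivalent to the dig checks against different @server values mentioned above. It assumes the dnspython package is installed; the nameserver IPs are placeholders, not the real codfw1dev addresses.)

```python
# Editor's sketch only: query each server directly for the same name, roughly
# what was being done above with dig and different @server values.
# Assumes dnspython is installed; the IPs below are placeholders.
import dns.exception
import dns.resolver

NAME = "bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org"
SERVERS = {
    "cloudservices2004-dev": "198.51.100.4",  # placeholder IP
    "cloudservices2005-dev": "198.51.100.5",  # placeholder IP
}

for label, ip in SERVERS.items():
    resolver = dns.resolver.Resolver(configure=False)  # don't read /etc/resolv.conf
    resolver.nameservers = [ip]
    resolver.lifetime = 5  # seconds before giving up
    try:
        answer = resolver.resolve(NAME, "A")
        print(label, [rr.to_text() for rr in answer])
    except dns.exception.DNSException as exc:
        print(label, "lookup failed:", exc)
```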
[12:52:10] * andrewbogott excited to spend another day on DNS
[12:52:56] I can fix the master records there, not sure if that's the actual cause of whatever it is you're seeing though
[12:53:34] btw the master records should NOT be the same between the two hosts, thanks to the fact that when a cloudservices node routes to itself it uses a different route than other hosts :(
[12:55:42] T347861
[12:55:43] T347861: [codfw1dev] DNS fails to resolve some addresses - https://phabricator.wikimedia.org/T347861
[12:55:45] * arturo back later
[12:59:20] andrewbogott: have you changed something? it looks better now (at least from my laptop)
[12:59:32] I have not, although I'm about to
[12:59:43] dig consistently resolves that hostname now
[12:59:57] it used to give different answers based on the @server I was using
[13:00:15] ok, there, I updated the master records
[13:00:25] which I would not expect to affect dig
[13:01:10] I wonder what fixed the issue (assuming it's really fixed)
[13:02:04] seems like cache expiration, although I don't know why that would've changed
[13:02:10] unless the ttl was very very long
[13:07:07] dhinus: you're unstuck for now?
[13:09:10] I think so, I'm trying to figure out what else needs to be upgraded (if any)
[13:09:29] I also have a problem with the package wmfbackups, but that's not a hard blocker (see T347740)
[13:09:30] T347740: wmfbackups packages for Debian Bookworm - https://phabricator.wikimedia.org/T347740
[13:16:01] I've just noticed that cloudlb hosts are missing the cloud_cumin key, but they should have it I think
[13:20:41] seems like, unless we consider them network nodes and not control plane
[13:23:12] arturo: yup, those are DUCT records. We haven't been cleaning them up (sorry). Though we could if it's causing too much noise/pollution
[13:23:44] kindrobot: please do clean up unused records
[13:25:22] andrewbogott: I think it's an easy fix (they're missing profile::base::cloud_production), I'll create a patch
[13:25:39] OK. I'll clean them up manually and put something on the backlog to clean them up automatically
[13:57:09] taavi: there's some alerts about keystone keys sync that started this morning, I deemed them to be from the cloudcontrol moving back to service, are you working on them? (I'll investigate if they should not be there)
[13:57:28] dcaro: yep those are me, I'll deal with them
[13:57:42] okok, no rush
[14:00:39] dhinus: what did you do to fix the DNS?
[14:02:32] arturo: nothing at all :D
[14:02:44] andrewbogott was suggesting maybe some cache expired?
[14:03:14] just a guess. The fullstack test still isn't working properly so something is amiss
[15:36:03] taavi: I'm getting "Error: Unable to delete proxy: aw-890514.wmcloud.org" when I try to delete that (or other) proxies
[16:00:53] * arturo offline
[16:13:10] kindrobot: using the horizon UI?
[16:13:13] (openstack)
[16:19:54] I filed T347883, will have a look tomorrow
[16:19:54] T347883: Web proxy deletion fails - https://phabricator.wikimedia.org/T347883
[17:04:30] * dcaro off
[17:16:46] dhinus: dns looks good to me in codfw1dev now. I just restarted some things.
[17:20:17] thanks, I'm trying to run tf-infra-test (have you tried already?)
[17:20:34] and the network tests as well
[17:26:50] they mostly seem fine, now I'm waiting for the Magnum test to complete
[17:27:06] the network tests found an error for which a.rturo already created a task: T347880
[17:27:07] T347880: codfw1dev: git tree out of sync - https://phabricator.wikimedia.org/T347880
[17:27:27] I can fix that tomorrow but feel free to claim it today if you have time!
[17:36:15] dcaro: yes
[17:37:56] andrewbogott: all tests looking good except that one!
[17:38:05] (and except the Postgres one that was broken already)
[17:38:22] good news!
[17:44:37] * dhinus off