[09:39:58] klausman: OK to merge on puppet? :)
[09:40:03] yes
[09:40:07] letsgo!
[09:53:48] marostegui Amir1: as a heads up, the cronjob for ipoid's daily updates should start running at 11:08 UTC, which means you'll see some increased DB activity on m5 shortly after. The previous import was on Friday and afaik didn't cause issues. This time there will be UPDATE and DELETE queries involved.
[09:54:09] Thanks kostajh I just asked: https://phabricator.wikimedia.org/T305114#9329638
[10:23:38] Hi Amir1! I was circling back to codesearch, and couldn't find data-engineering/airflow-dags in the repo list. Have you redeployed it since the CR was merged? Thanks
[11:39:44] brouberol: it should be deployed automatically, let me check
[11:58:12] Thanks
[14:08:28] anything I can try if a reimage gets stuck waiting for reboot? the com2 console is completely blank
[14:09:52] racadm serveraction powercycle (or just serveraction powercycle in some newer cases IIRC) to force a reboot
[14:10:20] thanks
[14:10:31] it could be shutting down
[14:10:36] depending on the host
[14:10:49] have you checked if you can still ssh?
[14:11:14] ssh does not work
[14:11:17] the host is cloudvirt1046
[14:12:10] the reimage cookbook did the first reboot and started the debian installer, apparently
[14:12:59] then it's been "waiting for reboot" for about 1 hour
[14:13:23] actually 30 minutes
[14:20:41] if the console is blank, it can be worth hitting a couple of keys to see if that wakes it up
[14:21:14] I tried but it remains blank :)
[14:21:26] I've just tried "serveraction powercycle", still blank
[14:25:37] maybe similar to T351171
[14:25:48] T351171: cloudvirt1043 + cloudvirt1044 reimage failures - https://phabricator.wikimedia.org/T351171
[14:26:31] but you should see that
[14:26:48] if you want I can try via redfish, not that it would change that much
[14:27:28] have you checked the hw logs in the idrac to see if the initial reboot action was logged?
[14:30:21] why is bast3007 missing from https://config-master.wikimedia.org/known_hosts?
[14:31:33] nothing relevant-looking in SAL
[14:33:01] volans: checking hw logs
[14:36:19] getraclog shows "System CPU resetting" at 13:15 UTC (cookbook), then again at 14:20 UTC (my powercycle request)
[14:38:03] taavi: i see it in there
[14:38:10] the reimage cookbook output is here https://phabricator.wikimedia.org/P53419
[14:38:39] hmm, it's back now
[14:43:07] sorry, there were some missing lines in the Paste, now I fixed it
[14:43:34] the first reboot attempt by the cookbook is successful: "Found reboot ... Host up (Debian installer)"
[14:44:09] then it goes again into "wait_reboot_since" and hangs... but the raclog shows only 1 reboot
[14:47:34] hello on-callers, I am rolling out some changes to change prop and change prop job queues, everything should lead to a no-op but if you see anything weird ping me
[14:47:50] (for example, excessive backlog in some job queue etc.)
[14:49:04] dhinus: the second reboot is performed by the debian installer after the install completes
[14:49:10] the cookbook just polls for it
[14:49:44] elukey: thanks for the heads up (cc fabfur)
[14:49:59] cc brett cwhite ^^
[14:52:09] thanks elukey
[14:55:25] dhinus, volans: could be that we're hitting the PXE "bug"?
[14:55:58] no because it did reboot into d-i
[14:56:57] at the first try /
[14:56:59] ?
[14:57:04] very lucky indeed..
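For reference, the racadm commands mentioned above are run from an ssh session to the host's iDRAC. A minimal sketch of that kind of stuck-reimage check (exact syntax varies by iDRAC generation, and powerstatus/getsel are additions here, not taken from the log):

    racadm serveraction powerstatus   # does the iDRAC think the host is powered on?
    racadm serveraction powercycle    # hard reset when the serial console stays blank
    racadm getraclog                  # iDRAC log, where entries like "System CPU resetting" show up
    racadm getsel                     # System Event Log, to confirm the reboot was actually requested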
[14:57:21] fabfur: :P
[15:00:41] that's how it usually works :-P
[15:03:04] sukhe: I was thinking of deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/973782/ - I'll disable puppet on P:bird::anycast and then slowly start re-enabling and running on WMCS hosts first and then moving to the rest, ok?
[15:03:18] taavi: thanks, that sounds good
[15:03:24] -s30 -b1 just to be really sure
[15:03:34] at least on dns*
[15:04:04] sure
[15:06:11] thanks
[15:09:33] ran puppet on cloudlb2*-dev, no bgp or bfd flaps as expected. now doing the rest of the cloud* hosts
[15:09:46] cool!
[15:16:22] ok, live on cloud* hosts with no issues afaics. I'll do durum* next, ok?
[15:17:06] dhinus: which OS? first time installing that OS on those hosts? might be firmware
[15:17:49] yeah thanks
[15:17:56] taavi: yeah thanks, keeping an eye here
[15:20:06] <_joe_> sooo... serviceops will need to convert mediawiki appservers to kubernetes nodes progressively as we move traffic to mw on k8s
[15:20:07] volans: reimaging from bullseye to bookworm
[15:20:23] <_joe_> but the procedure to rename a host while reimaging is... way too complex
[15:20:30] first time running bookworm on that host so yeah could be firmware
[15:20:32] <_joe_> https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging
[15:20:47] <_joe_> so I don't think it's a viable amount of work for us
[15:20:53] <_joe_> we'll just keep the old names
[15:21:25] <_joe_> volans: is there any chance that procedure can be simplified? it's definitely not good for mass renaming
[15:21:37] <_joe_> especially if you keep in mind we'll need to do this in small batches
[15:22:12] it surely can, but ofc we can't simplify the physical re-labelling of hosts :D
[15:22:23] <_joe_> that can happen async tbh
[15:22:27] <_joe_> I don't really care
[15:22:36] <_joe_> but yeah, I think we're keeping the current names
[15:22:39] sure, I'm saying it's still the same amount of work for dcops
[15:22:49] <_joe_> after all it's just a label and the amount of work isn't justifiable
[15:23:02] <_joe_> I have to admit it's disappointing, renaming a host used to be simpler.
[15:23:14] keeping the same names seems a bit dangerous to me
[15:23:19] <_joe_> why?
[15:23:34] people running things on kubernetes* or mw* thinking they're doing the right thing
[15:23:45] instead of using the proper alias
[15:23:46] <_joe_> well "people" will learn NOT to do that
[15:23:59] <_joe_> it's in general not a good idea except in maybe one case
[15:24:08] <_joe_> when you're changing something in mediawiki::common
[15:24:26] <_joe_> I don't really see that as a substantial risk
[15:24:36] <_joe_> I'm a bit worried about people getting confused OTOH
[15:24:45] <_joe_> but tbh not worth the effort
[15:24:54] <_joe_> it's more than 500 servers
[15:24:57] the hosts will not change location right?
[15:25:08] agreed, operating on hostname globbing will also cause a lot of issues elsewhere, I don't think it's a real concern
[15:25:10] <_joe_> if we have to spend half a day to reimage one
[15:25:20] <_joe_> it's not viable
[15:25:27] <_joe_> I can't justify the effort.
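For context on the -s30 -b1 flags above: they are cumin's batch-sleep and batch-size options, used here to re-enable and run puppet one host at a time. A hypothetical sketch of that kind of staged run (the host query and the enable-puppet / run-puppet-agent helpers are assumptions about the local tooling, not commands taken from the log):

    # hypothetical: re-enable and run puppet on the WMCS hosts first, one at a time, 30s apart
    sudo cumin -b1 -s30 'cloud*' 'enable-puppet "bird anycast rollout"'
    sudo cumin -b1 -s30 'cloud*' 'run-puppet-agent'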
[15:25:31] <_joe_> volans: nope, none
[15:25:34] the clear thing is that if we do rename them we need to simplify the procedure, period
[15:25:36] just stick with existing mw names, they will grow out over time
[15:26:41] <_joe_> yep
[15:30:54] _joe_: that said this *is* the occasion to get traction on that and streamline the rename as part of a normal reimage, having the cookbook do the missing bits
[15:31:10] if we don't do it now, we'll probably never do
[15:31:12] it
[15:33:12] I'm pretty sure there'll be future opportunities...
[15:33:13] <_joe_> oncall: we're moving traffic from mobileapps to mw on k8s
[15:33:22] <_joe_> there will be a big surge in requests, it might page
[15:33:29] <_joe_> let me know if it happens
[15:33:31] <_joe_> :)
[15:33:36] ok tnx
[15:52:44] <_joe_> basically any alert on mw-api-int, ping me
[15:52:54] taavi: going fine?
[15:53:55] sukhe: yeah, almost done with doh*, that leaves just dns* remaining
[15:54:06] cool
[16:24:03] sukhe: patch is live everywhere
[16:24:33] taavi: thanks for deploying it!
[16:26:08] renaming> I'm going to want to rename a bunch of moss-* nodes (but they're not in current-prod), is it going to be Real Pain? [we've decided that the service should not be called moss]
[16:35:48] Emperor: what does current-prod mean? are they already racked and labelled?
[16:39:17] I mean that the nodes I want to rename aren't in service; they are racked and labelled, however
[16:40:40] ok, so to clarify, the documented procedure linked above is the general-case one, which accounts for all possible cases (rename + renumbering + relocation)
[16:42:25] the simplified case of a rename in place, without renumbering and without relocation, can be achieved with a subset of those actions and without the decommissioning, but as the rule was that we usually don't rename, no standardization/automation effort has been made in that direction
[16:42:47] if suddenly we need to rename a lot of hosts, things change and we should prioritize that work IMHO
[16:44:18] FWIW, there are about 10 moss nodes that need renaming; they don't need relocating or renumbering
[16:45:12] Emperor: timeline?
[16:47:14] they should be renamed for a KR for this quarter
[16:47:41] (but that can run into the no-prod-changes window from my POV as they're not in production yet)
[16:49:27] Emperor: if you're willing to be a beta tester I can probably draft a simplified procedure and then fine-tune it with the 10 hosts, that would also simplify the steps to automate the missing parts
[16:50:59] my current idea for the simplified one would be to power off the host, do the changes (netbox, dns, hiera, switch, idrac) and then start the actual reimage from the host powered off, so it will reboot directly into PXE
[16:52:02] so that can potentially fit into the reimage cookbook later
[16:52:31] XioNoX: ^^^ if you have any immediate feedback by any chance :)
[16:53:52] volans: will beta in exchange for less toil, certainly :)
[16:56:43] great! I'll try to ping you next week, hopefully I can squeeze this in
[22:14:31] Is there anyone around that knows off the top of their head how you move between the consoles in d-i when attached to the serial console via a drac?
[22:16:19] Oh, it looks like it uses screen keybindings
[22:19:33] Y, I think the Red Hat installer also uses screen or tmux
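On the last question: when d-i runs on a serial console it multiplexes its virtual consoles through GNU screen, so the default screen bindings move between them. A quick reference, assuming unmodified defaults (exactly which window number holds the shell or syslog can vary):

    Ctrl-a n    next window (console)
    Ctrl-a p    previous window
    Ctrl-a 1    jump to window 1 (the main installer); other numbers for the shell/log windows
    Ctrl-a "    show the window list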