[00:34:20] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[00:34:46] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[00:37:31] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye
[00:44:48] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL**...
[00:45:42] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye
[00:50:16] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL**...
[00:50:28] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye
[01:03:02] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:10:15] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**...
[01:10:32] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:15:55] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**...
[01:16:05] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:20:51] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**...
[01:21:46] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:27:23] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye completed: - cp1112 (**PASS**) - Remov...
[01:27:45] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[01:58:45] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye completed: - cp1114 (**PASS**) - Remov...
[02:01:48] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[09:11:59] XioNoX: if you want we can do a bit of troubleshooting for the PXE issue
[09:12:05] (just read your comment)
[09:12:19] fabfur: cool, yeah, give me 5min
[09:12:26] np
[09:31:11] fabfur: alright
[09:31:22] hey
[09:31:41] fabfur: which device are we working on?
[09:33:16] yesterday vola.ns tried to reimage cp1108, it worked after 2 cookbook relaunches
[09:34:23] and with a tcpdump on the install server he didn't catch any request from the new server
[09:34:30] until the very last time
[09:34:31] fabfur: is there a server currently showing signs of issues?
[09:34:45] or should we try cp1108 again?
[09:34:53] the servers we're currently reimaging are cp1102-1115
[09:35:02] we can try to reimage any of those
[09:35:17] usually they fail until the second or the third try
[09:36:06] I would go with 1109 or 1110
[09:36:45] cool, let's do that
[09:37:03] I can launch the cookbook when you're ready
[09:37:22] give me 5 real minutes this time :)
[09:37:32] no prob, I'll prepare everything
[09:37:57] fabfur: the server is currently working fine, right?
[09:38:09] it's "just" a ""regular"" re-image?
[09:38:45] yes, once reinstalled the servers work as expected
[09:42:30] fabfur: when was it re-imaged for the last time?
[09:43:07] cp1109 has an uptime of about 9 days
[09:43:19] I can be more precise, just a sec
[09:44:37] 31/10/23 22:19:48
[09:45:13] last reboot was the last re-image?
[09:45:26] nic firmware is up to date too
[09:45:36] yes, those hosts haven't been rebooted since the reimage
[09:45:47] ok so quite recent
[09:50:56] fabfur: alright you can go for it anytime now
[09:51:17] ready
[09:52:17] fabfur: are you on the server's console too?
[09:52:28] opening just now on cumin2002
[09:53:10] on cumin?
[09:54:04] yes
[09:54:25] cookbook launched
[09:54:39] not sure the console is reachable from the cumin hosts
[09:54:40] server is rebooting
[09:54:58] I usually use it from cumin1001 or cumin2002
[09:56:11] I can re-launch it in screen so we can follow along
[09:57:22] I connected directly through the idrac
[09:57:28] ok
[09:58:33] so, usually the cookbook blocks at this point
[09:58:47] yeah switch port is up
[09:58:50] (we've waited until timeout)
[09:58:52] but not receiving any traffic
[09:59:10] switch is not even learning the MAC
[09:59:21] now, relaunching the cookbook (if we're lucky just one more time) should go on
[09:59:45] if it's ok for you I can interrupt and relaunch the cookbook
[09:59:56] I'm wondering if it's not trying to PXE from the wrong port
[10:00:26] the one I tried yesterday had the correct one, also if it was the wrong one it would never boot
[10:00:31] not boo every X attempt
[10:00:34] *boot
[10:01:03] or if it chooses the port randomly...
[10:01:17] anyway I can stop the cookbook and relaunch
[10:01:19] ignoring the config? :D
[10:01:26] so from the switch side, the server doesn't want to talk
[10:01:38] there is outbound traffic but nothing inbound
[10:01:52] fabfur: yep go for it
[10:02:00] ok stopped
[10:02:26] and launched again now
[10:04:20] rebooting...
[10:04:40] if we're lucky this time should go on
[10:05:01] if we're lucky it won't work so we can reproduce the issue :)
[10:05:42] it happened in the past that we had to re-launch the cookbook even 3 or 4 times
[10:05:49] but usually 2 is enough
[10:06:17] same issue?
[10:06:20] ok, we're stuck again at the same point
[10:06:23] yeah
[10:06:27] let's wait some sec
[10:06:51] nope, definitely stuck
[10:07:00] let me try to bounce the interface
[10:07:39] I've also set up a `tcpdump` but I imagine nothing was received from the host
[10:08:27] do you want me to stop the cookbook?
[10:09:21] fabfur: yeah go for it
[10:09:31] done
[10:09:37] third time?
[10:10:14] yup
[10:10:18] ack
[10:10:35] started
[10:14:30] mmmm
[10:14:34] ...
[10:14:36] stuck again
[10:14:52] that's strange, usually at this point it should proceed
[10:15:33] is it possible to double check the PXE settings?
[10:16:25] I think yes, here's from the cookbook logs
[10:16:27] https://www.irccloud.com/pastebin/yFEVNyRg/
[10:16:44] I mean in the bios setting
[10:16:54] but I'm not an expert there but I can have a look
[10:17:09] oops sorry relaunched the cookbook another time :(
[10:17:32] (E_NOTENOUGH_COFFEE)
[10:18:20] it's ok, I'll try to catch it and go in the bios
[10:18:40] you can just check it with redfish
[10:19:54] is there doc?
[10:20:12] or run the provision cookbook that will tell you it's all ok
[10:20:37] ok now it's going on
[10:20:45] did you do something XioNoX ?
[10:20:47] yeah...
[10:21:10] fabfur: no, just looked at it with more threatening eyes
[10:21:11] 'LegacyBootProto': 'PXE', on NIC.Integrated.1-1-1
[10:21:14] :)
[10:21:38] so yeah the issue is that it doesn't even try to boot the dhcp client
[10:21:41] and "hangs"
[10:21:43] PXE settings must be correct otherwise it would never boot
[10:21:45] 'LegacyBootProto': 'NONE', on all the others
[10:22:01] but as I said it would not work at all if it was not correct
[10:22:30] this time it took 3 or 4 cookbook runs?
[10:22:55] 4
[10:23:04] on the 4th it succeeded
[10:23:29] usually it takes 2 but we've observed the same behavior (with 3 or 4) also in the past
[10:35:38] volans|off: how did you get that info? 'LegacyBootProto'
[10:36:01] maybe we can check if the other cp servers have the same issue
[10:36:13] and change it for the next one to see
[10:36:38] maybe the idrac got upgraded to a too recent version
[10:37:45] via redfish, from a spicerack shell, why?
[10:38:48] why what?
[10:39:54] what do you want to know :D
[10:40:08] a magician never reveals his tricks :-P
[10:43:10] afk for a moment
[10:43:10] maybe we can check if the other cp servers have the same issue, and change it for the next one to see. maybe the idrac got upgraded to a too recent version
[10:43:14] volans|off: ^ :)
[10:43:26] which issue?
[10:43:27] I'm lost
[10:43:46] LegacyBootProto configured on the faulty server, but not on the working one maybe?
[10:44:10] I didn't change anything, I was confirming it was correctly set
[10:44:35] and AFAIUI sukhbir checked all of them
[10:44:44] ok, that's what I didn't get
[10:45:11] we can check them all again, but if they are misconfigured they should never boot via pxe
[10:45:23] not randomly doing so
[10:46:12] next is maybe compare the idrac/pxe versions
[10:46:31] for the above one (cp1109) I checked that PXE was set only on one interface and is the first one on the external card
[10:46:52] that usually is the one with the cable for those hosts :d
[10:49:42] XioNoX: I can recall suk.he checked for the NIC firmware version and they are all the same
[10:50:23] not the same as you asked but it's something
[10:50:34] yeah I checked the NIC too
[10:50:45] but haven't checked the idrac
[10:51:08] is it something we can do with redfish?
[10:51:26] yes
[10:51:35] https://doc.wikimedia.org/spicerack/master/api/spicerack.redfish.html#spicerack.redfish.Redfish.bios_version
[10:52:25] bios==idrac?
[10:52:34] >>> r.bios_version
[10:52:34]
[10:52:34] >>> r.firmware_version
[10:52:34]
[10:57:04] the nic: 'FirmwarePackageVersion': '21.85.21.92'
[10:57:38] anything else? :)
[10:59:41] I have the pcap file from the install server but don't know if it can be useful at all
[11:00:41] nah
[11:00:41] when I tried there was no packet at all for the MAC
[11:00:52] when it was not working
[11:01:00] there was no packet at all sent from the server
[11:01:17] so there is a bug on the NIC/PXE/iDRAC
[11:02:55] in the task it was mentioned that cp4052 was also affected, that one has:
[11:03:14] , , 21.85.21.92
[11:03:26] idrac, bios, nic
[11:03:29] yes
[11:05:01] so there is absolutely no common factor between the two
[11:05:05] hmmm...
[11:07:30] nice
[11:07:35] *nic
[11:08:10] same nic, same firmware
[11:09:03] yeah but that NIC version is used everywhere
[11:10:48] there are 4 newer versions for that firmware, were all discarded by dcops because of other issues?
[11:11:03] they are all marked optional on dell
[11:12:45] well, we have to reimage those servers because they need to be rotated into production soon, it could happen to other servers too maybe
[11:13:12] afk a bit
[11:13:52] volans|off: dunno, afaik the first 22 release was tested extensively (and didn't work).
[11:14:15] dunno for the others, maybe some servers came with them and didn't work so it had to be rolled back
[11:15:02] last one is from Feb. 2023, so possible
[11:15:28] we will have to adapt our PXE at some point too :)
[11:35:04] XioNoX: thanks for helping debug this! as another data point, we observed this was not confined to only eqiad but to ulsfo as well
[11:35:24] old(er) hardware than eqiad and where we had no issues PXE booting before. the host in question was cp4052
[11:39:59] I have to reimage cp1115 now, if you want to troubleshoot I think it will have the same issue
[11:41:14] sukhe: yeah we discussed it above too
[11:41:23] so it's clearly puzzling
[11:42:00] not sure what else can be looked at or tinkered with so far to fix it
[11:42:06] just to add my 2 cents: dc-ops first reimaged those hosts (with the `insetup` puppet role, but this shouldn't matter at all) and they didn't notice anything about it
[11:42:37] yeah it's so low level that I don't see how it could be related
[11:44:03] I can't really understand which kind of bug shows just for the first 2-3 times
[11:44:19] going to reimage cp1115
[11:44:41] to me it seems like an issue in the PXE software, somehow it doesn't try to send the dhcp requests, but why?!
[11:46:13] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[11:50:47] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[11:50:54] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[11:51:07] second try on cp1115... sigh...
[11:56:32] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[11:56:45] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[11:58:44] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[11:59:21] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[12:04:05] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[12:04:20] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[12:08:47] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[12:08:55] fyi cp1115 is at the 5th failed reimage
[12:09:20] could be useful for troubleshooting?
[12:12:30] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[12:28:21] Traffic, GitLab (Project Migration), Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (LSobanski) @BCornwall For operations/software/varnish, looks like it should just be archived and not migrated? Let me know if that's the case.
[12:59:37] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye completed: - cp1115 (**PASS**) - Remo...
[14:23:58] Traffic, Data-Engineering, SRE: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (lbowmaker)