[09:25:01] ah, good, rclone upstream took my (docs) patch
[09:57:32] Nice!
[10:00:30] Bother, ms-be1057 won't do remote ipmitool.
[10:01:10] I've tried all the steps (short of a management card reset, since I think it might have been that that bricked ms-be1059) at https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands
[10:01:29] so I'm going to leave it for now. Cursed eqiad swift hardware :(
[10:02:13] [because if a card reset bricks it, that blocks the entire upgrade for 2 weeks]
[10:09:17] oh for crying out loud, ms-be1058's IPMI won't talk either.
[10:16:16] volans: are you around? I've reached the next Dell on the upgrade hitlist, so now would be a natural time to try your new disk-fettling cookbook
[10:16:32] Emperor: hey, sure, I'm around
[10:17:01] have you checked ipmi works first? :D
[10:17:11] as it happens, I have ;p
[10:18:42] I'll try "sudo cookbook sre.swift.convert-ssds ms-be1060"
[10:19:12] SGTM, if you want me to run it that's also ok
[10:19:18] it's the first time, might have some stupid bug
[10:19:22] sadness
[10:19:41] btw, fwiw, ms-be1057.mgmt.eqiad.wmnet responds to Redfish
[10:19:54] volans: it finds nothing to do - want to run it yourself, or shall I make a paste?
[10:20:03] let me run it
[10:21:02] 1057> yeah, I can ssh into the mgmt just fine, and all the run-from-the-host-OS commands work OK, but 'sudo ipmitool -I lanplus -H "ms-be1057.mgmt.eqiad.wmnet" -U root -E chassis power status' fails with 'Error: Unable to establish IPMI v2 / RMCP+ session', so e.g. the reimage cookbook can't work
[10:21:32] Emperor: ok, and why not restart the bmc? I gather you've already tried the other troubleshooting commands
[10:21:36] (same failure for 1058) I'd try a reset of the management card, but I don't want another bricked system right now, so I think I have to leave them 'til I've upgraded everything else
[10:21:49] for ms-be1060
[10:21:52] Skipping non virtual drive Disk.Bay.24:Enclosure.Internal.0-2:RAID.Integrated.1-1
[10:21:55] Skipping non virtual drive Disk.Bay.25:Enclosure.Internal.0-2:RAID.Integrated.1-1
[10:22:06] it looks like it was already converted... is that possible?
[10:22:07] volans: ms-be1059 is a brick, and that happened shortly after a previous reset of the management card, so now I'm paranoid
[10:22:16] got it
[10:22:28] bricks are useful to build houses :-P
[10:23:10] volans: sda and sdb both appear as still-rotational
[10:23:24] mmmh, uptime is 453 days
[10:23:28] ok, let me dig
[10:23:50] OK, thanks
[10:24:31] [I've not looked at the web-IPMI view]
[10:28:48] ok, it's a stupid bug
[10:28:50] fixing it
[10:28:53] give me 10
[10:28:57] Thank you :)
[10:50:53] Emperor: so basically I was wrongly checking the Virtual name on the Drive's "fqdd" instead of its linked volume. Patch incoming.
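A minimal sketch of the reachability check being done by hand above - remote IPMI first, then the Redfish fallback that ms-be1057 still answers - assuming ipmitool and curl are available, the root management password is exported as IPMI_PASSWORD (which ipmitool's -E flag reads), and the .mgmt.eqiad.wmnet hostnames from the log:

    #!/bin/bash
    # Probe remote IPMI (RMCP+) and the Redfish API for a list of management hosts.
    set -u
    for host in ms-be1057.mgmt.eqiad.wmnet ms-be1058.mgmt.eqiad.wmnet; do
        # Remote IPMI over lanplus; -E takes the password from $IPMI_PASSWORD.
        if ipmitool -I lanplus -H "$host" -U root -E chassis power status >/dev/null 2>&1; then
            echo "$host: IPMI OK"
        else
            echo "$host: IPMI failed (could not establish RMCP+ session)"
        fi
        # Redfish service root; -k because the BMC certificate is normally self-signed.
        if curl -sk -u "root:${IPMI_PASSWORD}" "https://${host}/redfish/v1/" >/dev/null; then
            echo "$host: Redfish responds"
        else
            echo "$host: Redfish unreachable"
        fi
    done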
[10:52:53] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/803888
[10:55:50] +1, thanks - LMK when it's available on cumin2002 and I'll have another go :)
[10:56:15] * volans waiting for jenkins :)
[10:56:17] will do
[11:00:27] FYI this will shut down the host, so if you need to do any prior action, now is the right time :)
[11:00:33] Emperor: all yours
[11:01:16] Found 2 Physical disks to convert to non-RAID: ['Disk.Bay.24:Enclosure.Internal.0-2:RAID.Integrated.1-1', 'Disk.Bay.25:Enclosure.Internal.0-2:RAID.Integrated.1-1']
[11:01:20] Found 2 Virtual disks to delete: {'/redfish/v1/Systems/System.Embedded.1/Storage/RAID.Integrated.1-1/Volumes/Disk.Virtual.1:RAID.Integrated.1-1', '/redfish/v1/Systems/System.Embedded.1/Storage/RAID.Integrated.1-1/Volumes/Disk.Virtual.0:RAID.Integrated.1-1'}
[11:01:34] that part should be fixed
[11:03:56] here we go...
[11:04:45] * volans tailing the logs
[11:05:01] [pro tip] for hosts in eqiad it's slightly better to run it from cumin1002 :-P
[11:06:24] I think I switched to cumin2002 for upgrades when cumin100x got rebooted and haven't switched back :)
[11:07:11] so, the first virtual failed... but Dell doesn't tell us why
[11:08:51] that is quite an unhelpful error message indeed :-/
[11:09:40] from Dell docs online on PR21:
[11:09:41] Detailed Description: The specified job did not complete successfully.
[11:09:44] Recommended Response Action: Check the Lifecycle Log for more details associated with this failure, then retry the operation.
[11:09:48] thank you Dell
[11:11:14] and this second one seems bound to fail too
[11:12:10] I wonder if it's waiting for something or other that failed... I *love* hardware
[11:12:30] Mmm, I wonder if the waiting task is now just wedged indefinitely.
[11:13:19] there is only one task right now
[11:13:19] 'Members': [{'@odata.id': '/redfish/v1/TaskService/Tasks/JID_547076376240'}],
[11:13:33] and the status is 'Message': 'Task successfully scheduled.'
[11:13:51] yeah, jobqueue shows me one successfully completed job, one failed job, and one scheduled job
[11:14:03] was the successful one the shutdown?
[11:14:21] no, a raid job.
[11:14:48] volans: https://phabricator.wikimedia.org/P29523
[11:15:23] $DEITY knows how long ago that was, given the obviously-false start time
[11:15:35] 'CompletionTime': '2020-11-19T17:21:25',
[11:15:54] so unrelated
[11:15:57] Mmm
[11:18:58] Emperor: can I re-try the delete of the first VD manually?
[11:19:37] you certainly can - shall I ^C the cookbook? It'll give up in another couple of minutes
[11:19:52] it will time out in 1 minute
[11:19:52] 28/30
[11:19:58] so let's wait and see
[11:20:03] that way we also see the failure scenario :)
[11:20:24] TBD if we want it to restart the server on failure too (it's easy to add if needed)
[11:20:44] Unable to perform configuration operations because a configuration job for the device already exists.
[11:21:17] that's presumably the job still sat in the queue
[11:21:42] yep, let me try to delete the job and see
[11:21:48] 'k
[11:25:01] if I can find how to do it...
[11:25:31] there's jobqueue delete from the ssh mgmt, if you like
[11:25:43] if you can, please delete the scheduled one
[11:26:01] the DELETE on the task in redfish is not permitted and I'm not finding the proper incantation right away
[11:26:21] racadm jobqueue delete [-i]
[11:26:23] should do it
[11:26:43] "RAC1032: JID_547076376240 job(s) was cancelled by the user."
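A hedged sketch of the job-queue inspection and cleanup that happened here - querying the iDRAC's Redfish TaskService and then cancelling the stuck job via racadm over the mgmt ssh session. It assumes curl and jq are available where it runs, the same IPMI_PASSWORD environment variable as before, and reuses the job ID from this log (JID_547076376240), which would of course differ on another host:

    # List the tasks the iDRAC currently knows about (the wedged RAID job shows up here).
    curl -sk -u "root:${IPMI_PASSWORD}" \
        https://ms-be1060.mgmt.eqiad.wmnet/redfish/v1/TaskService/Tasks | jq .

    # Inspect one task to see its state ('Task successfully scheduled.', failed, etc.).
    curl -sk -u "root:${IPMI_PASSWORD}" \
        https://ms-be1060.mgmt.eqiad.wmnet/redfish/v1/TaskService/Tasks/JID_547076376240 | jq .

    # From the iDRAC ssh/racadm side: list the queue and cancel the stuck scheduled job.
    racadm jobqueue view
    racadm jobqueue delete -i JID_547076376240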
[11:27:00] so you should be GTG now
[11:27:07] <3
[11:30:20] this is not promising
[11:30:20] https://www.dell.com/community/PowerEdge-Hardware-General/Dell-R720-iDRAC-7-jobs-won-t-run/td-p/7686288
[11:30:36] so far it's staying at 0%
[11:31:35] * volans checking the Lifecycle Controller
[11:35:51] Is this system an older one than the ones you tested on, then?
[11:36:11] question or statement?
[11:37:01] we could upgrade the idrac, yes, that's an option, but it doesn't help make the process smoother for you :/
[11:37:01] question (I was just wondering; ISTR two slightly different board IDs)
[11:37:16] volans: Mmm.
[11:38:28] no, it's the same board ID for 1060-1067, 2057-2065
[11:38:45] and idrac version?
[11:41:12] Ah, no. 1060 is BIOS version 2.9.3, 2065 is BIOS version 2.11.2
[11:41:23] (from the puppet dmi fact)
[11:41:31] ok
[11:42:15] 2.11.2 bios_version is 1064-1067, 2062-2065
[11:42:17] FYI downtime will expire in ~1h20m
[11:43:04] I dunno at what point I should do this one by hand (via ^R and the setup UI)?
[11:43:41] wanna do this by hand and then we try 1064 with the cookbook?
[11:44:30] Sure.
[11:44:46] sorry :/
[11:44:54] No worries! hardware is terrible
[11:45:06] can I delete your queued job?
[11:45:16] yes, I was about to tell/do it myself
[11:49:52] RAID config updated, powercycling (to hopefully boot back into stretch)
[11:53:02] booting
[11:53:35] re-enabling puppet
[11:54:06] fiuuuu at least it booted :D
[11:54:14] lol
[11:54:26] if not, it's due a reimage anyway
[11:54:31] $ cat /sys/block/sda/queue/rotational
[11:54:31] 0
[11:55:11] cool. Shall we try 1064?
[11:57:11] volans: I've stopped swift on ms-be1064, so I'm ready to try the convert-ssds cookbook if you're happy?
[11:59:03] (if you want to pause, I can break for lunch and come back to this after my post-lunch meeting, which will be ~14:00 UTC)
[11:59:08] I was about to get something for lunch, but sure, what could possibly go wrong :D
[11:59:28] as you prefer
[12:00:00] let's give it a go :)
[12:00:46] huh, redfish error
[12:00:54] * volans tailing
[12:01:01] wrong password?
[12:01:03] "Message": "The authentication credentials included with this request are missing or invalid.",
[12:01:27] I thought I'd pasted the right one, evidently not.
[12:01:33] this time looks more hopeful
[12:01:37] :D
[12:02:01] I wonder if the power off might have an effect on the job scheduling
[12:02:40] oh, you might need to schedule the task before the power-down?
[12:02:52] Alas, the first job failed, so I expect the second is wedged.
[12:03:08] not sure... running them with the OS running causes I/O issues, so I'd rather avoid those
[12:03:42] Mm.
[12:04:15] and then rebooting into the BIOS manually is basically a similar thing
[12:04:36] Mmm. I'm inclined to delete the probably-wedged second job (given the first failed), bring the system back up and go get lunch...?
[12:04:52] Emperor: +1, this looks exactly the same as 1060
[12:04:59] Ack, doing so.
[12:04:59] so the idrac version doesn't seem to be the issue
[12:05:24] we can try to fix it later; if you can keep this host with swift down I might play with it after lunch
[12:05:29] to find a solution, if you want
[12:06:14] Sure. It's booting (at least partly to check it still can), then I'll make sure puppet & swift are stopped for you.
[12:06:33] perfect, thanks a lot!
[12:06:42] I'll get some lunch and resume the fight!
[12:06:59] thanks, I appreciate it.
[12:07:17] sorry for the disappointment so far
[12:07:50] NP. If it all turns out to be a doom-fest, I can always fall back to at least poking by hand while I do upgrades
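A small sketch of the post-reboot sanity check used above - the sysfs rotational flag flipping to 0 once the controller stops presenting the SSDs behind RAID virtual disks - assuming sda and sdb are the converted devices, as on ms-be1060:

    # After the virtual disks are deleted and the SSDs converted to non-RAID,
    # the kernel should see them as non-rotational devices.
    for dev in sda sdb; do
        echo -n "$dev rotational="
        cat "/sys/block/${dev}/queue/rotational"   # expect 0 for an SSD
    done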
[12:09:21] it's back up, and swift is stopped, so you're good to go whenever
[12:14:59] perfect, thx
[13:32:24] I'm looking again at T309171 since we're short on space on centrallog2002 again; I'm tempted for now to trim only swift logs to, say, the last 30d - thoughts?
[13:32:25] T309171: syslog / centrallog log volume growth - https://phabricator.wikimedia.org/T309171
[13:32:34] Emperor: ^
[13:32:53] 30 days is arbitrary, could be longer too
[14:13:39] Emperor: for when you're back, I have some good and bad news...
[14:16:58] * Emperor is sort-of here, but also talking to someone on slack
[14:21:35] Emperor: first a question: is it possible I need to hit restart multiple times before getting the host back, because of the disk order issue? The console at some point becomes just empty and I'm not sure if I should wait or not
[14:21:51] s/empty/blank/
[14:25:38] volans: after messing with the disks, it seems to take longer for the POST to complete to the point of getting output on the virtual console
[14:26:03] ...but I've always found the system does eventually boot (unless I forgot to set a new boot device in the controller mgmt screen)
[14:26:21] ok, I'm giving it some more time
[14:29:20] O(2-3 minutes)
[14:30:51] we're way over that here... ok, checking config, and eventually I'll check the BIOS directly
[14:40:51] Emperor: ms-be1064 is back up and running, all yours (puppet still disabled)
[14:42:33] the load is quite high, does swift restart automatically even with puppet disabled?
[14:43:33] md0 shows as [_U]
[14:44:33] volans: yes, I think so.
[14:44:48] md0> I saw that the other day, let me finish committing to private-puppet and I'll have a look
[14:44:54] ack
[14:45:48] we can sync up on the cookbook later when you're less busy. TL;DR we need to adjust some things in the cookbook and spicerack to hopefully get it working all on its own
[14:47:11] Jun 8 14:37:35 ms-be1064 kernel: [ 7.522549] md: kicking non-fresh sda1 from array!
[14:47:36] So something about the process removed sda before a reboot, and now it's stale; I'll re-add it and resync.
[14:48:01] I saw it on a codfw node that I think we were testing on
[14:48:42] (ms-be2066)
[14:49:19] volans: cookbook> OK, cool, thanks. Are you OK with me bringing be1064 back into service?
[14:50:36] Emperor: all yours, up to you whether to wait for it to get back to normal load before putting it back into service
[14:51:01] load is high 'cos of swift (which starts up automatically on system start)
[14:51:21] Probably a bit of catch-up from swift changes when it was down.
[14:55:11] ack
[14:55:37] Cool; hopefully I can sneak one more upgrade reimage through today...
[15:11:02] I'm going ahead with trimming swift logs from centrallog2002 FYI
[15:21:37] oh, sorry, dropped that ball
[15:22:05] godog: yes, that seems OK to me
[15:23:10] Emperor: sure, no worries, yeah normally the log volume isn't a problem; I suspect between the tegola deletes and the rebalance that volume is up
[16:10:17] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
[20:10:31] (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
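A hedged sketch of the md0 recovery described above ([_U] plus the "kicking non-fresh sda1 from array" kernel message), assuming the mirror is /dev/md0 and the stale member is /dev/sda1 as on ms-be1064:

    # Confirm the degraded state and which member is missing.
    cat /proc/mdstat
    sudo mdadm --detail /dev/md0

    # Re-add the stale partition; if the array has a write-intent bitmap the resync is usually quick.
    sudo mdadm /dev/md0 --re-add /dev/sda1

    # Watch the resync progress.
    watch -n 30 cat /proc/mdstat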