[07:26:52] hi folks
[07:27:14] re: https://phabricator.wikimedia.org/T338566 - mw1492's mainboard was replaced, do you think that we need to reimage?
[07:27:22] I don't recall the procedure in those cases
[07:27:50] the host is inactive but in theory ready to get traffic
[07:51:44] serviceops, MW-on-K8s, wikitech.wikimedia.org: Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707 (Marostegui) Any ETA on when this could happen?
[07:58:57] in theory it should be a matter of running scap pull and checking if the interface name is unchanged, but who knows what kind of subtle changes came along with the mainboard replacement (like is it exactly the same revision as the original part etc.) and given how simple our reimages are with the cookbook that seems preferable
[07:59:38] yup, +1 on re-imaging. It's low cost and hands-free enough.
[08:01:07] ack I'll kick it off
[08:07:33] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (elukey) Resolved→Open
[08:07:52] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (elukey) Remaining steps: reimage the node
[08:10:29] serviceops, MW-on-K8s, wikitech.wikimedia.org: Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707 (akosiaris) We will soon be in next quarter OKR/planning season, we 'll post an update then if we manage to schedule it for next quarter. Just so that we have all data that we nee...
[08:12:42] elukey: thanks!
[08:12:44] serviceops, MW-on-K8s, wikitech.wikimedia.org: Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707 (Marostegui) It is not a blocker for us per se, but we got hit again: https://phabricator.wikimedia.org/T237773#8930311
[08:14:33] serviceops, MW-on-K8s, wikitech.wikimedia.org: Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707 (akosiaris) >>!
In T292707#8930381, @Marostegui wrote: > It is not a blocker for us per se, but we got hit again: https://phabricator.wikimedia.org/T237773#8930311 Cool, noted. T...
[08:19:55] elukey: Thanks <3
[08:56:01] of course ipmitool doesn't work :D
[08:56:10] but I get the mgmt console
[08:59:24] oh no
[09:00:04] the funny thing is that if I try to do https://wikitech.wikimedia.org/wiki/Management_Interfaces#Did_you_do_a_reset_but_still_getting_IPMI_connection_failed_(when_using_the_reimage_cookbook)? it fails, since the new iDRAC doesn't support it
[09:00:21] (already cold reset bmc locally, didn't work)
[09:00:29] maybe with the new motherboard something changed
[09:00:38] better call volans ?
[09:00:51] he's on vacation
[09:00:54] yeah :D
[09:01:02] yeah, just saw the |off thing and I remembered
[09:01:03] dammit
[09:01:16] it should be a matter of figuring out the new syntax, I am checking
[09:03:24] elukey: set a timer to get yourself out of the rabbit hole. When we said "re-image", we assumed it would be a 5m job, not a trip to wonderland
[09:03:57] yes yes if I don't find a solution I'll ask to dcops
[09:08:38] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (elukey) @Jclark-ctr I tried to reimage the node but `ipmitool` didn't work, so I tried to reset the bmc locally but still no luck. I can access to the mgmt console and run racadm commands,...
[09:08:42] done :)
[09:23:05] \o/
[09:55:19] serviceops, Patch-For-Review, Service-deployment-requests: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147 (Marostegui)
[12:23:49] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Jclark-ctr) @elukey ` racadm>>racadm config ERROR: RAC1281: Unable to run the command because an invalid command is entered. The command "racadm config" entered is not supported o...
[12:54:26] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (elukey) @Jclark-ctr What I meant was if we have an alternative, or if this is the first time that the issue comes up :) If it is we'll need to find an alternative command for iDRAC 4.40+, i...
[13:23:19] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Jclark-ctr) @Papaul have you run into this before? i do see ipmi is enabled for idrac
[13:32:02] serviceops, Machine-Learning-Team: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (akosiaris) The other thing that I just noticed is that this service [consumes 0.4% of the resources it is allocated](https://grafana.wikimedia.org/d/Y5wk80oG...
[13:41:15] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Papaul) @Jclark-ctr check firmware version if old upgrade. If you can not access the IDRAC to check the firmware, reset the IDRAC first
[14:14:04] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Jclark-ctr) @papaul firmware was a very old version i did update. idrac is reachable and has been the entire time. the issue is ipmitool is not working
[14:48:10] serviceops, Patch-For-Review, Service-deployment-requests: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147 (jijiki) Currently iPoid's database, `m5-master.eqiad.wmnet` is anchored to eqiad for writes, while it can read from `m5-master.codfw.wmnet`, that means that, the service...
[14:48:56] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Papaul) @Jclark-ctr can you check that it is enable in the idrac
[14:50:53] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Jclark-ctr) @Papaul it is enabled in idrac
[15:17:08] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Papaul) @elukey racadm config no longer works you need to use racadm set
[15:23:03] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (elukey) @Papaul yep yep, I was wondering if dcops had any suggestion about racadm set, never used it and I don't find an alternative command to use..
[15:25:02] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Papaul) @elukey what was it you was trying to do? I see you last comment said re-image failed and you was going to reset the password on the node I am right?
[15:26:38] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (elukey) @Papaul basically try https://wikitech.wikimedia.org/wiki/Management_Interfaces#Did_you_do_a_reset_but_still_getting_IPMI_connection_failed_(when_using_the_reimage_cookbook) to see...
[16:56:21] serviceops, DC-Ops, SRE, ops-eqiad: mw1492.eqiad.wmnet is down - https://phabricator.wikimedia.org/T338566 (Papaul) @elukey @Jclark-ctr this is fix now. I set the IDRAC to factory and run the provision cookbook. You can close the close if nothing else is left to do. Thanks ` pt1979@cumin1001:~$...
[19:37:59] serviceops, iPoid-Service, Kubernetes, Patch-For-Review: Create helm chart for iPoid - https://phabricator.wikimedia.org/T336163 (jijiki) Open→Resolved Dear #anti-harassment for the time being, the chart is ready! What now? * Ensure that the liveness endpoint exists as described in [[ht...
[19:38:06] serviceops, Patch-For-Review, Service-deployment-requests: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147 (jijiki)
[22:52:03] serviceops, Observability-Alerting, observability: Port openapi/swagger checks/alerts to Prometheus - https://phabricator.wikimedia.org/T320620 (colewhite) In progress→Resolved The new alerts are in place and the old checks have been removed from Icinga.