[06:14:42] What are we supposed to do with alerts like: ipmiseld.service Failed? There's no wiki page associated with it and also...the IPMI seems to be working fine
[06:26:11] marostegui: I did "db2180:~$ sudo service ipmiseld restart" and it does show " ipmi_sdr_cache_create: internal IPMI error"
[06:26:24] but it still shows as running
[06:27:09] yes, that's my point
[06:27:23] What are we supposed to do with those?
[06:27:43] ipmi-chassis keeps working just fine
[06:28:06] marostegui: I found https://phabricator.wikimedia.org/T302639
[06:28:26] So I'd recommend opening a task for I/F and maybe o11y
[06:28:40] I thought it might be this: https://phabricator.wikimedia.org/T305147
[06:29:33] good point!
[06:29:54] seems fine to piggyback on that task
[06:30:07] Yeah, I will comment there
[06:30:19] and maybe ask for some doc to be written so we're not lost next time it happens
[08:07:30] apergos: sorry, got in a meeting and never got out xd, I saw the error again today, but as before it went away when I hit 'sign in', the dashboard afaik has no special permissions :/, we can ask o11y for help
[08:07:55] ok good to know
[08:10:51] sent a message in their channel, we can open a task if it's more than something silly to fix (maybe a dashboard setting somewhere)
[08:12:20] ok thanks a heap
[08:29:03] apergos: it seems it was a silly permissions issue xd, can you retry?
[08:29:20] probably butchered them somehow while changing things
[08:34:35] got the link handy?
[08:34:46] dcaro:
[08:35:18] apergos: here https://grafana-rw.wikimedia.org/d/000000568/wmcs-dumps-general-view?orgId=1&from=now-90d&to=now
[08:35:46] works great now, thanks! I'll let hannah know too, she had the same issue.
[08:41:00] 👍
[08:41:12] feel free to change/modify/add/remove whatever from that dashboard
[12:50:46] we are getting a bunch of NRPE timeout alerts, is anything happening or is it just us?
[12:55:03] all of them are from the same host, probably something specific to us
[12:55:31] or maybe network?
[12:56:07] ping seems to work though
[12:58:18] try to see which layer is not working
[12:59:09] andrewbogott: has console and looking, might be load
[15:10:36] if a host runs out of RAM it's typical behaviour that the NRPE process suffers first.. the OOM killer likes to kill it first
[15:10:47] then if you restart the nagios-nrpe service all the NRPE checks should recover
[16:49:44] db2165 went down
[16:49:48] I'm not next to my laptop
[16:50:03] can someone depool it? from cumin1001 dbctl instance db2165 depool
[16:50:16] dbctl config commit -m "depool db2165"
[16:50:23] and create a task so I can check later?
[16:50:32] marostegui: it's a master according to puppet
[16:50:37] shit
[16:50:53] I'm flying
[16:50:56] let me get my laptop
[16:50:59] marostegui: https://github.com/wikimedia/operations-puppet/blob/f10572681e40035bc3e1c289e1b2285a881ad356/hieradata/hosts/db2165.yaml#L1
[16:51:06] And there are replication alerts flying
[16:51:33] it's back
[16:51:37] I'm pulling my laptop out
[16:52:35] RhinosF1: can you create a task please
[16:52:43] seeing this now, I can help by making a ticket
[16:52:43] marostegui: yes
[16:52:48] or ok
[16:52:48] mutante: I will
[16:52:51] ack
[16:52:52] Ack the page though
[16:52:52] can someone check the hw logs?
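(Editor's note: a minimal sketch of the depool sequence dictated above, as it would be run from cumin1001; whether sudo is needed and the follow-up task step are assumptions, not taken from the log.)

    # run from a cumin host, e.g. cumin1001, as dictated above
    dbctl instance db2165 depool                  # mark db2165 as depooled in the dbctl configuration
    dbctl config commit -m "depool db2165"        # commit so the change propagates and MediaWiki stops using the host
    # then open a Phabricator task so the DBA can follow up later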
[16:52:54] I am starting mariadb
[16:52:58] page is acked
[16:52:59] checking hw logs
[16:53:02] I am descending to 3000ft too XD
[16:53:12] pilot speaking there, heh
[16:53:24] started mariadb but still read only while i check
[16:54:02] logstash2002 also complained of ping at the same time
[16:54:04] https://phabricator.wikimedia.org/T336072
[16:54:11] might have been a short network outage
[16:54:24] (both are in rack C5)
[16:54:29] Description: The input power for power supply 1 has been restored.
[16:54:38] Description: The power input for power supply 1 is lost.
[16:54:43] cwhite: no, it rebooted
[16:54:51] marostegui: should i power cycle? is up?
[16:54:57] it is up
[16:55:01] oh shit wait, we are in codfw
[16:55:07] ok. so looks like it lost power and came back
[16:55:07] so it is fine to be read only
[16:55:09] that is good
[16:55:21] papaul: hi, you still around?
[16:55:43] that's right, logstash2002 also rebooted
[16:56:03] we are out of the woods
[16:56:09] all is fine now replication recovered
[16:56:18] I am going to close my laptop and check this when I get home
[16:56:34] marostegui: good to hear, alright
[16:56:37] if someone can post the hw logs on the task, that'd be great
[16:56:46] interesting the uptimes don't match between db2165 and logstash2002
[16:57:02] it's been a pleasure seeing you fly a plane and triage a db
[16:57:17] herron: only 3 minutes difference, so it could be the same thing
[16:57:20] let me take over
[16:57:30] although there may be not much else to do
[16:57:31] jynus: thanks, so far we are out of the woods, it is all ok
[16:57:39] mutante: i am here but no onsite
[16:57:42] but if you can take a general look at the data, that'd be great
[16:57:43] but Jennifer is
[16:58:00] herron: I am actually with a student so not technically flying myself :)
[16:58:05] ok, will back to this later
[16:58:09] thanks everyone for the fast response
[16:58:10] papaul: I think it just got downgraded from UBN to High.. but it does seem like there was a physical event with power cable
[16:58:22] papaul: but not emergency anymore
[16:58:37] yeah, we should probably do a switchover and then get the cable looked at
[16:58:38] Amir1: ^
[16:58:48] a codfw master switchover is easy to do anytime
[16:58:48] marostegui: RhinosF1: done. can be included with {P47780} https://phabricator.wikimedia.org/P47780
[16:58:54] I will do any potential (theoretical) work for a master switch, but just in case (don't intend to do it), just to have everything ready if it happens again
[16:58:55] as codfw isn't primary anymore
[16:59:16] jynus: use https://switchmaster.toolforge.org/ that will generate the task and patches for you
[16:59:17] unless Amir wants to lead that
[16:59:25] ok i have to vanish
[16:59:27] thanks everyone
[16:59:36] cheers marostegui thank you
[16:59:36] I will start pre-preparations in any case
[17:00:00] SEL on logstash2002 is flapping power supply redundancy lost since 13:57
[17:01:22] Amir1: I've hit schedule, but probably needs a "password"?
[17:01:34] other hosts in the same rack are also flapping PS redundancy
[17:01:46] maybe low on power?
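(Editor's note: a sketch of how the hardware event log (SEL) discussed above can be read; the management hostname is a placeholder and whether ipmitool or the iDRAC "getsel" is used in this environment is not confirmed by the log.)

    # on the host itself, if ipmitool is available:
    sudo ipmitool sel elist                       # list System Event Log entries, e.g. "power input ... is lost/restored"
    # or against the out-of-band management controller (Dell iDRAC), the "getsel" mentioned later:
    ssh ADMIN@db2165.mgmt.example racadm getsel   # placeholder mgmt hostname; prints the same SEL from the iDRAC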
[17:02:40] now also see -dcops channel
[17:02:53] we can always depool codfw if needed
[17:02:59] like we did with the switches maintenance
[17:03:37] that's not out of the question, apparently there are like 6 masters on row C
[17:03:44] hi, I just got back
[17:03:45] sorry
[17:03:59] yeah, I can share the password or I can do the switchover
[17:04:15] yeah let's do it asap
[17:04:20] Amir1: no switchover at this time, but I would like to prepare it just in case
[17:04:21] sure. You go marostegui
[17:04:28] Amir1: no, I can't :)
[17:04:29] i dont think you need to depool codfw
[17:04:30] Amir1: manuel is flying a plane
[17:04:30] I'm flying
[17:04:36] it was 2 servers, both close to each other
[17:04:39] and there was cable replacement
[17:04:48] I mean, get out of here marostegui. Go fly or whatever
[17:04:48] it does seem to be a general power issue
[17:04:49] Amir1: I did the initial triage
[17:04:58] JennH: in here please
[17:04:58] does NOT
[17:04:59] which section is it?
[17:05:03] but I can't do the switch over with this unstable connection
[17:05:08] 18:03:26 I was replacing the black power cables with the blue and red. I thought I was being careful but looks like I goofed one. my bad
[17:05:15] ^ this, I think we know why
[17:05:17] Amir1: C5 in the datacenter, s8 in mw
[17:05:17] Amir1: host was s8
[17:05:19] hi, I was replacing old cables for a refresh in C5. it was my fault
[17:05:19] and no need to do more
[17:05:35] awesome. I'll do the switchover now
[17:05:35] JennH: any work left on that rack?
[17:05:42] no I am done now
[17:05:45] Amir1: maybe not if there's no more work pending
[17:06:04] are PS redundant now?
[17:06:09] JennH: thank you
[17:06:35] we can see in hardware logs too it was just disconnected and connected again. the latest log message for the db server was "power supply restoed"
[17:06:40] restored
[17:06:47] T336075 is there, ready
[17:06:48] T336075: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T336075
[17:06:56] so maybe we don't have to do the switch if the PS are back and no more work is expected
[17:07:32] ^ sounds like it to me, no reason to expect more problems. just check logstash2002 is not still flapping either
[17:07:51] then let's not do the switch I would say
[17:07:55] Amir1: thoughts
[17:08:28] let me check something
[17:09:00] checked "getsel" of logstash2002, looks the same, power lost, power restored.. ended a while ago
[17:09:20] I think let's not do it, the candidate is not even rebooted so it's basically useless
[17:09:23] marostegui: yea, I was only proposing preparing it when I thought it could be an unfixed hw defect
[17:09:51] if the candidate had been rebooted we would have had to do the switchover anyway and might as well do it now, but since that's not the case, let's just ignore it
[17:09:52] sounds good then yeah. let's hold it
[17:09:57] it is not the first time that a power supply fails partially
[17:09:59] sounds good
[17:10:07] but that seems not to be the case
[17:10:18] yeah it was good it was due to onsite work
[17:10:23] Thanks for the very fast response everyone
[17:10:27] rather than hw breaking
[17:10:38] so we are all good then
[17:10:47] thanks everyone
[17:12:02] I closed the switchover task too
[17:12:25] cheers
[17:15:26] Is the page fully resolved on the VictorOps side?
[17:15:35] Don't want any scares late on a Friday
[17:17:21] all incidents marked as resolved, thanks all!
[17:18:00] thanks:)