[05:48:43] is phabricator super slow or is it me?
[05:51:48] just you I think, seems pretty responsive to me
[05:52:35] <_joe_> same
[05:57:23] yeah, looks like it is back
[05:57:26] thanks
[11:34:45] not sure if related, but I'm having difficulty reaching wikitech a fair bit. got a few timeouts, a 503, and 502 Next Hop Connection Failed from ATS cp3050, maybe 1/10 getting through. Having said that, I'm on terrible train wifi, but other things seem fine, so take that with a grain of salt.
[11:52:29] <_joe_> Krinkle: I am currently browsing wikitech just fine though
[11:59:03] _joe_: try POST requests, e.g. special:prefs and make some dummy change and save and change back save again etc.
[12:00:59] <_joe_> Krinkle: ack sorry I am now able to reproduce
[12:01:21] <_joe_> it is slow
[12:04:11] <_joe_> Krinkle: no errors on the backend though for POSTs
[12:04:44] <_joe_> Krinkle: so I had some slowness, but now it's gone
[12:04:55] <_joe_> can you still reproduce?
[12:17:07] not a lot of sample data, given the bad connection I have to begin with, but it's been a few minutes since one failed with an edge-server error
[12:17:22] Looking in Grafana stats, esams doesn't seem to have a 50x spike
[12:17:37] looking in Logstash, there seems to be an unrelated but odd spike in 50x errors for event-intake
[12:17:46] 12,000x HTTP 503 in the last 4 hours
[12:17:53] for that endpoint
[12:18:16] that's with ?hasty=true which is meant to be difficult to make fail
[12:18:33] ( https://logstash.wikimedia.org/app/dashboards#/view/Varnish-Webrequest-50X )
[12:18:41] anyway, I'm off now.
[12:18:41] o/
[14:41:36] arturo, XioNoX, topranks, thoughts about cloudgw vs pxe booting? T296906
[14:41:36] T296906: reimage/pxe boot failing on cloudvirt1028 - https://phabricator.wikimedia.org/T296906
[14:46:58] andrewbogott: I'll take a look now shortly
[14:47:06] thank you!
[15:27:27] andrewbogott: sorry, I wont have the chance to look today
[15:27:36] ok
[15:28:15] but I honestly doubt cloudgw has anything to do with it
[15:28:27] in any case, the switch
[16:19:40] andrewbogott: looking at things nothing jumps out at me as to why this would fail.
[16:20:03] The cloudsw are configured correctly to insert the required info into DHCP requests from cloudvirt1028
[16:20:25] And the install servers have the right config snippet to match that data and return the correct IP and pxelinux path.
[16:20:53] I'm thinking volan.s suggestion to re-try and do a full packet capture on the install server might be an idea, to try to see exactly what data is being received and what is returned
[16:21:39] the dhcp logs that v_olans put seem to suggest that the lease is lost somehow (unknown lease), I've seen that when having more than 1 dhcp server running unknowingly in the same host, but does not seem to be the same issue (same pid in the logs)
[16:21:44] my 5 cent
[16:23:12] dcaro: thanks yeah.
[16:23:19] 100% that is often the cause of that.
[16:23:57] Here I am not so sure. Due to how we have the networking / layer-2 set up, the install server will get 2 copies of every DHCP packet from the end host.
[16:24:57] So, looking at those last 3 lines, I'm not sure if the first and the last are identical requests, and the install server has renewed the lease between receipt of each (line in the middle), and thus has that error beside the last packet (original lease is gone by then).
[16:25:43] The reason I think they are likely duplicates of each other is one has come from cr1-eqiad (10.64.20.2) and the other cr2-eqiad (10.64.20.3)
[16:26:03] that felt weird yep, is that expected?
[16:26:12] both within the same second. So the broadcast frame the host sent has probably hit both CR routers, and been forwarded twice.
[16:26:34] I'm not an expert here, but thinking it through I assume that is happening on all our Vlans.
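The duplicate-relay theory above (one client broadcast forwarded by both cr1-eqiad and cr2-eqiad) could be checked against a capture with something like the following. This is only a rough sketch and not what anyone in the channel ran: it assumes the packets have already been decoded into dicts, the key names `xid`, `client_mac` and `giaddr` are made up for illustration, and the MAC address in the sample is a placeholder. It also assumes each router sets its own address as the relay address, which is what the two addresses quoted from the logs suggest.

```python
from collections import defaultdict


def relays_per_request(packets):
    """Map (xid, client_mac) to the sorted relay addresses that forwarded it."""
    seen = defaultdict(set)
    for pkt in packets:
        # The same client request keeps its transaction id (xid) when relayed,
        # so copies of one broadcast group under the same key.
        seen[(pkt["xid"], pkt["client_mac"])].add(pkt["giaddr"])
    return {key: sorted(relays) for key, relays in seen.items()}


def duplicated_requests(packets):
    """Keep only the requests that arrived via more than one relay."""
    return {
        key: relays
        for key, relays in relays_per_request(packets).items()
        if len(relays) > 1
    }


if __name__ == "__main__":
    # Hypothetical decoded packets; the MAC is a placeholder, not cloudvirt1028's.
    sample = [
        {"xid": 1, "client_mac": "aa:bb:cc:dd:ee:ff", "giaddr": "10.64.20.2"},
        {"xid": 1, "client_mac": "aa:bb:cc:dd:ee:ff", "giaddr": "10.64.20.3"},
        {"xid": 2, "client_mac": "aa:bb:cc:dd:ee:ff", "giaddr": "10.64.20.2"},
    ]
    print(duplicated_requests(sample))
```

If every request in a capture shows up under both relay addresses, that would support the reading that the duplication is normal for this layout rather than specific to the failing host.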
[16:27:11] All the switches have 2 uplink ports in each Vlan to the CR routers, so any DHCP broadcast from a host will hit both, who will in turn forward it (unicast) to the install/dhcp server.
[16:27:33] that sounds tricky
[16:28:12] indeed. Glad to say our newer model with top-of-rack switches acting as L3 gateway won't have that quirk.
[16:28:49] DHCP will be relayed from top-of-rack switch direct to install server, so just 1 packet will get forwarded on to the install server.
[16:29:40] I'm surprised that has not caused issues so far xd
[16:30:26] yeah it's not ideal.
[16:33:33] do you think the multiple requests is what's breaking my install, or is this just a side curiosity?
[16:34:04] I can only assume that's been happening normally for every install forever, so no I think it's probably a curiosity.
[16:34:16] ok
[16:34:36] I would like to re-run the reimage cookbook and do a capture on install1003 to see what info is being exchanged though.
[16:35:11] do we have the infra to capture on the switch side?
[16:35:15] topranks: ok, want to collaborate on that now? I can certainly re-run the cookbook, would have to dig a bit to do the capture part
[16:35:23] (I know it's tricky most times)
[16:35:50] * andrewbogott is now at the ready to relaunch the reimage
[16:36:03] I can do the capture yep, give me a minute or two to get it set up
[16:36:21] great, thank you!
[16:36:28] dcaro: it's not impossible but not quite as easy as on a Linux host.
[16:36:38] as we can see the packets getting to the install box I'll start there this time
[16:39:05] andrewbogott: good to go whenever you can kick that off
[16:40:29] now cruft from previous failures is blocking things...
[16:40:47] https://www.irccloud.com/pastebin/QjE2Xk0I/
[16:40:57] I guess that means the record is there so I can just do a boot by hand
[16:41:15] but probably there's a way to clean that up but no idea what that is...
[16:41:15] gtg, good luck!
[16:41:19] * andrewbogott waves
[16:41:57] * andrewbogott reboots by hand for now
[16:42:22] yea.... I think that makes sense, certainly what is needed for the dhcp config is already there
[16:42:27] it seems that there's some leftover config in the dhcp server
[16:42:27] https://www.irccloud.com/pastebin/SggzKXfe/
[16:42:42] my guess is that's the file it's finding already there and bombing out on
[16:43:16] * dcaro_away really out o/
[16:43:24] seeya!
[16:43:38] topranks: want to just rm that file and I'll try again?
[16:44:06] I'll save it somewhere locally, knowing my luck doing that manually will break wikipedia globally :)
[16:44:09] but yep
[16:44:10] one sec
[16:44:43] ok it's not there anymore, you can try the reimage cookbook again
[16:45:18] looking better this time! Should get to dhcp in a few...
[16:45:25] ok!
[16:47:49] anyone know if the victorops vcard is available anywhere, without downloading their app?
[16:47:57] No DHCP in the capture yet, the cookbook has recreated the file I moved ok though.
[16:48:39] topranks: ok, now I see the blank console that presages a pxe timeout
[16:49:52] and now it's showing a grub prompt
[16:50:01] (which I take to mean it gave up)
[16:50:21] no DHCP message hit the install server at all.
[16:50:31] cloudvirt1028 login:
[16:50:42] so either there's a networking problem or the host isn't actually trying to pxe boot
[16:50:46] I can ping the server
[16:50:51] it definitely says it's going to on startup
[16:51:26] The first thing it said on startup was 'IPMI: Boot to PXE Boot Requested by iDRAC'
[16:51:45] last words are:
[16:51:47] https://www.irccloud.com/pastebin/LGEexINp/
[16:51:57] ok... this is an existing host you're trying to re-image right?
[16:52:03] topranks: that's correct
[16:52:19] so failure = just falling back on the existing hdd boot
[16:52:25] which is why you can ping it
[16:52:41] yeah makes sense
[16:54:13] actually wait.
[16:54:29] I do see some DHCP logs for that host... but didn't appear in my capture..
[16:54:33] give me a moment
[16:54:40] sure thing
[16:56:04] my bad, I was filtering for the device's MAC address, but that won't be in the source packet header (DHCP request is relayed by the CR router).
[16:56:22] ok, so want me to just try again?
[16:56:26] I've it running again now... can you try the reimage again?
[16:56:30] yep!
[16:56:30] yeah please
[16:56:39] I've removed the conf file to overcome that again
[16:57:04] host is rebooting, wheels are turning
[16:57:17] (I think if I do a proper 'abort' in the cookbook it cleans up, I maybe disconnected or something last time)
[16:57:26] ok let's hope I get it this time.
[17:00:29] ok see it there
[17:02:20] Ok well I can see it's getting the DNS server IP in the reply, and the PXElinux URL in option 43.
[17:02:37] ok, it's falling back on hdd now
[17:02:52] yeah, I guess the odd thing is that it does request multiple times.
[17:03:00] if it's getting the reply and the url then we have a firmware issue maybe?
[17:03:09] The duplicated packets are one thing, but I'd expect only one request/response cycle.
[17:03:17] Yes.
[17:03:32] But I wonder if it is getting the response
[17:03:33] (robh upgraded firmware a couple of days ago but I don't know if he did idrac)
[17:03:52] i didnt but that has nothing to do with dhcp stuff
[17:03:54] I could, for science, drain 1028's neighbor 1027 and see if it behaves differently
[17:04:01] idrac that is
[17:04:09] as it required a step up its so old
[17:04:12] makes sense
[17:04:31] I'm not sure that'd help too much right now.
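On the capture filter mix-up at 16:56:04: once a router relays a DHCP request, the frame reaching the install server carries the router's source MAC, while the client's hardware address survives only inside the BOOTP payload (the chaddr field), next to the relay address (giaddr). A capture filter on UDP port 67 still catches the relayed traffic; matching the client then has to happen on the payload. Below is a minimal sketch of that payload check, assuming the standard fixed BOOTP header layout from RFC 951 / RFC 2131. The function names and the sample MAC are made up for illustration and are not taken from the channel.

```python
import socket
import struct

# Fixed BOOTP header up to and including chaddr:
# op, htype, hlen, hops, xid, secs, flags, ciaddr, yiaddr, siaddr, giaddr, chaddr
BOOTP_HEADER = struct.Struct("!BBBBIHH4s4s4s4s16s")


def parse_bootp(payload: bytes) -> dict:
    """Pull the relay and client fields out of a UDP port 67/68 payload."""
    (_op, _htype, hlen, hops, xid, _secs, _flags,
     _ciaddr, _yiaddr, _siaddr, giaddr, chaddr) = BOOTP_HEADER.unpack_from(payload)
    return {
        "xid": xid,
        "hops": hops,
        "giaddr": socket.inet_ntoa(giaddr),       # address of the relaying router
        "client_mac": chaddr[:hlen].hex(":"),     # the host's own MAC
    }


def matches_client(payload: bytes, mac: str) -> bool:
    """True if the DHCP payload was sent on behalf of the given client MAC."""
    return parse_bootp(payload)["client_mac"] == mac.lower()


def sample_request() -> bytes:
    """Build a hypothetical relayed request header (placeholder MAC)."""
    return BOOTP_HEADER.pack(
        1, 1, 6, 1, 0x1234ABCD, 0, 0,
        bytes(4), bytes(4), bytes(4),
        socket.inet_aton("10.64.20.2"),
        bytes.fromhex("aabbccddeeff") + bytes(10),
    )


if __name__ == "__main__":
    print(parse_bootp(sample_request()))
```

`bytes.hex(":")` needs Python 3.8 or newer. A real capture would also carry the sname/file fields and the options after this header; the sketch ignores them.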
[17:04:33] i can update if you like but it wont affect dhcp stuff
[17:04:36] (draining the other one)
[17:04:37] promise ; D
[17:05:13] robh: I believe you :)
[17:05:49] i hesitate to update only cuz i know there is a 'bad' version between the one on there and newest and if you accidentally hit it, you have to have a crash cart roll it back ; D
[17:05:54] idrac that is
[17:06:23] i think the .66 version introduces an ssl error into the interface ssl cert which makes it unreachable by modern browsers heh
[17:07:24] that's terrible!
[17:11:48] topranks: anything else I can try? It sounds like you're leaning towards 'definitely not a network issue'
[17:12:01] no not really tbh.
[17:12:16] I've checked everything I can think of nearly, and I can't see why there would be a network issue.
[17:12:32] But it does sort of look like the DHCP response is not received by the host.
[17:12:36] ok. so i guess I drop this back on dcops
[17:12:51] thank you for your help! Do you mind writing some notes on the ticket?
[17:13:08] oh, you did already :D
[17:13:59] no probs, yeah I'll add more if I find anything. I'll also chat with Arzhel when he's back Monday to see if he has any ideas.
[17:14:08] thx
[17:14:13] I'd love to see for certain the reply DHCP packets from install1003 are making it to the host.
[17:14:47] Certainly they're leaving install1003, and normally that host can ping cloudvirt1028, so the data-path should be ok (although these are relayed via the CRs).
[17:15:18] So it should be getting them and starting the PXEboot, but instead it just re-tries DHCP a few times, and eventually boots from disk.
[17:15:26] Is it safe to assume that other hosts have been imaged recently? That this isn't broken dc-wide?
[17:15:58] No I don't think so, certainly Moritz was doing a ganeti reimage during the week for instance.
[17:18:39] topranks: can you respond with what you know about switch history in -dcops? thx
[17:19:29] jhathaway: looks like some reorg of their documentation ate the list of numbers...
[17:20:07] cdanis: thanks for the confirmation my google-fu was turning up empty
[17:20:42] jhathaway: I emailed you a vcard from my phone's contacts
[17:21:14] cdanis: thanks, much appreciated
[18:04:14] I did multiple reimages over the course of the week, but they were all in codfw
[18:09:23] I reimaged a host in eqiad today