[02:26:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:26:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:28:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:28:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:30:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:33:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:33:25] RESOLVED: SystemdUnitFailed: 
update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:35:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:37:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:13] netops, Infrastructure-Foundations, Documentation: https://wikitech.wikimedia.org/wiki/Out-of-band_network out of date - https://phabricator.wikimedia.org/T379465#10311213 (ayounsi) Open→Resolved a:ayounsi Updated :) [08:02:45] Mail, Infrastructure-Foundations, Trust-and-Safety: Emails from wikimediats.zendesk.com fails DMARC policy - https://phabricator.wikimedia.org/T378285#10311216 (JAbrams) @revi Many thanks for the additional insights, noted! Meeting with SRE today, I’ll keep you posted.
[08:37:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:39:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:42:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:50:57] Hi, I registered a developer account "twinkle-i18n-bot" for use with Gerrit, about 18 hours ago. The activation was successful, but I'm still not able to sign into Gerrit (getting "Authentication failed"). [10:12:00] netops, Infrastructure-Foundations, SRE: Extend sre.network.configure-switch-interfaces cookbook to add sflow and qos config - https://phabricator.wikimedia.org/T379549#10311548 (cmooney) I discussed this briefly with @ayounsi on irc and while this is probably a good idea it won't, as things stand, p... [10:13:29] slyngs: --^ [10:14:08] sd0001: I'll take a look. Can you sign into idm.wikimedia.org? [10:15:14] Seems to be created correctly at least. [10:17:33] slyngs: able to login there as twinkle-i18n-bot, but... this is weird, after login it shows me the details of my main developer account!
[10:17:55] I had used the same email id [10:18:44] gerrit probably associated the login based on your mail address [10:18:51] Oh, that shouldn't be possible, but probably not the issue [10:18:57] one option is to log into idm.wikimedia.org and change the mail address for your bot [10:19:31] to something like siddharthvp+twinklebot@gmail.com (these +foo suffixes are built into gmail) [10:22:07] Gerrit is just LDAP, so regardless that shouldn't be an issue, unless Gerrit needs the email to be unique. [10:22:27] is anyone using sretest2001 for anything right now? [10:22:52] I got curious about some linux netdev naming stuff and wanted to use it for a test if possible [10:27:28] moritzm: changed the email - still can't login, though I guess I need to "allow for a few minutes for the change to propagate" ? [10:27:32] sd0001: I think the emails got mixed up. Do you need me to go in and switch them between the accounts? [10:28:02] slyngs: sure, thanks [10:28:41] (yeah it ended up changing the email of the main account, as that's what was showing on the UI) [10:29:06] Yeah, it's the OpenID Connect account linking, it really likes using the email as a key [10:31:39] sd0001: Okay, both accounts look correct now. [10:32:49] And the password works. Otherwise you won't be able to sign in on idp.wikimedia.org (idp not idm) [10:34:45] slyngs: ahh, login working now on gerrit as well, thanks! [10:34:55] Excellent. [10:34:55] btw, what's the difference between idp and idm? [10:35:21] idp is our central SSO for web services [10:35:32] and idm the identity manager where you manage your account [10:36:09] and we haven't decided on the other 24 combinations of idX yet :-) [10:36:39] got it haha [10:37:14] sd0001: You didn't get an error saying that your email was in use, when signing up? [10:38:47] nope [10:39:43] aw, I would really have liked it to be a "Sure, but then I did this thing you didn't expect".
I got multiple complaints that we don't allow email reuse and you just did it :-) [10:42:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:44:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:04:06] elukey: any idea what the "Intel Ethernet Gadget" is on sretest2001? [11:04:37] https://phabricator.wikimedia.org/P71016 [11:05:43] sorry not Intel - "Insyde Software Corp" [11:07:55] topranks: no idea, first time that I see it :D [11:08:03] ok! [11:08:06] seems something shady :D [11:08:17] there is no issue - i was just curious :P [11:08:17] how did you end up finding it? :D [11:08:21] indeed seems shady [11:08:39] the system sees it as an ethernet interface, showed up in "ip -br addr show" [11:08:46] ah wow [11:09:15] some virtual usb nic or something?? I don't think it should ever matter to us - hopefully! [11:09:39] now that you mention USB, it may be it.. lemme inspect the BIOS config [11:13:00] I see these [11:13:01] FrontUSBPort_s_ Enabled [11:13:01] LegacyUSBSupport Enabled [11:13:01] RearUSBPort_s_ Enabled [11:14:14] and [11:14:16] BootOption_3 UEFI USB Hard Disk [11:14:16] BootOption_4 UEFI USB CD/DVD [11:14:16] BootOption_5 UEFI USB Key [11:14:16] BootOption_6 UEFI USB Floppy [11:14:19] BootOption_7 UEFI USB Lan [11:14:27] that UEFI USB Lan looks weird [11:15:25] yeah I suspect that may be it [11:16:06] could simply be some in-built support for USB-based network interface.
UEFI may also be playing a role [11:18:58] elukey: non-urgent, but I did come across a Dell BIOS toggle which is relevant in what hw data is exposed to the OS which is relevant to our network device naming problem [11:19:05] https://phabricator.wikimedia.org/T347411#10311664 [11:19:21] don't need you to do anything, but if you are playing with the supermicro's and come across anything similar let me know [11:21:06] topranks: ack I'll keep it in mind.. I tried a quick lookup for acpi and didn't find much in the bios [11:21:35] topranks: if you want I can try to disable those usb configs and reboot, so we can check if the interface disappears [11:21:37] ok thanks! [11:22:09] I don't think the additional interface is an issue so no need to spend time on it I think [11:23:33] sretest2001 should be rebooting now [11:23:42] ha ok well let's see [11:23:44] > no need to spend time on it I think [11:23:49] sorry :P [11:23:55] classic sentence from a nerd-snipe [11:23:58] just saying [11:24:00] lol [11:24:02] :D [11:24:27] while we're waiting - I failed to find the acpi info through the redfish api for the Dell's before [11:24:34] though that's not to say it's not in there somewhere [11:24:57] it's basically a firmware ID or something that's exposed to the OS, modern linux if it is present will use it to name the network interface [11:25:08] if it's not exposed the name is based on pcie location [11:25:35] I guess what we need is 1) consistency and 2) a way to determine/predict it for a given system [11:26:37] are there any Dell systems online that booted with UEFI? [11:29:21] I think sretest1001 but not 100% sure [11:30:07] sretest2001 is up [11:30:37] but I think the interface is still there [11:33:12] yeah [11:33:28] probably the bios toggle controls whether the system will try to pxeboot from it or not in the boot order [11:33:54] anyway thanks for that! 
I am done rabbit holing on this for the day I think :P [11:36:56] :D [11:37:34] those options may probably be added to the provision cookbook, do we need to leave any usb functionality enabled? [11:40:10] I mean technically in the boot options we should probably not allow any external boot device (cdrom, usb etc) [11:40:47] but I'm not sure it matters that much, if someone has physical access they can kind of do what they want or change those settings even [11:44:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:03] topranks: ack, we cannot change the Boot order settings on supermicro X12 sadly :( [11:49:07] via redfish I mean [12:37:54] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10311981 (cmooney) @Jgreen @Dwisehaupt I think we have broadly two options for how to proceed today: **O...
[12:46:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:31:56] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10312547 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fd1b13c3-25ae-42de-a138-bb1a39...
[14:51:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:46] elukey: how are things in EFI land? I saw your patch to use the continuous flag. Do you have thoughts on next steps? [15:04:09] jhathaway: o/ [15:04:32] I wanted to work on ms-be2088 but I think something config-wise is wrong, I pinged dcops about it [15:04:33] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10312686 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=6d3e8237-b81b-47ec-a63c-afd9f7... [15:04:43] I wanted to have a chat with you about your lead related to the EFI partitions [15:05:04] very ignorant about it so I can brainbounce but I am not 100% sure about what/how to fix it [15:05:34] the "continuous" flag is probably good long term so we don't accidentally trigger PXE etc.. [15:05:43] but nothing more :( [15:21:41] sorry elukey disappeared into meeting, happy to jump on a call, or discuss here [15:24:24] jhathaway: what do you think about a meeting tomorrow?
I can send an invite, I am a bit into the wikiswarm aws stuff and my head is half fried :D [15:25:46] sounds good [15:53:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:56:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:23:58] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10313129 (cmooney) Migration work is now complete, bastion and all hosts are reachable again following th... [16:44:13] hi all, just got off a call with perforce and some puppet community members. some things that were not in the recent announcement. [16:44:36] * There will be more puppetlabs repos going private [16:44:58] * the ruby gem will stop being published (this kills CI for anything recent) [16:45:25] * there is already work to provide community built packages [16:45:33] * there are mumblings of a fork [16:45:58] * from what i could tell debian/freeBSD may not consider the new changes OpenSource and could pull the packages [16:46:23] hey John! [16:46:25] sigh [16:46:57] there will be a town hall meeting with more of the Perforce team in the coming weeks which should help clarify some of this and i think at that point some of the above decisions will start happening e.g. fork, pulling packages [16:47:18] hi John!
[16:47:22] yes it's a bummer, i'm not sure if this community still has enough active members for a fork [16:47:33] but there are some ex puppetlabs people working on it so who knows [16:47:40] hi cdanis elukey [16:48:06] o/ jbond, thanks so much for the added info [16:48:11] indeed, thanks! [16:48:36] noprobs, jhathaway do you want me to ping you when i know the date of the town hall meeting? [16:48:41] I would love a healthy fork, but I share the concern that the community has shrunk too much to support a ruby + jruby + clojure behemoth [16:48:52] jbond: yes please [16:49:13] will do, there is https://www.puppet.com/community/calendar but i'm not sure it's correct and definitely some of the links are dead [16:56:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:58:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:02:02] topranks: o/ still around by any chance? [17:02:14] elukey: yep [17:02:33] topranks: can I bother you with a question on PXE? [17:02:48] I am working with jayme on an issue with wikikube-ctrl1001 [17:02:54] you're certainly free to try - not something I'm an expert on [17:02:55] ok [17:03:16] so the first attempt failed, since the PXE boot failed with "No media present" [17:03:39] so we checked the BIOS (this is a dell, no uefi etc..)
and we discovered that the NIC set for PXE wasn't the right one [17:03:58] the host got new 10G interfaces a while ago, and maybe we missed some manual config [17:04:06] https://phabricator.wikimedia.org/T379629 [17:04:10] ok [17:04:11] the host booted into PXE/debian-install [17:04:24] we were happy etc.. but then reimage failed for another reason [17:04:39] I've seen that where the pxe nic wasn't right and changing it in the bios fixed it [17:04:47] the port is up on the switch.... [17:04:59] do you know why the reimage failed in d-i? [17:05:03] so we kicked it off again, but this time there seems to be another issue, namely when PXE is forced we just get to a blank screen hanging [17:05:16] rechecked all the settings, they look good [17:05:18] are you using tftp? [17:05:26] yes we tried also that one, no luck [17:05:52] but then Janis mentioned something - Kamila in the past had a similar issue with the same host, but it got "fixed" by itself (namely reimage working as expected) after a day [17:05:53] reimage failed because we manually modified the DHCP config and the cookbook has a sanity check before removing the config (which makes it fail in case of manual modifications) - so unrelated [17:05:55] stick with tftp anyway I would say, given it means we don't need to worry about the nic firmware [17:06:14] on the "screen sitting at prompt" are you looking at the virtual serial console? [17:06:15] so now I am wondering - could it be somehow related to dhcp lease expiring? 
[17:06:37] on install1004 the last entry for the host is: [17:06:39] Nov 12 14:59:13 install1004 dhcpd[1558832]: DHCPREQUEST for 10.64.48.45 (208.80.154.74) from 00:0a:f7:ef:f8:31 via 10.64.48.3 [17:06:44] that is pretty old [17:06:46] we stuck with tftp after the first issue, just to be sure [17:06:48] we tried even 10 mins ago [17:07:24] not really, the dhcp server, if it has an existing lease, but gets another DHCPDISCOVER from the same host, I believe will just scrap the first one [17:07:24] the only thing that I can think of is that for some reason the host tries to DHCPREQUEST but it is left hanging [17:07:42] or DISCOVER yes [17:07:59] could there be anything cache-related on the TOR causing this? [17:08:36] you could tcpdump on the install server [17:08:55] we could do that yes, didn't have the time yet [17:08:59] 'screen sitting at prompt' is from the virtual (VNC) console of idrac. The mgmt console is just blank (not even showing the "Booting from ... ") [17:09:00] but it seems really weird [17:09:17] very strange [17:09:18] most notably when we fixed the boot order it worked perfectly [17:09:24] elukey: not on the switch in this case, _possibly_ on the CR (which is doing the relaying for this connected to legacy switch stack in eqiad row d) [17:09:27] I granted somebody access to the VNC console :) [17:09:30] but that's a bit unlikely [17:09:34] jayme: that was me thank you : ) [17:09:57] so... that looks like host is attempting DHCP [17:10:01] can I give it a reboot? [17:10:04] sure [17:10:05] yes yes [17:10:57] usually https://usercontent.irccloud-cdn.com/file/QpzmtCN8/image.png [17:11:10] ^^ usually right under this it shows the "trying dhcp" bit [17:11:20] * jayme nods [17:11:42] the install server has the config file correct for the host [17:11:48] so it should be responding properly [17:13:33] if there is something to respond to...
[17:14:37] So there is an outside chance the CR is not relaying the DHCP packets properly [17:14:48] but I'm doubtful of that as it's been stable for ages [17:15:01] No DHCP requests are making it to the install server (I did tcpdump) [17:15:21] One thing that does look odd to me is I see that port 2 on the 10G NIC is the one with the switch connection/link [17:15:39] is the BIOS PXEboot set to use port 1 of the NIC as normal? [17:15:43] perhaps that's the issue [17:16:23] https://usercontent.irccloud-cdn.com/file/rGI5f7DE/image.png [17:16:48] oh, wow...look elukey - it's in the web-ui :D [17:17:21] we explicitly configured port 2 ... but maybe the provision cookbook did not do the right thing [17:17:32] oh ok [17:17:43] but before we ran provision, we manually set port 2 - with 4 eyes [17:18:21] perhaps the provision reset it to port 1 [17:18:22] but even now we should have the right port in the boot order IIRC [17:18:39] I'm rebooting just to check if I see anything there [17:19:28] I selected port 2 during provision [17:19:51] and 3B01 (from https://usercontent.irccloud-cdn.com/file/QpzmtCN8/image.png) is port 2 [17:20:22] this smells like a firmware bug [17:20:43] especially from what Kamila experienced during the last reimage [17:23:58] I'd live some audio commentary to my VNC bios live stream :D [17:24:02] *love [17:25:08] haha [17:25:19] well the TL;DR is everything looks right to me [17:25:32] bummer [17:25:34] port 2 does seem to be the only one set for pxe [17:25:47] topranks: thanks a lot for checking! <3 [17:25:50] I've seen it sometimes be wrong in there and our setup didn't fix it, but only with really old idrac [17:26:10] Is UEFI worth a shot??
[17:26:18] * topranks runs to hide under a rock [17:26:23] :D [17:27:01] I was about to ask if we should ask dcops to switch the sfp to port 1 - which sounds like a weird idea as well [17:28:39] well our standard is to always use port 1 - so that's good just for consistency [17:28:59] and while I agree this does seem like a firmware bug, perhaps the card won't hit that bug on port 1 [17:29:09] it does sound like what Kamila got before yes [17:29:27] I also bounced the switch port just in case resetting the interface would do some magic [17:29:30] but no magic :( [17:29:49] so for consistency we could ask dcops to move the cable today, and then tomorrow we retry and/or we upgrade the firmware [17:30:06] jayme: what do you think? [17:30:19] sgtm...at least there is some 'doing' involved :) [17:30:28] we can probably ping someone now in the dcops chan and ask for the cable switching [17:30:47] if it doesn't work, we upgrade the firmware (I'd do it anyway) [17:31:30] yeah, last resort I'd say (as phab states we should stick to the old firmware) [17:31:38] jayme: I need to step afk, we can restart tomorrow morning if you are ok (maybe if you have time try to ping dcops now, Valery/John may be able to sort it out quickly during our night) [17:32:00] I think the phab suggestion was before the --tftp-only flag [17:32:00] will do [17:32:09] yeah, that might be true [17:32:17] IIRC now we shouldn't hit any bug, at least this is what Papaul told me a while ago [17:32:18] thanks for all your help people o/ [17:32:22] o/ [17:32:25] thanks topranks!
[17:32:45] and sorry for the extra nerd snipe late in the afternoon :) [17:58:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:24:47] jayme, elukey: fyi, when I ran into it, I tried several firmware versions including newest (at the time) [18:25:24] thats reassuring :) [18:42:46] It's a BCM57412 card and the firmware is "known good" 21.85.21.91 version [18:42:57] so yeah I'd not really expect much from trying to downgrade or upgrade it [18:48:48] topranks: dcops just switched the cable [18:49:16] it does not show link in the webui, but it does in the bios [18:54:42] kamila_: ack thanks! Very weird then.. [18:54:48] trying to reconfigure it anyways... [18:55:00] jayme: should we try to change the boot order and force pxe? [18:55:17] elukey: done so [18:55:50] fingers crossed [18:55:59] if it works it is really a random thing [18:56:07] most probably something BIOS/firmware related [18:56:13] who knows if this happens with UEFI [18:56:56] it's getting crowded under toprank.s rock [18:57:58] worked :p [18:58:19] heh [18:58:27] what the.... [18:59:07] mmm I don't see anything in the com2 console though [18:59:16] ah no yes it is bootstrapping d-i [18:59:46] what a day... 
[19:00:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:27] so it was a misconfig, but at this point TIL that on dell 10g interfaces we should always use the first slot [19:00:50] for $unknown_firmware_feature [19:01:44] oh, okay, that'd explain why it was only those hosts I suppose '^^ [19:01:54] I'm inclined to reimage once again after the reimage...but given this is kind of a critical system I will not [19:02:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:02:33] kamila_: did you have the same issue on all the ctrl nodes? If so we may have to check the nic configs on 1002 and 1003 [19:03:21] elukey: _almost_ all, IIRC there were one or two exceptions [19:03:40] so 1002 has port 2 connected as well :/ [19:03:52] wouldn't it be funny if one or two of them used port 1 [19:04:12] 1002 had the same issue [19:04:22] so that matches [19:04:42] 1003 uses port 2 as well [19:04:53] so we need more maintenance to safely reimage again [19:05:06] most likely yes... :/ [19:05:37] in codfw too I believe [19:05:40] guess we'll see tomorrow [19:05:43] going back to my baby duties, glad that things got unblocked [19:05:50] have a good evening folks! [19:05:58] thanks elukey <3 [19:05:59] evening!
[19:09:41] in codfw it's just 2002 that uses port 2, the other two are potentially fine [20:02:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:05:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:05:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed