[00:05:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:15:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:17:12] topranks: new gnmic release! https://github.com/openconfig/gnmic/releases/tag/v0.39.0 main useful feature for us is "add gNMIc prometheus metric gnmic_target_up" to be able to monitor which devices are not reachable on their gNMI interface
[07:53:56] hey folks! re: UEFI and Supermicro, today I'd like to try these tests
[07:54:31] 1) a couple of reimages of ms-be nodes already reimaged correctly, to verify that the double d-i entry doesn't happen
[07:54:50] 2) force bios-legacy + UEFI again on one node, + reimage
[07:55:52] I need to verify if the issue happens on the first reimage post-provision, or if it also happens post-successful-reimage, because it may be some weird state that the host is in after provision
[07:56:09] who knows, maybe there is the need for an extra chassis reset
[08:00:41] elukey, slyngs: I'd go ahead and remove the legacy irc.wikimedia.org service later, unless you think we should wait more?
I think at this point there's no further need to potentially compare things to the legacy service, everything seems to be working just fine
[08:00:44] ah Jesse already did a lot of tests https://phabricator.wikimedia.org/T371400#10302344
[08:00:57] I'll also move irc.w.o to 1.0.1 in a bit
[08:01:06] moritzm: +1, let's nuke the old stuff
[08:03:22] https://www.youtube.com/watch?v=aCbfMkh940Q (obviously)
[08:03:35] ahahahah yes
[08:06:18] We got surprisingly few complaints, so either it's "Just Working(tm)" or no one is using it all that much. So I think it's safe to remove.
[08:11:07] we had reports (like the huggle thing), so things are clearly working and being used
[08:11:36] That's not a lot though
[08:12:18] it's just ~ 10 bots, there's not a whole lot of people who would complain
[08:12:50] and if things broke with the legacy crap we heard quickly about it since people stop getting notifications for COI notifications
[08:12:57] and if things broke with the legacy crap we heard quickly about it since people stop getting notifications for COI violations
[08:13:09] In terms of percentage it's actually pretty bad then :-)
[08:16:41] still, the percentage for bugs affecting the actual broadcast service is 0% which is quite okay :-)
[08:25:29] Aren't you going to miss spending a week trying to make 17 year old code run on a modern Linux?
[08:26:30] very much, and also the joy of fiddling with the Py2-ircecho service!
[08:27:17] irc.wikimedia.org runs ircstream 1.0.1 now
[08:28:06] It's strange, there's all the old tinkering with old computers, restoring old C64, programming in Assembler. Never once have I heard someone say that they'd like to touch old Python.
[08:28:11] Wuhuu
[08:54:10] the legacy installation is now inactive, I'll keep the VMs around until Monday, then I'll drop them as well (and the unused Puppet code)
[09:22:27] XioNoX: o/ I am logging all my tests for UEFI and double d-i in https://phabricator.wikimedia.org/T371400#10302344 (onward)
[09:23:00] alright, let me read it up
[09:25:03] so far it seems to be something that happens only after provision, once, and then nothing more
[12:05:43] I posted some updates to the task, so far no luck
[12:06:16] I don't have any good lead about the double d-i thing, but the good news (sort of) seems to be that it only happens on the first reimage after the first ever provision
[12:06:24] after that, all works
[12:06:34] * elukey lunch!
[13:01:39] CAS-SSO, Infrastructure-Foundations: Enable Redis backed Ticket Registry for CAS / IDP - https://phabricator.wikimedia.org/T377728#10303593 (SLyngshede-WMF) Open→Resolved
[13:44:02] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10303714 (cmooney) @Jgreen @Dwisehaupt I was doing some prep work on T377996 - looking at step 1 to impor...
[13:56:45] elukey: do you think it could be related to the JBOD config? like it skips the HD boot for some reason then moves on to the next boot item? Either HTTP if it's there or the shell as last hope?
[13:57:54] it could also be worth following up with their support
[14:46:29] elukey: we didn't see that issue, but the "stuck" HTTP boot issue, which could hide this one as it required more boots too
[14:46:58] I haven't tested it in a "clean state" since we merged all the patches though
[14:51:49] XioNoX: okok then I can try that, perfect
[15:15:10] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 3 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10304205 (cmooney) I'd a chat with @Jgreen on irc about the above and he confirmed all those hosts are de...
[15:19:48] XioNoX: something interesting - I provisioned sretest2001 with UEFI, and this time the EFI Shell was set as second boot option
[15:20:25] no sorry, first one, before debian etc..
[15:20:30] elukey: before the disk?
[15:20:34] yeah
[15:21:01] I think I noticed also the double d-i after typing "exit" in the EFI shell
[15:21:07] but I need to retest
[15:21:58] very weird
[15:23:24] and https://www.supermicro.com/manuals/other/redfish-ref-guide-html/Content/general-content/bios-configuration.htm#configuring-boot-order-system-bios shows that we can have our own fixed boot order only with X13+
[15:23:33] so new gen servers
[15:24:59] in that page there is also the UefiBootNext option
[15:25:00] https://www.supermicro.com/manuals/other/redfish-ref-guide-html/Content/general-content/bios-configuration.htm#configuring-uefi-boot
[15:25:07] have you guys tried it?
[15:25:17] I did not
[15:25:23] Hey Jesse!
[15:25:26] morning :)
[15:25:31] afternoon!
[15:26:12] I haven't either
[15:32:34] It's not clear to me how they differ elukey, did you find any good docs?
[15:33:14] jhathaway: nope :( I just saw it and wondered if maybe it hits a different config/code path
[15:35:08] it would be hilarious if on Supermicro BIOS is more stable than UEFI :D
[15:36:39] :D
[15:37:06] jhathaway: I added all my tests to the task, if you want to check, maybe it will trigger some good ideas, not sure what to do now
[15:37:13] besides testing that uefiboot next
[15:37:17] won't hurt
[15:38:21] I gave it a quick read this morning, but I'm going to read it again and try to do some testing, Emperor said ms-be2088 was as of yet untouched?
[15:38:46] exactly yes, the last one
[15:39:35] hopefully it will be the 🪿 that lays the golden egg
[15:41:47] "Luca, I haven't got any issue, it is clearly the operator the problem"
[15:41:54] pretty sure this will be the final root cause
[15:42:22] ha! I wish it was just an operator issue, always easier to blame the user!
[15:42:42] we can always blame Riccardo as catch-all
[15:42:48] :)
[16:03:28] the UefiBootNext option returns a 400 for me
[16:08:14] ok at this point I declare defeat, my brain is fried :D
[16:11:39] sorry :(
[16:15:47] it is fine, we are changing a big thing, it makes sense that we are hitting some bug/"feature" sigh
[16:25:42] hope your head doesn't hurt too much elukey
[16:25:54] even reading scrollback is enough to make my own brain fry :P
[16:26:02] topranks: there's not much in it don't worry :D
[16:26:08] haha
[16:33:13] jhathaway: so https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1088590 seems to work
[16:33:25] (I lied, the nerd in me didn't stop)
[16:34:01] that would also be kinda nice, we force Hdd and we don't allow PXE or other things
[16:34:20] until we reimage again, and in that case we set the Once etc..
[16:34:52] I am testing it on sretest2001, but I am not sure if the double d-i is skipped because it is not the first time or not :D
[16:35:07] the EFI shell set as first boot option is definitely skipped
[16:35:10] interesting
[16:36:05] and now if I check BIOS via redfish
[16:36:06] 'BootModeSelect': 'UEFI',
[16:36:06] 'BootOption_1': 'UEFI Hard Disk:debian',
[16:37:09] very nice
[16:38:25] I think that is worth trying on ms-be2088
[16:39:20] you can test-cookbook it if you want
[16:39:48] one thing to remember - between provision and reimage you'll need to set all the disks to jbod via mgmt console
[16:40:19] ok, how do you get in the setup screen on the supermicros?
[16:42:38] jhathaway: sorry, it took me a while to find the guide that I followed earlier on https://phabricator.wikimedia.org/T371400#10279452
[16:42:55] basically DEL to enter setup and then you can set the Jbods
[16:43:06] save + reset and then you are free to reimage
[16:43:56] thanks
[18:40:51] Mail, Infrastructure-Foundations, Trust-and-Safety: Emails from wikimediats.zendesk.com fails DMARC policy - https://phabricator.wikimedia.org/T378285#10305076 (JAbrams) Hi everyone, thank you for raising this issue. We have multiple addresses in our Zendesk instances using the subdomain wikimediats....
[18:50:52] Mail, Infrastructure-Foundations, Trust-and-Safety: Emails from wikimediats.zendesk.com fails DMARC policy - https://phabricator.wikimedia.org/T378285#10305107 (revi) Just to be clear, using `wikimediats.zendesk.com` isn't a problem by itself; using `wikimedia.org` in From: header without proper/inad...