[06:26:50] Mail, Infrastructure-Foundations, Trust-and-Safety: Emails from wikimediats.zendesk.com fails DMARC policy - https://phabricator.wikimedia.org/T378285#10298424 (revi)
[06:38:43] Mail, Infrastructure-Foundations, Trust-and-Safety: Emails from wikimediats.zendesk.com fails DMARC policy - https://phabricator.wikimedia.org/T378285#10298438 (revi) Updated description to state DKIM failures, too. "From is not in the signing domain", because, as stated by the [DKIM Verifier](https...
[07:23:35] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10298483 (ayounsi) >>! In T364092#10296958, @akosiaris wrote: >> Upgrades should follow the standard process > > The standard process docs are outdated I fear. > >> Depool site...
[09:17:45] hello folks!
[09:18:12] UEFI reimages are working but I noticed today that it may have gone into d-i two times
[09:39:03] reimage for ms-be2081 is stuck in the "generate puppet cert step" and via install console I see
[09:39:06] root@ms-be2081:~# puppet agent -tv
[09:39:08] Exiting; no certificate found and waitforcert is disabled
[09:39:16] that smells like "d-i" ran two times
[09:39:35] elukey: what do you mean by ran two times?
[09:40:12] rebooted twice into d-i but the second time didn't have the puppet version injected
[09:40:16] like it runs, reboots at the end of the first run, and then instead of booting to the OS it HTTP boots again?
[09:40:18] that's how I read it
[09:40:40] XioNoX: I've seen it happening yesterday while checking the mgmt console - d-i finishes the first time, then upon reboot pxe kicks in another time (IIRC the error was media failure/notfound for the disk part)
[09:40:51] ok
[09:40:59] why would it work the second time?
[09:41:24] not sure, also yesterday I reimaged ms-be2083 another time and I didn't see it happening
[09:43:06] my suggestion is to insert some code that checks that PXE override is gone once in d-i like we do for IPMI
[09:43:15] so probably something not going smoothly with "force_http_boot_once()"
[09:46:15] or instead of only instructing it to boot only once using HTTPboot, not trust it and systematically force it to boot from disk
[09:46:32] (after/during the OS install)
[09:47:09] yeah maybe we can force BootSourceOverrideEnabled to disabled right after we see that we are in d-i
[09:47:52] I can file a patch with some temp code on the reimage cookbook, if it works we can integrate it better in spicerack
[09:48:03] yeah +1
[10:10:13] I am test-cookbooking https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1088252
[10:16:45] it seems to have worked, but I'll test it more extensively with other nodes
[10:16:56] nice!
[10:37:00] netops, Infrastructure-Foundations, SRE, Patch-For-Review: BFD won't esablish between QFX in VRF and host from IPv6 link-local - https://phabricator.wikimedia.org/T374379#10298883 (ayounsi) If it's a bug on the switch it's probably worth opening a JTAC ticket. Even if it's not fixed on time for u...
[10:49:01] thinking out loud, maybe it's the type of reboot issued by D-I that doesn't "qualify" to clear the "Once"
[10:58:53] no idea, but if it happens with ipmi too maybe there is something d-i specific indeed
[10:59:11] and/or the implementation of the Once is really terrible in both cases :D
[11:01:52] in the meantime, ok if I merge https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1088252 ?
[11:11:46] elukey: yeah, the logic lgtm and that made me discover this online doc: https://www.supermicro.com/manuals/other/redfish-ref-guide-html/Content/general-content/bios-configuration.htm
[11:18:23] XioNoX: thanks for the link!
[11:19:13] I assume "BIOS" still is the term for all the 'bios' settings regardless right? Like "BIOS mode boot" is called that but you still use the "bios" to enable UEFI and a host of other things?
[11:20:46] yeah
[11:21:13] it's always BIOS, and then BIOS boots into UEFI boot, or into legacy BIOS
[12:30:06] to be clear, the IPMI code was added because every once in a while it was not resetting the override-once
[12:30:11] not because it was not working always
[12:30:19] not sure if that helps or not :)
[12:42:12] I'm trying to run sre.network.configure-switches, but I'm getting an error that the config database has been locked by homer terminal (pid 27581) for half an hour
[12:42:31] is that some hung process or is it actually something really long-running here by one of you?
[12:42:46] cathal
[12:44:10] what does that pid refer to? it's not a pid on the cumin host
[12:44:29] that's on the switch
[12:44:33] the lock is on the switch
[12:44:50] I see a "homer lsw*codfw* commit add lvs config" run on cumin1002
[12:44:54] pid 670104
[12:46:47] ok, thanks
[12:47:45] it might be there asking the operator to say yes ;)
[12:49:04] ITS should send us one of these: https://giphy.com/gifs/HQGzdiNhg52oM
[12:49:16] ahahahaha
[13:05:38] moritzm: sorry, this happens sometimes....
[13:05:44] what host/switch are you running it for?
[13:09:58] the pid is the login session on the actual router/switch, as homer has logged on and locked the config database with provisional modifications, and the device is blocking any other user editing the config until the session with the lock either commits or discards its changes.
[13:10:30] unfortunately my little birdy shifted one cm to the left and was hitting 't', messed everything up
[13:15:14] yeah I also had some changes pending to type "yes" and blocking stuff on asw2-c-eqiad
[13:21:06] there is discussion in T250415, though I wonder if the "batch" function should have its own task separate to that for parallelization?
[13:21:06] T250415: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415
[13:35:07] topranks: thanks! I was able to run it now
[13:35:18] this was for vlan changes for new Ganeti nodes
[13:36:40] moritzm: ah ok, cool!
[13:36:53] I killed the cumin process I was running, must have cleared it
[13:37:25] sometimes the session on the switch itself gets "stuck", and you need to log onto the switch and issue a command
[13:38:07] "request system logout pid "
[13:46:40] noted, thanks
[14:30:35] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10299735 (Papaul)
[14:34:43] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10299748 (Papaul) @ayounsi thanks for the information
[14:34:56] SRE-tools, Infrastructure-Foundations: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258 (joanna_borun) NEW
[14:39:36] SRE-tools, Infrastructure-Foundations: Outdated cookbooks cleanup - https://phabricator.wikimedia.org/T379259 (joanna_borun) NEW
[15:17:08] XioNoX: no luck, I still see the host doing d-i twice :(
[15:17:58] maybe more settings are needed
[16:50:07] SRE-tools, Infrastructure-Foundations: Outdated cookbooks cleanup - https://phabricator.wikimedia.org/T379259#10300550 (Volans) p:Triage→Medium a:Volans
[16:50:22] SRE-tools, Infrastructure-Foundations: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258#10300548 (Volans) p:Triage→Medium a:Volans
[16:56:16] so the mystery gets weirder
[16:56:32] the first time that I kick off the reimage, I see the double d-i
[16:56:49] but then a subsequent reimage seems to work
[16:57:18] define first time :D
[16:57:57] still trying to wrap my head around it, but first time of me provisioning + running reimage on a node
[17:01:18] try to see if the number of reboots (either cold or warm) since the change bios->uefi has a role
[17:03:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:18] could be the culprit, maybe provision requires an extra reboot (for some reason) that messes up the first reimages
[17:05:21] should this alert go to serviceops-collab instead of us? ^^^
[17:08:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:12:06] I'll do more tests tomorrow
[17:12:11] time to log off o/
[17:28:34] elukey: ok :( Maybe we can try to dump the Redfish config before the first reboot, during the first d-i, during the 2nd d-i, and see how they differ
[17:30:48] elukey: happy to help debug as well, if we have a host to test with
[17:33:57] from what I understand from -dcops ms-be2082 is the guinea pig
[17:34:20] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10300817 (cmooney) @Jclark-ctr as discussed I believe we should have a load of copper SFPs from T369557....
[17:35:31] XioNoX: ok, thanks
[18:21:35] netops, DC-Ops, fundraising-tech-ops, Infrastructure-Foundations, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10301105 (cmooney)
[20:59:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:04:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
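Editor's sketch for the 17:28:34 suggestion (dump the Redfish config at each stage and see how the dumps differ) and the 09:47:09 idea of forcing BootSourceOverrideEnabled to disabled. This is only an illustration, not the code in the cookbook patch or in spicerack. The helper names (flatten, diff_dumps, disable_override_payload) and the sample dumps are hypothetical; the property names Boot.BootSourceOverrideEnabled and Boot.BootSourceOverrideTarget come from the standard Redfish ComputerSystem schema. Fetching the dumps from the BMC is left out on purpose.

```python
"""Compare Redfish boot settings captured at different reimage stages."""
from typing import Any

ABSENT = "<absent>"  # marker for a key present in only one of the dumps


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into {"a.b.c": value} pairs for easy comparison."""
    if not isinstance(obj, dict):
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(value, path))
    return flat


def diff_dumps(before: dict, after: dict) -> dict[str, tuple[Any, Any]]:
    """Return {path: (old, new)} for every setting that differs."""
    old, new = flatten(before), flatten(after)
    changed = {}
    for path in sorted(old.keys() | new.keys()):
        pair = (old.get(path, ABSENT), new.get(path, ABSENT))
        if pair[0] != pair[1]:
            changed[path] = pair
    return changed


def disable_override_payload() -> dict:
    """Body for a PATCH on the ComputerSystem resource to drop the override."""
    return {"Boot": {"BootSourceOverrideEnabled": "Disabled"}}


# Made-up example dumps, NOT captured from a real host.
first_di = {"Boot": {"BootSourceOverrideEnabled": "Once",
                     "BootSourceOverrideTarget": "UefiHttp"}}
after_patch = {"Boot": {"BootSourceOverrideEnabled": "Disabled",
                        "BootSourceOverrideTarget": "None"}}

for path, (old_value, new_value) in diff_dumps(first_di, after_patch).items():
    print(f"{path}: {old_value} -> {new_value}")
```

If a dump taken during the second d-i still shows "Once" where the first d-i dump did, that would point at the override not being cleared by the reboot, which is the theory raised at 10:49:01.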