[07:18:55] Mail, DNS, Infrastructure-Foundations, SRE, and 2 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (jcrespo)
[07:21:44] Mail, DNS, Infrastructure-Foundations, SRE, and 2 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (jcrespo) I created this when I saw someone mentioning it on discord. Ping @Vgutierrez @BBlack (I personally have no thought, I didn't know this was a th...
[08:19:51] netops, Infrastructure-Foundations, SRE, Traffic: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574 (ayounsi) Open→Resolved a:ayounsi Awesome, thanks a lot @ssingh I slightly cleaned up the doc (added a mention of the bird2 upgrade) And updated the dashboard at https://g...
[08:30:03] Mail, DNS, Fundraising-Backlog, Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (greg) The email team in fundraising has interest in this topic as well.
[08:35:34] Mail, DNS, Fundraising-Backlog, Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (jcrespo) Probably related: T211404 T167337
[09:32:46] FYI I'm about to reboot sretest1001 if that's ok for you
[09:36:42] sure thing
[09:44:22] moritzm, slyngs, jbond: I've found a quite important regression in the move of the puppet cron to systemd timer. It doesn't run at boot time!
[09:44:33] * volans was just double-verifying it with the above reboot
[09:48:39] I just need to check, but maybe we can simply add an extra "schedule", so an "OnBootSec" something, but I need to check
[09:50:13] yes, that should fix it but I'm not sure if it will affect the splayed 30m schedule
[09:50:40] for puppet specifically we could add a second timer specifically for the on boot run
[09:51:11] puppet is I think the only timer where this is relevant (since the reimage scripts implicitly relied on the cron behaviour)
[09:52:55] We should be able to have both OnCalendar and OnBootSec, but I'll check... once I'm done fighting the Puppet linter
[09:53:35] ack, thanks, as this is kinda breaking the logic of a lot of automation that expects and looks for a successful puppet run after a reboot
[09:54:06] so adding quite some delays to any reboot/roll-reboot/reimage cookbooks
[09:54:58] slyngs: couple of things to check:
[09:55:31] 1) if we can use OnBootSec=0 (or 1) because all the deps are already ensured by the timer/unit requirements or if we need to add any
[09:56:02] 2) what happens when reboot and the splayed OnCalendar are around the same time and the underlying service is already running
[10:19:49] tbh I thought this was already part of the timer, possibly needs some flag to enable it.
[10:20:49] I'm on vacation today though, I could look at it this evening but I'm out currently
[10:21:48] jbond: don't worry, sorry for the ping I forgot you were out. I'm sure we can take care of it. Enjoy the days off
[10:22:00] ack cheers
[10:25:35] volans: We're using the same run-puppet-agent script as before.
I believe that ensures that we don't run two instances at once, if that was the concern
[10:28:11] slyngs: we use /usr/local/sbin/puppet-run that is a different script
[10:28:27] it does check the puppet lock yes
[10:29:16] Yes
[10:29:22] but it might fail because of 2 concurrent apt-get update
[10:29:22] Sorry, yes /usr/local/sbin/puppet-run
[10:29:35] first script starts, checks the lock, no lock, starts apt-get update
[10:29:55] second script starts, checks the lock, no lock yet, tries to start apt-get update and fails because of the APT lock
[10:30:48] but I hope that a single timer with both OnBootSec and OnCalendar does the right thing
[10:31:02] I would at least expect that from systemd... but then it's systemd... so anything can happen ;)
[10:31:26] it would be much better if computers could only do 1 thing at a time
[10:31:30] :)
[10:31:39] lol
[10:31:44] That would solve a lot of problems
[10:35:24] Mentally I want systemd timers to work in a way that it doesn't trigger a service, if it's already running... I'll need to test
[10:42:47] ack, thx
[10:42:59] XioNoX: did you change anything wrt netbox dumps? it just alerted
[10:44:02] https://phabricator.wikimedia.org/P30656
[10:54:55] volans: nop
[11:20:37] Two trucks from my ISP just pulled up in front of the house... the kind they use to fix fibers. I'm a little concerned
[11:32:57] volans: Okay, so: a timer will start a service, starting a service twice does nothing, the service is already running and doesn't need to be started. So it's "completely" safe to have a service trigger at boot, even if the timer will trigger it again right after
[11:51:08] great! let's add the onbootsec then, I'm unsure about the value, if all deps are set correctly the timer should already be activated later on in the boot process when all is set.
I'm not sure what's the best way to replicate (or improve) the previous behaviour
[11:54:35] I think we're better off with OnStartupSec, though
[11:55:21] for VMs and most servers it doesn't make much of a difference but could for some hosts with lots of disks (think dbstore/labstore etc)
[11:55:25] or swift
[12:03:03] isn't it the same thing for root?
[12:03:12] at least from https://www.freedesktop.org/software/systemd/man/systemd.timer.html#Options
[12:06:39] the difference is all the kernel init before systemd starts, I guess an average 10s on baremetal
[12:07:21] but with one minute we have sufficient head room anyway, OnBootSec should also work
[12:10:02] it's safe anyway, since the units started by the systemd timers have "WantedBy=multi-user.target", so systemd is guaranteed to be up anyway
[12:11:27] whatever works and you feel is the best option works for me
[12:11:32] :)
[12:18:45] moritzm: Reading the man page, I think you're right that OnStartupSec might be better. It's a really minor difference for most hosts
[12:19:22] I'll just update the patch
[12:19:29] ack
[12:57:30] volans: https://gerrit.wikimedia.org/r/c/operations/puppet/+/809943 has been merged
[13:15:30] slyngs: great, did it work fine rebooting a real host after the patch?
like one of the sretest for example
[13:16:03] volans: I haven't gotten that far :-)
[13:16:27] moritzm suggested that we re-image sretest1001 to test
[13:17:10] a reboot-single should probably be enough
[13:17:45] it does poll/wait for a successful puppet run on the host
[13:18:04] and should be easy to see if it's polling for long or not, but check first at which time the oncalendar is for that host
[13:18:25] We should just be able to check the timer and see if that has triggered
[13:18:26] to avoid calling it a success when it was the oncalendar run instead of the OnStartupSec one ;)
[13:19:09] I've actually never run a cookbook, so I'm just looking for the documentation
[13:22:24] volans: Does this look about right: sudo -i cookbook sre.hosts.reboot-single -t 'Test Puppet timer' sretest1001.eqiad.wmnet
[13:27:23] no need for -i in sudo, let me check the specific options of this cookbook, I don't know all of them by heart :D
[13:28:08] -r 'Test Puppet timer' (not -t, that's for the task ID, if you have one, but should be optional)
[13:28:40] oh, right. I'll try a reboot, now is perfect, the timer just ran
[13:28:50] the rest is ok
[13:28:55] perfect timing :D
[13:29:52] rebooting
[13:36:22] volans: It works perfectly
[13:36:49] great, thanks for fixing it!
[13:37:06] No problem... I did kinda break it
[13:39:02] :)
[13:41:14] I've been here two months now, so I can't really blame Moritz for my mistakes and oversights anymore
[13:47:33] slyngs: I've been there for 5 years and I'll blame moritzm for my mistakes any time I feel like it
[13:47:45] ;)
[13:48:14] :-)
[13:51:24] happy to take all the blame :-)
[13:54:52] XioNoX: interesting, I was about to suggest to slyngs that after 2 months he can now start to blame you for everything!
[13:55:24] I think you're onto something.
We should have an oncall-like rotation for whom to blame
[14:00:54] CAS-SSO, Infrastructure-Foundations, SRE: Enable webauthn in CAS to replace U2F - https://phabricator.wikimedia.org/T311236 (MoritzMuehlenhoff) p:Triage→Medium
[14:11:31] Ganeti 3.0.2 has been accepted as a backport into the next Bullseye point release: https://release.debian.org/proposed-updates/stable.html (which is great since that allows us to obsolete our internal 3.0.1-2+deb11u0 bullseye-wikimedia package)
[14:16:02] nice!
[15:50:33] Puppet, puppet-compiler, Infrastructure-Foundations: pcc-uploader failing on tools-puppetmaster-02 - https://phabricator.wikimedia.org/T311742 (taavi)
[16:46:29] netops, Infrastructure-Foundations, SRE, Patch-For-Review, cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (wiki_willy)
[17:14:29] FYI the above issue with the netbox dumps was caused by the new virtualization cluster group's custom field that links the cluster group to its SVC IP address; it is not supported by the current pynetbox that we have in prod, which is a bit old.
[17:15:01] for now I've fixed the issue by not dumping that table (not a big deal, also super small). Arz.hel already has a task to upgrade pynetbox anyway.
[17:25:26] thanks!
[18:01:13] Mail, DNS, Fundraising-Backlog, Infrastructure-Foundations, and 3 others: Consider if to support BIMI for wiki mail - https://phabricator.wikimedia.org/T311685 (ssingh) p:Triage→Medium
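
Note on the puppet timer fix discussed between 09:44 and 13:36: the idea settled on in the chat was a single timer carrying both the periodic OnCalendar trigger and an OnStartupSec trigger for the run after boot. A minimal sketch of what such a timer unit might look like is below, rendered from Python so it can be sanity-checked. Everything specific here is an assumption for illustration: the unit name `puppet-agent-timer.service`, the `*-*-* *:07,37:00` splay and the helper `render_puppet_timer` are hypothetical, and the actual values in the merged Gerrit change are not shown in this log. Only `OnStartupSec`, `OnCalendar`, the one minute of head room and `WantedBy=multi-user.target` come from the conversation.

```python
import configparser


def render_puppet_timer(on_calendar: str, on_startup_sec: str, unit: str) -> str:
    """Render a systemd .timer unit with a periodic and a post-boot trigger.

    Hypothetical helper: it only illustrates that one timer can carry
    both triggers, as discussed in the chat.
    """
    lines = [
        "[Unit]",
        "Description=Periodic and post-boot trigger for " + unit,
        "",
        "[Timer]",
        "Unit=" + unit,
        # Periodic, per-host splayed run (value is an assumed example).
        "OnCalendar=" + on_calendar,
        # Extra trigger relative to systemd startup, so a run happens after boot.
        "OnStartupSec=" + on_startup_sec,
        "",
        "[Install]",
        # Target mentioned in the chat at 12:10.
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


# Assumed example values: a 30 minute splay and one minute after startup.
timer_text = render_puppet_timer(
    "*-*-* *:07,37:00", "1min", "puppet-agent-timer.service"
)

# Parse the result back to check that both triggers are present.
parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep the case of systemd option names
parser.read_string(timer_text)

print(timer_text)
```

As noted at 11:32, starting a service that is already running does nothing, so the two triggers firing close together should not produce two concurrent runs.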