[08:22:09] hi! looks like wikibugs has stopped sending messages to IRC for the last six hours or so. I imagine the irc container has some issue.
[08:22:09] It is on toolforge, but I don't have access to the tool :/
[08:22:24] https://www.mediawiki.org/wiki/Wikibugs has some documentation about the different containers/services
[08:27:18] 👀
[08:29:45] hmm, it seems the process exited, but it's still running somehow
[08:30:11] I think it might have some async tasks that stopped, but the main process did not die with them
[08:33:29] hashar: I restarted wikibugs and a couple of dependent services, things seem not to be crashing now
[08:33:50] dcaro: awesome, thank you! :)
[08:40:26] is there a ticket to track last week's cloudnet2005-dev failures? seems to be struggling again today
[08:42:06] T393366 is the general ticket for that kernel bug
[08:42:06] T393366: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366
[08:50:47] thanks
[08:51:40] i'm rebooting cloudnet2005-dev to a working kernel
[08:52:21] ok, I was confused, I thought this was the operation performed last week, and therefore something else was going on
[10:46:11] taavi: please review https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/22
[12:17:26] also, what do you all think about this one?
[12:17:27] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/236
[12:17:40] it's only missing the gitlab CI/CD variables with the credentials
[12:19:06] the patch moves tofu-infra to a gitlab CI/CD workflow, instead of the current cookbook-based approach
[12:23:03] don't forget to update https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/refs/heads/main/cookbooks/wmcs/vps/create_project.py#270 too
[12:23:29] yes, the cookbook will need some tweaks
[12:24:44] If you all don't think it is a horrible idea, I will put the creds in the gitlab vars and see the pipelines go green
[12:25:37] cc taavi
[12:26:09] does it limit the ways we can deploy stuff? (e.g. if gitlab is down / runners don't work)
[12:27:23] well, the automation will be fully dependent on gitlab. That, however, doesn't prevent us from running tofu elsewhere in a less automated way
[12:32:10] hmm... should https://gitlab.wikimedia.org/groups/repos/cloud/metricsinfra be under https://gitlab.wikimedia.org/groups/repos/cloud/cloud-vps ? (it being part of the cloud-vps offering)
[12:35:23] I have thought the same in the past
[12:39:22] arturo: what credentials would that have?
[12:39:45] taavi: some robot credentials, maybe in the same account as the tofu in the cloudcontrols
[12:40:17] i don't have anything concrete to back this up with, but gitlab's history of security issues makes me a bit uncomfortable with it having full access to absolutely everything we run
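A rough sketch of the cookbook tweak discussed above: instead of running tofu itself, create_project.py could kick off the tofu-infra pipeline on GitLab. This is an illustration only, it uses GitLab's standard pipeline-trigger API, but the project path, token handling and function name are assumptions, not what the actual patch does.

```python
# Hypothetical sketch only: one way a cookbook could start the tofu-infra
# pipeline on GitLab instead of applying tofu locally.
# Project path, token handling and the function name are assumptions.
import requests

GITLAB_API = "https://gitlab.wikimedia.org/api/v4"
PROJECT = "repos%2Fcloud%2Fcloud-vps%2Ftofu-infra"  # URL-encoded project path


def trigger_tofu_pipeline(trigger_token: str, ref: str = "main") -> int:
    """Start a pipeline via GitLab's trigger API and return its id."""
    resp = requests.post(
        f"{GITLAB_API}/projects/{PROJECT}/trigger/pipeline",
        data={"token": trigger_token, "ref": ref},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]
```

The trigger token would presumably live alongside the cookbook's existing secrets rather than in the repo; whether the cookbook should also wait for the pipeline result is one of the open tweaks.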
[12:57:56] taavi: and/or moritzm, I'm doing an upgrade tomorrow which will involve lots of host reboots. Should I double-check that we aren't going to boot into the cursed kernel from T393366 on any of those, or is that definitely solved everywhere?
[12:57:56] T393366: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366
[13:00:22] the only affected hosts in prod are the ones listed in https://phabricator.wikimedia.org/T393366 and since this specifically affects RAID only, it won't impact any VMs either
[13:01:20] and regardless, over the weekend Bookworm 12.11 was released which includes 6.1.1357
[13:01:22] and regardless, over the weekend Bookworm 12.11 was released which includes 6.1.137
[13:01:25] Ok, I'll definitely be rebooting either 4 or 7 of the hosts in that list...
[13:01:35] But those will just go to 137 now when I reboot won't they?
[13:02:42] yeah, but make sure to upgrade linux-image-amd64 first so that 6.1.137 gets pulled in
[13:03:13] this isn't rolled out fleetwide, since we're waiting for 6.1.139 which also fixes the latest round of Intel side channel leaks
[13:03:33] 6.1.139 will likely be out on Wednesday
[13:04:48] hmmm so if I postpone my upgrade to Thursday it may save us some additional reboots
[13:09:43] definitely, yes!
[13:09:58] unless it's something needed to unbreak things, I'd recommend to wait
[13:11:00] * andrewbogott reschedules until the 28th
[13:14:53] thx moritzm
[13:15:15] sgtm!
[13:40:02] I spotted this interesting edit to the toolforge quickstart: https://wikitech.wikimedia.org/w/index.php?title=Help:Toolforge/Quickstart&diff=next&oldid=2291571
[13:40:42] indeed the button is not there.. but what is the recommended workflow, then, if you already have both a SUL account and a dev account?
[13:43:06] dhinus: i suggest you click on the "Newer edit →" link on that page
[13:43:15] oops :)
[13:45:01] I didn't get emails about your edits, for some reason, only about the one I linked above
[13:47:42] not sure since i have those emails turned off entirely, but i wonder if it's only sending an email for the first unseen change to any given page
[13:48:12] yeah that's possible, I think I did receive multiple ones in the past, but maybe only if I click on the first email first...
[14:15:53] topranks: This is probably not the first or last time that I'll ask you this... should I know how to set up the cloud_private addresses for these new cloudvirts or is that something you're happy to do/automate someday? T394671
[14:15:54] T394671: Service implementation for Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T394671
[14:17:27] andrewbogott: yeah I'm happy to do it but also happy to run you through it if you want (it's pretty easy)
[14:17:36] I'm in a meeting now but can look afterwards
[14:17:53] ok -- you can talk me through it if we happen to have overlapping non-meeting time today
[14:34:07] andrewbogott: ok I am done now so if you get a window in the next few hours just ping me
[15:42:21] andrewbogott: when you have a second: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1147801/
[15:44:12] that looks a bit dangerous :(
[15:44:45] I agree
[15:44:49] I'll test it in codfw1dev first
[15:46:41] actually, I meant: the bug you are fixing looks dangerous. But so does changing it
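A minimal illustration of the pre-reboot step described above (upgrade linux-image-amd64 first so the fixed 6.1.137 kernel is installed before rebooting). The snippet just prints the running kernel next to the installed linux-image packages on a Debian host so the comparison is easy to eyeball; it is a sketch, not an existing cookbook or monitoring check.

```python
# Illustration only: show the running kernel next to the installed
# linux-image packages on a Debian host, so it's obvious whether a reboot
# would pick up the fixed kernel. Not an existing cookbook or check.
import platform
import subprocess


def installed_kernels() -> list[str]:
    """List installed linux-image packages with their versions (dpkg)."""
    out = subprocess.run(
        ["dpkg-query", "-W", "-f=${Package} ${Version}\n", "linux-image-*"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]


print("running kernel:", platform.release())
for pkg in installed_kernels():
    print("installed:    ", pkg)
```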
[16:04:44] hello! We're receiving some alerts on -traffic about cloudvirtXXXX instances
[16:05:01] like `FIRING: [7x] PuppetZeroResources: Puppet has failed generate resources on cloudvirt1069:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources`
[16:08:13] fabfur: I think that's just new hardware coming online -- once the netbox dns caches expire everything should be good.
[16:08:15] Sorry for the noise.
[16:08:26] andrewbogott: no prob, thanks
[17:48:29] I'm 99% sure that those quarry alerts are unrelated to those cloudvirt alerts (which are just growing pains for new hardware)
[17:48:41] is anyone looking at Quarry or am I alone by this time of day?
[18:02:53] * dhinus paged cloudvirt1073/ensure kvm processes are running
[18:03:55] andrewbogott: the page comes from Nagios, so we need to downtime there (if it's expected)
[18:04:18] it's expected but also over
[18:05:04] ack
[18:05:35] no idea about quarry, but I saw some quarry alerts during the weekend
[18:06:47] sorry, I was keeping on top of those alerts but then needed to eat lunch and they got away from me
[22:50:18] well that involved about 30x as many alert emails as it needed to
[23:45:16] Striker just had some blips talking to LDAP. I was logged in, got error pages, dropped cookies, could not auth for a few minutes. Things seem to be working again, but I wonder if the problems from the Toolforge bastions are becoming more widespread.
[23:46:03] It feels like it has been a number of years since we were fighting to keep ldap working every day. Let's hope it doesn't get back to that ugly state.
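A rough connectivity probe for LDAP blips like the Striker ones above, using the ldap3 library. The host name and base DN are placeholders, not the actual Wikimedia LDAP endpoints, and this is only a sketch of the kind of check one might script.

```python
# Rough LDAP reachability probe; the host and base DN below are placeholders,
# not the actual Wikimedia LDAP values.
from ldap3 import ALL, BASE, Connection, Server


def ldap_alive(host: str, base_dn: str) -> bool:
    """Anonymous bind plus a trivial base search; False on any failure."""
    try:
        conn = Connection(
            Server(host, use_ssl=True, get_info=ALL),
            auto_bind=True,
            receive_timeout=5,
        )
        return conn.search(base_dn, "(objectClass=*)", search_scope=BASE)
    except Exception:
        return False


if __name__ == "__main__":
    print(ldap_alive("ldap.example.org", "dc=example,dc=org"))
```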