[08:22:09] hi! looks like wikibugs has stopped sending messages to IRC for the last six hours or so. I imagine the irc container has some issue.
[08:22:09] It is on toolforge, but I don't have access to the tool :/
[08:22:24] https://www.mediawiki.org/wiki/Wikibugs has some documentation about the different containers/services
[08:27:18] 👀
[08:29:45] hmm, it seems the process exited, but it's still running somehow
[08:30:11] I think it might have some async tasks that stopped, but the main process did not die with them
[08:33:29] hashar: I restarted wikibugs and a couple of dependent services, things seem not to be crashing now
[08:33:50] dcaro: awesome, thank you! :)
[08:40:26] is there a ticket to track last week's cloudnet2005-dev failures? seems to be struggling again today
[08:42:06] T393366 is the general ticket for that kernel bug
[08:42:06] T393366: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366
[08:50:47] thanks
[08:51:40] i'm rebooting cloudnet2005-dev to a working kernel
[08:52:21] ok, I was confused, I thought this was the operation performed last week, and therefore something else was going on
[10:46:11] taavi: please review https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/22
[12:17:26] also, what do you all think about this one?
[12:17:27] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/236
[12:17:40] it's only missing the gitlab CI/CD variables with the credentials
[12:19:06] the patch moves tofu-infra to a gitlab CI/CD workflow, instead of the current cookbook-based approach
[12:23:03] don't forget to update https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/refs/heads/main/cookbooks/wmcs/vps/create_project.py#270 too
[12:23:29] yes, the cookbook will need some tweaks
[12:24:44] If you all don't think it is a horrible idea, I will put the creds in the gitlab vars and see the pipelines go green
[12:25:37] cc taavi
[12:26:09] does it limit the ways we can deploy stuff? (e.g. if gitlab is down / runners don't work)
[12:27:23] well, the automation will be fully dependent on gitlab. That, however, doesn't prevent us from running tofu elsewhere in a less automated way
[12:32:10] hmm... should https://gitlab.wikimedia.org/groups/repos/cloud/metricsinfra be under https://gitlab.wikimedia.org/groups/repos/cloud/cloud-vps ? (it being part of the cloud-vps offering)
[12:35:23] I have thought the same in the past
[12:39:22] arturo: what credentials would that have?
[12:39:45] taavi: some robot credentials, maybe in the same account as the tofu in the cloudcontrols
[12:40:17] i don't have anything concrete to back this up with, but gitlab's history of security issues makes me a bit uncomfortable with it having full access to absolutely everything we run
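A rough sketch of the cookbook tweak discussed above: instead of running tofu itself, create_project.py could kick off the tofu-infra pipeline on GitLab. This is an illustration only, it uses GitLab's standard pipeline-trigger API, but the project path, token handling and function name are assumptions, not what the actual patch does.

```python
# Hypothetical sketch only: one way a cookbook could start the tofu-infra
# pipeline on GitLab instead of applying tofu locally.
# Project path, token handling and the function name are assumptions.
import requests

GITLAB_API = "https://gitlab.wikimedia.org/api/v4"
PROJECT = "repos%2Fcloud%2Fcloud-vps%2Ftofu-infra"  # URL-encoded project path


def trigger_tofu_pipeline(trigger_token: str, ref: str = "main") -> int:
    """Start a pipeline via GitLab's trigger API and return its id."""
    resp = requests.post(
        f"{GITLAB_API}/projects/{PROJECT}/trigger/pipeline",
        data={"token": trigger_token, "ref": ref},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]
```

The trigger token would presumably live alongside the cookbook's existing secrets rather than in the repo; whether the cookbook should also wait for the pipeline result is one of the open tweaks.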
[12:57:56] taavi: and/or moritzm, I'm doing an upgrade tomorrow which will involve lots of host reboots. Should I double-check that we aren't going to boot into the cursed kernel from T393366 on any of those, or is that definitely solved everywhere?
[12:57:56] T393366: Regression in RAID10 software RAID with 6.1.135 - https://phabricator.wikimedia.org/T393366
[13:00:22] the only affected hosts in prod are the ones listed in https://phabricator.wikimedia.org/T393366 and since this specifically affects RAID only, it won't impact any VMs either
[13:01:20] and regardless, over the weekend Bookworm 12.11 was released which includes 6.1.1357
[13:01:22] and regardless, over the weekend Bookworm 12.11 was released which includes 6.1.137
[13:01:25] Ok, I'll definitely be rebooting either 4 or 7 of the hosts in that list...
[13:01:35] But those will just go to 137 now when I reboot won't they?
[13:02:42] yeah, but make sure to upgrade linux-image-amd64 first so that 6.1.137 gets pulled in
[13:03:13] this isn't rolled out fleetwide, since we're waiting for 6.1.139 which also fixes the latest round of Intel side channel leaks
[13:03:33] 6.1.139 will likely be out on Wednesday
[13:04:48] hmmm so if I postpone my upgrade to Thursday it may save us some additional reboots
[13:09:43] definitely, yes!
[13:09:58] unless it's something needed to unbreak things, I'd recommend to wait
[13:11:00] * andrewbogott reschedules until the 28th
[13:14:53] thx moritzm
[13:15:15] sgtm!
[13:40:02] I spotted this interesting edit to the toolforge quickstart: https://wikitech.wikimedia.org/w/index.php?title=Help:Toolforge/Quickstart&diff=next&oldid=2291571
[13:40:42] indeed the button is not there.. but what is the recommended workflow, then, if you already have both a SUL account and a dev account?
[13:43:06] dhinus: i suggest you click on the "Newer edit →" link on that page
[13:43:15] oops :)
[13:45:01] I didn't get emails about your edits, for some reason, only about the one I linked above
[13:47:42] not sure since i have those emails turned off entirely, but i wonder if it's only sending an email for the first unseen change to any given page
[13:48:12] yeah that's possible, I think I did receive multiple ones in the past, but maybe only if I click on the first email first...
[14:15:53] topranks: This is probably not the first or last time that I'll ask you this... should I know how to set up the cloud_private addresses for these new cloudvirts or is that something you're happy to do/automate someday? T394671
[14:15:54] T394671: Service implementation for Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T394671
[14:17:27] andrewbogott: yeah I'm happy to do it but also happy to run you through it if you want (it's pretty easy)
[14:17:36] I'm in a meeting now but can look afterwards
[14:17:53] ok -- you can talk me through it if we happen to have overlapping non-meeting time today
[14:34:07] andrewbogott: ok I am done now so if you get a window in the next few hours just ping me
[15:42:21] andrewbogott: when you have a second: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1147801/
[15:44:12] that looks a bit dangerous :(
[15:44:45] I agree
[15:44:49] I'll test it in codfw1dev first
[15:46:41] actually, I meant: the bug you are fixing looks dangerous. But so does changing it
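A minimal illustration of the pre-reboot step described above (upgrade linux-image-amd64 first so the fixed 6.1.137 kernel is installed before rebooting). The snippet just prints the running kernel next to the installed linux-image packages on a Debian host so the comparison is easy to eyeball; it is a sketch, not an existing cookbook or monitoring check.

```python
# Illustration only: show the running kernel next to the installed
# linux-image packages on a Debian host, so it's obvious whether a reboot
# would pick up the fixed kernel. Not an existing cookbook or check.
import platform
import subprocess


def installed_kernels() -> list[str]:
    """List installed linux-image packages with their versions (dpkg)."""
    out = subprocess.run(
        ["dpkg-query", "-W", "-f=${Package} ${Version}\n", "linux-image-*"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]


print("running kernel:", platform.release())
for pkg in installed_kernels():
    print("installed:    ", pkg)
```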
[16:04:44] hello! We're receiving some alerts on -traffic about cloudvirtXXXX instances
[16:05:01] like `FIRING: [7x] PuppetZeroResources: Puppet has failed generate resources on cloudvirt1069:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources`
[16:08:13] fabfur: I think that's just new hardware coming online -- once the netbox dns caches expire everything should be good.
[16:08:15] Sorry for the noise.
[16:08:26] andrewbogott: no prob, thanks
[17:48:29] I'm 99% sure that those quarry alerts are unrelated to those cloudvirt alerts (which are just growing pains for new hardware)
[17:48:41] is anyone looking at Quarry or am I alone by this time of day?
[18:02:53] * dhinus paged cloudvirt1073/ensure kvm processes are running
[18:03:55] andrewbogott: the page comes from Nagios, so we need to downtime there (if it's expected)
[18:04:18] it's expected but also over
[18:05:04] ack
[18:05:35] no idea about quarry, but I saw some quarry alerts during the weekend
[18:06:47] sorry, I was keeping on top of those alerts but then needed to eat lunch and they got away from me
[22:50:18] well that involved about 30x as many alert emails as it needed to
[23:45:16] Striker just had some blips talking to LDAP. I was logged in, got error pages, dropped cookies, could not auth for a few minutes. Things seem to be working again, but I wonder if the problems from the Toolforge bastions are becoming more widespread.
[23:46:03] It feels like it has been a number of years since we were fighting to keep ldap working every day. Let's hope it doesn't get back to that ugly state.
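A rough connectivity probe for LDAP blips like the Striker ones above, using the ldap3 library. The host name and base DN are placeholders, not the actual Wikimedia LDAP endpoints, and this is only a sketch of the kind of check one might script.

```python
# Rough LDAP reachability probe; the host and base DN below are placeholders,
# not the actual Wikimedia LDAP values.
from ldap3 import ALL, BASE, Connection, Server


def ldap_alive(host: str, base_dn: str) -> bool:
    """Anonymous bind plus a trivial base search; False on any failure."""
    try:
        conn = Connection(
            Server(host, use_ssl=True, get_info=ALL),
            auto_bind=True,
            receive_timeout=5,
        )
        return conn.search(base_dn, "(objectClass=*)", search_scope=BASE)
    except Exception:
        return False


if __name__ == "__main__":
    print(ldap_alive("ldap.example.org", "dc=example,dc=org"))
```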