[04:47:41] Already posted in the xtools channel but xtools has been 503ing for about 2 hours according to grafana https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=xtools&var-instance=All&from=1712184446132&to=1712206046133&theme=dark&viewPanel=364
[13:47:33] !log deltaquad@tools-sgebastion-10 tools.stewardbots ./stewardbots/StewardBot/manage.sh restart # RC reader not reading RC
[13:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL
[14:33:28] wikibugs might need a kick, not seen much gerrit activity
[14:33:48] * TheresNoTime can do it in a bit if someone else doesn't get to it
[14:38:19] !log samtar@tools-sgebastion-10 tools.wikibugs Restart all the jobs!
[14:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[15:00:33] TheresNoTime: "restart all" is almost never the right thing to do for wikibugs these days. I will try to make some time to update the wiki docs.
[15:00:59] bd808: ah, darn.. sorry!
[15:02:34] (totally on me, I didn't even bother checking the wiki docs before doing that — will bear this in mind in the future)
[15:02:37] There are now 5 different processes collaborating to make wikibugs work and there is no way in current Toolforge automation to make these things start in a specified order.
[15:03:37] to my knowledge, the thing that still manages to break is the `irc` task's connection to Redis
[15:03:57] I hope to remove Redis from the loop sometime tomorrow
[15:04:49] TheresNoTime: also no harm, no foul about the "restart all" muscle memory. I just noticed the !log and wanted to make you more aware
[15:08:05] ^^
[16:45:08] aside, there seems to be a recentish bug in wikibugs with the colors intermittently getting messed up, or is it intentional logic I don't understand?
[16:47:33] The colours mean something
[16:47:38] What does messed up mean
[16:48:46] I would guess Nikerabbit means T360353
[16:48:47] T360353: Hashar does not like grey foreground color for distinguishing closed status events - https://phabricator.wikimedia.org/T360353
[16:50:55] yep that's it. would never have guessed in my own
[18:30:37] I see some puppet issues that were being discussed yesterday, but I'm still seeing it fail for new instances.
[18:30:58] I see some puppet issues that were being discussed yesterday got fixed, but I'm still seeing it fail for new instances.
[18:31:52] !log lucaswerkmeister@tools-sgebastion-10 tools.bridgebot Double IRC messages to other bridges
[18:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[18:37:59] Hi all, I'm an engineer on the Catalyst project and have some questions about Puppet errors I got last night. The email said "[Cloud VPS alert][catalyst] Puppet failure on patchdemo.catalyst.eqiad1.wikimedia.cloud (172.16.5.70)" as well as other projects I made that use Puppet. Should we be concerned about it or can we just ignore this? I only got one email per project.
[18:42:17] taavi: Can I also please get some help with https://phabricator.wikimedia.org/T361517, some stories are blocked on this. Thank you!!
[19:08:43] birdcup: thanks for checking in. There was a surprise cert expiration yesterday which caused a bunch of failures and warnings (some of them escalated by my failed attempts to renew). Everything should be working properly now but if you /still/ have VMs with broken puppet please let me know.
[19:30:50] Thanks andrewbogott <3
[19:34:51] Since I didn't get any more e-mails I think it's safe to say everything's working fine now. Thank you!!
[19:36:45] andrewbogott: I'm still having issues for new instances - PuppetAgentNoResources and PuppetAgentStaleLastRun are showing on grafana - I don't think the initial run is working correctly
[19:37:21] JJMC89: that's probably the timeout problem folks noticed yesterday
[19:37:41] JJMC89: yeah, I need to build a new base image but there's a chicken/egg issue because I need puppet to work to build them. That is likely https://phabricator.wikimedia.org/T361749
[19:38:08] I am working on it but also receiving a new urgent ping every 10 or so minutes today :(
[19:42:58] logs have 'cloud-final.service: start operation timed out. Terminating.' and/or 'Failed to start cloud-init.target' so probably
[19:53:15] yep
[19:58:53] will the instances be recoverable after the issue is fixed or should I delete them?
[20:05:43] you should delete them
[20:23:50] bd808: that timeout issue turns out to be caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016345 . Unexpected!
[20:25:24] andrewbogott: huh. that does seem like spooky action at a distance
[20:25:55] yeah!
[20:26:18] * bd808 also wonders what that package breaks for ganeti boxen
[21:28:30] JJMC89: you should be able to build Bookworm hosts now. It'll be slow but it works.
[21:36:48] success \o/
[21:51:20] psssssst bd808 our mutual frenemy has decided to stop working last night :P ..could do with a kick in the ass :P
[21:59:00] stemoc: you're going to have to remind me which sad tool that was... :)
[22:00:35] lol FlickreviewR_2 :)
[22:01:40] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.yifeibot/SAL
[22:03:21] !log bd808@tools-sgebastion-10 tools.yifeibot `kubectl delete pod flr-6d74b958d9-4ztff` after reports of FlickreviewR 2 not working on IRC
[22:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.yifeibot/SAL
[22:03:35] hopefully that helps.
[22:05:00] I will leave this ping for zhuyifei1999_ again too, just in case they notice and have some time to do something fancier than a reboot.
[22:24:54] or even if he can't fix the issue, maybe allow admins on commons with the access to atleast reboot it somehow..
[22:35:58] stemoc, you would have to talk to one of the existing maintainers for that https://toolsadmin.wikimedia.org/tools/id/yifeibot
[22:38:31] oooh. multichill might be a reasonable person to poke stemoc.
[22:44:33] hm, is there a delay for syncing of ssh keys or such? i can get into bastion.wmcloud.org but i'm getting my key rejected trying to go to media-streaming.media-streaming.eqiad1.wikimedia.cloud from there, and i'm not sure which of several moving pieces i've broken ;)
[22:45:45] i added keys on wikitech and it seems to use that one for getting me into the bastion
[22:45:56] so it may be that the vm just isn't fully set up yet or needs to re-run its sync of pubkeys
[22:48:19] bvibber: Puppet needs to run on the instance, and we are actively having problems with that for new instances.
[22:48:34] * bd808 peeks into media-streaming.media-streaming.eqiad1.wikimedia.cloud
[22:49:37] !log media-streaming Forced a puppet run on media-streaming.media-streaming.eqiad1.wikimedia.cloud
[22:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Media-streaming/SAL
[22:50:59] bvibber: I'm watching auth.log on the media-streaming.media-streaming.eqiad1.wikimedia.cloud instance if you want to try it again
[22:51:21] I don't see your username in the existing log there...
[23:06:17] whee
[23:06:37] yeah still nothing which may mean i'm doing something dumbass
[23:08:21] wait did i gaslight myself into thinking this worked to bastion before? was i wrong hehe
[23:09:11] ok i think my ssh config is broken
[23:09:57] ba-bam!
[23:10:04] ok awesome, that did it
[23:10:13] sweet
[23:10:19] bd808: thanks for the puppet kick! between that and fixing the config for the bastion i'm good
[23:10:27] now into my vm :D
[23:10:41] bd808: just want to confirm that new vps hosts are still seeing puppet issues. is that correct?
[23:10:59] i'm ok continuing to wait for a fix.
[23:11:57] dwisehaupt: I see T361749 is closed. That may mean things work as expected now.
[23:11:57] T361749: cloud-init timeout too short on Bookworm - https://phabricator.wikimedia.org/T361749
[23:12:37] ok. i'll try a fresh build. thanks!
[23:13:01] AntiComposite, yeah all maintainers are long gone
[23:14:17] Multichill edited yesterday
[23:14:37] stemoc: multichill is at least around the movement still. I see him now and then in the #wikimedia-hackathon channel (usually on the Telegram side of the bridge)
[23:14:41] Stang edited last week
[23:15:07] Steinsplitter edited a month ago
[23:15:25] WhitePhosphorus edited two days ago
[23:15:46] and Zhuyifei1999 still sits on the committee that handles abandoned tools
[23:16:35] yeah and all have been directly or via the Flickreviewr talk page have been pinged the last 3 months but neither have responded..
[23:19:20] what makes commons tools so cursed that everyone abandons them?
[23:19:48] (for context, https://wikitech.wikimedia.org/wiki/Help:Toolforge/Abandoned_tool_policy is the policy on forcibly taking over abandoned tools)
[23:20:00] bd808, commons
[23:20:23] (I say this as a Commons admin and bot operator)
[23:20:24] AntiComposite: the membership of that committee hasn't changed ever, so take that sign of life with many grains of salt
[23:20:58] also would you like to be on the committee AntiComposite? :)
[23:21:07] i'll ping them all on the relevant commons thread..
[23:21:58] bd808, given that I'm taking a principled opposition to signing the Volunteer NDA, no
[23:22:08] okey doke
[23:22:31] I assume I don't want to know what is horrible about the current NDA
[23:23:21] ah. i think i see what my issue was. i deleted and recreated a failed vps build and the puppet server still had the old cert associated. probably just need to wait longer for the server to delete the setup.
[23:24:32] not interested in doxxing myself to the foundation unless they're going to pay me
[23:28:39] dwisehaupt: reusing names can trip over a number of potential leaks. We generally recommend that all instances get a unique name across all time.
[23:29:21] monotonic increasing sequences are one way to handle that (mything-01, mything-02, ...)
[23:29:25] ah. ok. cool. yeah, i've been testing clean builds of the community-crm host and haven't run across that issue yet.
[23:29:32] i can easily adjust.
[23:29:34] thanks!
[23:30:34] when things work it works, but when things break it can be a pain to manually clean up bascially
[23:31:29] yeah. i've got a copy/paste bit i use on our local puppetmaster when things go that way. :)
[23:45:53] hrm, wonder if i'm doing something wrong with this web proxy
[23:47:09] ugh also apache doesn't like me, but the proxy isn't even passing through to my apache error page :D
[23:48:43] eg https://media-streaming.wmcloud.org/index.html