[01:51:00] [wikitech-l] Beta Cluster now lives on beta.wmcloud.org. https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/YDABPV75LADRQCXMJAFWUP256N4EQ25B/
[01:52:05] Bah, I left an incomplete copy-pasta paragraph in there. Too late now :)
[04:15:54] !log soda@tools-bastion-13 tools.yapping-sodium soda built and uploaded a new version
[04:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.yapping-sodium/SAL
[08:27:49] anything up with toolforge at the moment? I get a generic “Wikimedia Toolforge Error” (from tools-proxy-9.tools.eqiad1.wikimedia.cloud) when trying to open any tool (e.g. https://versions.toolforge.org/)
[08:28:18] (ok, looks like bridgebot still has a working network connection at least ^^)
[08:31:28] My tool is also down on toolforge. I am getting a generic error.
[08:32:46] But, `toolforge webservice status` gives me `running`. Also, my server is running. I curl'd the tool, no response even from the tool-bastion.
[08:33:22] seeing the same here
[08:36:00] possibly related to T399261 (Widespread instances down in project deployment-prep)? though deployment-prep ≠ tools
[08:36:01] T399261: Widespread instances down in project deployment-prep - https://phabricator.wikimedia.org/T399261
[08:39:44] I can SSH into tools-proxy-9, the only failed systemd unit is logrotate which judging by the journal has been broken for a long time, probably not related
[08:40:09] load seems fine
[08:44:08] I think tools-proxy-9 times out trying to reach k8s.tools.eqiad1.wikimedia.cloud in turn
[08:44:18] I can SSH into that one too, no high load there either
[08:44:38] I deployed a tool this morning - I hope i didn't break it
[08:44:41] looks like most tools are not responsive right now?
[08:46:03] lucaswerkmeister: maybe a network issue between tools-proxy-9 and k8s?
(just guessing for now)
[08:47:22] dhinus: no idea, I’m also just guessing / poking around (read-only so far) until someone more knowledgeable shows up ^^
[08:47:43] nothing recent in https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL fwiw
[08:48:41] !status many tools not responding
[08:54:08] !status ongoing incident: toolforge tools not responding
[08:54:40] Hi. Is Toolforge experiencing issues? I see 504 errors at my tool https://curator.toolforge.org/
[08:55:17] DaxServer: yes, there is an ongoing incident. we're on it.
[08:55:33] Thanks dhinus
[08:55:49] hm, but `nc k8s.tools.eqiad1.wikimedia.cloud 30000` from tools-proxy-9, and then speaking manual HTTP in there, works just fine and responds instantly
[08:56:07] ok now a curl there also works, which previously timed out
[08:56:16] things seem to have recovered yes
[08:56:16] ah, it’s fully working for me again 🤷
[08:56:27] I didn't do anything though :D
[08:56:33] me neither
[08:58:29] !status ok
[09:05:21] seems to have recovered a little now
[09:07:36] Phab task to track the incident and find the root cause: T399281
[09:07:38] T399281: 2026-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281
[09:12:15] Time tunnel activated, next station: 2026 (re @wmtelegram_bot: T399281: 2026-07-11 Toolforge tools not responding, proxy issue - https://phabricator.wikimedia.org/T399281)
[09:12:26] xd
[09:12:47] LOL
[09:13:51] * dhinus does not know which year we are in
[09:14:42] i think we are in 2025 so i changed the title
[09:15:17] if it's 2026 can chg again tho
[09:20:47] :D
[13:31:31] Hi everyone, someone asked me about the potential issue with storing personal data for my tool. It currently stores relations between Wikimedia CentralAuth IDs and Discord user IDs as an authentication mechanism. It is currently stored on Toolforge.
However per WMCS ToU it doesn't seem to be allowed:
[13:31:31] https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use#7.2_If_this_is_a_Toolforge_Project
[13:31:34] In this case should I switch to using a Cloud VPS and draft a privacy statement instead? Is it that ToolsDB (even private tables) has privacy implications and should not be used to store personal data?
[13:33:53] I might be wrong, but I don't think that toolsdb private tables are different from tables stored in cloudvps instances
[13:38:15] I see why you are confused because that wiki page makes a distinction between Toolforge and non-Toolforge projects
[13:52:27] lucaswerkmeister: If beta is down /now/ I don't think that's related to the flapping earlier (since nothing is currently broken on an infra level)
[13:52:34] can we try just restarting services there?
[13:52:36] Or rebooting VMs?
[13:56:00] I wouldn’t know which ones to reboot tbh
[13:56:29] ah, ok. I will... reboot some things :)
[13:56:33] But I suspect this is unrelated.
[14:07:10] !log deployment-prep rebooting instance-deployment-cache-text08
[14:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[14:30:48] Is it me, or did login.toolforge.org AND bastion.wmcloud.org just go down on me?
[14:31:19] According to the logs it gets to the key auth stage and just hangs and dies so it seems to be talking to the jump hosts.
[14:32:03] ⚙️ Available client authentication methods: publickey,password,keyboard-interactive
[14:32:03] ⚙️ Authentication that can continue: publickey
[14:32:03] 👤 Authenticating using publickey method
[14:32:03] 😨 Connection to "bastion.wmcloud.org" closed with error: end of file
[14:33:02] same down
[14:33:13] looks like a termius user
[14:33:26] Yep. :D
[14:33:29] Termius is awesome
[14:33:46] !status widespread intermittent outage in progress, repairs are underway
[14:33:57] well that didn't do anything
[14:33:58] Well that answers that.
:p
[14:34:57] I'm guessing whatever handles the ssh keys across instances is dead.
[14:43:06] things are slowly coming back, but are still unstable
[14:49:54] My VMs I don't have already open sessions on just die when trying to start the session. It's able to auth though.
[14:50:56] ⚙️ Starting SSH session
[14:50:56] ❗ Can not open a new direct-tcpip channel: Channel open failure (connect failed)
[14:56:25] I see on several hosts `INFO: task jbd2/sdb-8:389 blocked for more than 120 seconds. Not tainted 6.1.0-37-cloud-amd64 #1 Debian 6.1.140-1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.`
[14:57:31] !status ongoing incident: Ceph issues T399281
[14:57:32] T399281: 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures - https://phabricator.wikimedia.org/T399281
[15:12:53] Toolforge down?
[15:14:12] Cloud VPS too 😭 (re @Yetkin: Toolforge down?)
[15:14:28] https://phabricator.wikimedia.org/T399281
[15:17:03] hmm
[15:34:39] My tool is up and running now, thank you :)
[15:36:11] RIP. IABot is still dead in the water.
[15:36:24] Can't even reboot it. It just hangs on start.
[15:43:02] things look slightly better now
[15:44:09] I added some links to "useful dashboards" to the incident doc, under "Logs & other data"
[15:47:50] My tool is up and running as well 😊
[16:25:52] IABot is waking up now.
[17:13:27] !status block storage incident concluded, please ping andrewbogott if you find remaining issues
[17:57:43] andrewbogott: beta cluster availability is still hit and miss for me at the moment :/
[17:58:37] I see what you mean.
[17:58:46] and yet...
I can ssh to all the hosts, so it's no longer in complete lockup
[18:00:45] ...or at least cumin can
[18:01:59] oh yeah, I can if I spell it right
[18:02:28] :D
[18:02:48] I was able to SSH into the hostname in the error message, at least :P
[18:02:55] Request from [redacted] via deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud, ATS/9.2.11
[18:02:55] Error: 504, Connection Timed Out at 2025-07-11 17:56:43 GMT
[18:04:04] I rebooted the cache server
[18:04:13] although it may not have been the problem anyway, probably it's an app server that's down
[18:20:55] Can't find anything obviously wrong, going to assume this is bot dos
[18:32:39] !log admin rebooting cloudceph1013 to see if its missing OSD drive reappears
[18:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[18:38:21] !log admin it didn't
[18:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL