[13:01:30] !log lucaswerkmeister@tools-bastion-15 tools.lexeme-forms deployed 54c6749c45 (l10n updates: fi)
[13:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lexeme-forms/SAL
[15:49:01] !log copypatrol copypatrol-backend-prod-02 sudo systemctl restart copypatrol-backend-check-changes T411364
[15:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Copypatrol/SAL
[15:49:05] T411364: CopyPatrol has stopped December 1, 2025 - https://phabricator.wikimedia.org/T411364
[17:39:10] hrm, horizon seems inaccessible to me from the browser.
[17:39:32] yea, down for me
[17:41:02] "Node cloudweb1003 is down" in the -feed channel
[17:41:28] andrewbogott, dhinus: ^ cloudweb1003 down?
[17:41:36] it seems Andrew is provisioning a new cloudweb server
[17:41:54] the 1003 down thing is expected (switching to uefi)
[17:42:09] the horizon down thing is unexpected, because 1004 is up and running
[17:42:15] seems not to be getting traffic
[17:43:04] andrewbogott: fwiw please actually de-pool cloudweb* boxes in conftool before taking them hard down
[17:44:08] ok -- is that the reason why 1004 isn't getting traffic or unrelated?
[17:44:20] labweb lists only 1004 as pooled though at https://config-master.wikimedia.org/pybal/eqiad/labweb-ssl ?
[17:44:39] because I just depooled 1003
[17:44:49] ack
[17:45:07] taavi@lvs1020 ~ $ curl --connect-to ::cloudweb1004.wikimedia.org https://toolsadmin.wikimedia.org:7443/
[17:45:07] upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused
[17:45:27] expected it to be listed just with enabled: False
[17:45:52] striker is giving 500s
[17:46:04] (the command above is wrong, `curl -v --connect-to ::cloudweb1004.wikimedia.org:7443 https://toolsadmin.wikimedia.org/` gives a better error message)
[17:46:27] andrewbogott: it'd saved horizon in this case
[17:46:38] striker is failing to talk to memcached
[17:47:14] sorry, I'm not following "it'd saved horizon in this case"
[17:48:11] * andrewbogott sort of wishes the convert-to-uefi cookbook had the option of reverting
[17:48:38] !status Horizon and Toolsadmin unavailable
[17:48:56] taavi: I can revert 1004 to bookworm but it'll take a while; do you see an obvious fix otherwise?
[17:49:02] (all this is working on trixie in codfw1dev)
[17:49:07] striker seems to be trying to talk to memcached on :11212 but there's nothing there
[17:49:19] there is no striker in toolsbeta
[17:49:23] s/toolsbeta/codfw1dev/
[17:49:55] sorry 'all this' meaning Horizon
[17:50:49] these hosts seem to run mcrouter on 11213, why is striker expecting a different port?
[17:52:44] I don't know, still digging. But, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1213539
[17:53:48] hrm, I bet 11212 was nutcracker and we thought only wikitech used that and cleaned up those manifests after wikitech was gone
[17:53:55] https://gerrit.wikimedia.org/r/c/operations/puppet/+/592276
[17:53:56] yep!
[17:53:58] that's why
[17:54:02] so my patch should help
[17:54:22] yeah, let's do that
[17:54:57] and I filed T376277 all the way back then but it wasn't picked up so we didn't notice until now
[17:54:57] T376277: Reimage cloudweb hosts to trixie - https://phabricator.wikimedia.org/T376277
[17:55:10] is the horizon issue because the lvs healthcheck is only based on striker?
[17:55:28] or do we have two problems?
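A quick sketch of how the port mismatch described above could be confirmed on a cloudweb host. The port numbers come from the conversation (11212 was the old nutcracker port, 11213 is mcrouter); shell access, the `ss`/`nc` tools, and a loopback target are assumptions, not taken from the log.

```bash
# What is actually listening on the memcached-ish ports (expect only 11213, mcrouter)
sudo ss -tlnp | grep -E ':1121[23]'

# mcrouter should answer the memcached text protocol on 11213
printf 'version\r\nquit\r\n' | nc -q 1 127.0.0.1 11213

# Expected to fail: 11212 was nutcracker, removed when the wikitech manifests were cleaned up
printf 'version\r\nquit\r\n' | nc -q 1 127.0.0.1 11212
```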
[17:55:35] it's a single lvs service, yes
[17:55:57] ok, patch is applying
[17:56:20] I applied it by hand to make it faster, but pybal still sees it as down?
[17:57:06] needs a docker restart maybe?
[17:57:24] oh, puppet just did that
[17:57:25] no, this is the "all hosts being down at once confuses pybal" thing I think
[17:58:43] meaning it despairs and quits checking?
[18:00:19] pybal gets confused with its current internal state and what's in lvs getting out of sync
[18:00:43] and you bumped it somehow?
[18:00:57] (sites are back)
[18:01:35] https://sal.toolforge.org/log/Tx0S25oBvg159pQrPrBD
[18:02:04] wow, that's quite a bug
[18:02:12] but pybal is on the way out I suppose
[18:03:42] !status ok
[18:03:54] well I'm fairly sure I had a part of that, by setting 1003 to state=inactive (instead of state=no) while it was pooled (but down) due to the depool threshold which removed it entirely while it was the only thing in the LVS pool
[18:05:35] hm, why does pybal still see 1004 as down?
[18:06:32] Dec 01 18:06:16 lvs1020 pybal[1612990]: [labweb-ssl_7443 ProxyFetch] WARN: cloudweb1004.wikimedia.org (enabled/partially up/pooled): Fetch failed (https://toolsadmin.wikimedia.org:7443/), 0.078 s
[18:06:47] that check is failing
[18:07:23] yes but `curl -v --connect-to ::cloudweb1004.wikimedia.org:7443 https://toolsadmin.wikimedia.org/` on the same host shows a 200
[18:07:34] shows a 503
[18:07:58] vgutierrez@lvs1020:~$ curl --connect-to ::$(dig +short cloudweb1004.wikimedia.org):7443 https://toolsadmin.wikimedia.org:7443/
[18:07:58] upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused
[18:09:01] hmm, `curl -v --connect-to ::cloudweb1004.wikimedia.org:7443 https://toolsadmin.wikimedia.org/` shows a 200
[18:09:20] is that port number being sent with the request? and has envoy behaviour changed for that to make a difference?
[18:10:58] yeah.. pybal is sending `Host: toolsadmin.wikimedia.org:7443`
[18:11:12] no idea regarding envoy changes lately
[18:11:40] Is that check happening on v4 or v6? The service seems to be listening on v4 and not v6
[18:12:09] :) v4
[18:12:29] there is no such thing as v4 -> v6 for our load balancers
[18:13:41] andrewbogott: can I ask you to please deal with that? /me shouldn't be here at this time
[18:13:55] yep! thank you for appearing
[18:14:04] (divergence from above) noticed the horizon problem while doing reboots. Did a reboot just now and I've got an instance stuck in "hard reboot" is there a workaround for that? Or should I file a task?
[18:14:47] thcipriani: unlikely to be related to the horizon thing but please file a task (or just ping me here in a few)
[18:15:42] andrewbogott: ack, thanks! docs are saying something about "reset-state" should I try that?
[18:16:04] thcipriani: yes but it's just a cli thing so depends on if you have easy access to that
[18:16:28] yep, I do have access, if that's safe to try, I can try that first.
[18:16:35] thcipriani: one thing at a time please
[18:17:00] vgutierrez: I need a bit more context; where are you looking to see that pybal thinks the host is down?
[18:17:32] andrewbogott: pybal log..
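A minimal sketch of the "reset-state" approach mentioned for the instance stuck in "hard reboot", assuming an authenticated OpenStack CLI environment with admin rights; the instance name is hypothetical and not from the log.

```bash
# Clear the stuck task/vm state back to ACTIVE, then retry the reboot
openstack server set --state active example-instance-01
openstack server reboot --hard example-instance-01

# Confirm the state and task state afterwards
openstack server show example-instance-01 -f value -c status -c OS-EXT-STS:task_state
```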
TL;DR your trixie realserver isn't able to make healthchecks happy
[18:17:44] so I'd fix that
[18:18:15] vgutierrez@lvs1020:~$ journalctl -u pybal.service --since=-15m --grep cloudweb
[18:18:44] and a curl reproducer
[18:18:46] vgutierrez@lvs1020:~$ curl --connect-to ::$(dig +short cloudweb1004.wikimedia.org):7443 https://toolsadmin.wikimedia.org:7443/ -v -o /dev/null 2>&1 |grep -i 503
[18:18:46] < HTTP/1.1 503 Service Unavailable
[18:19:25] thx
[19:20:43] vgutierrez: I'm clearly missing something important. port 7443 isn't just the health check, it's also the service isn't it?
[19:20:47] - type: map
[19:20:47] target: http://labtesthorizon.wikimedia.org
[19:20:47] replacement: https://cloudweb2002-dev.wikimedia.org:7443
[19:21:06] ...so how does it make sense that I can load the page in my browser at the same time that pybal gets a 503?
[19:21:33] I guess I'm just restating the same puzzle ta.avi was
[19:22:54] andrewbogott: the health check URL is configured independently from the rest of the load balancer params, so the port just needs removing from that URL
[19:24:41] ok...
[19:25:36] oh, I see it now
[19:28:11] (separate health check definition also explains my earlier confusion about both services dropping at once)
[20:07:55] horizon is down?
[20:09:00] gifti: yes, known. ongoing work to fix it is happening
[20:09:21] thx
[20:10:45] it's back
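The 503-vs-200 puzzle above comes down to whether the port ends up in the Host header: pybal's check URL includes :7443, so curl sends `Host: toolsadmin.wikimedia.org:7443`, while a browser going through LVS sends the bare hostname. A sketch of the two requests from the log, runnable from a host that can reach cloudweb1004; that the trixie envoy only matches the bare-hostname vhost is an inference from the discussion, not confirmed here.

```bash
# ProxyFetch-style request: port in the URL, so Host: toolsadmin.wikimedia.org:7443 -> 503 per the log
curl -s -o /dev/null -w '%{http_code}\n' \
    --connect-to ::cloudweb1004.wikimedia.org:7443 \
    https://toolsadmin.wikimedia.org:7443/

# Browser-style request: same backend and port, but no port in the URL/Host header -> 200 per the log
curl -s -o /dev/null -w '%{http_code}\n' \
    --connect-to ::cloudweb1004.wikimedia.org:7443 \
    https://toolsadmin.wikimedia.org/
```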