[10:59:10] !log tools.yifeibot bump quotas per request in T329350
[10:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.yifeibot/SAL
[10:59:13] T329350: Request increased quota for yifeibot Toolforge tool - https://phabricator.wikimedia.org/T329350
[15:25:38] Hi folks o/ I have been having some issues with webservices on toolforge. If I restart a webservice then it can't connect any more and I get 504 errors
[15:26:18] here is an example https://wiki-gpt.toolforge.org/
[15:27:16] any help appreciated or even telling me to contact someone else :)
[15:29:38] let's see
[15:31:10] when did you restart it?
[15:31:48] hmm approx 1.30 hour ago..
[15:32:09] if it helps this also happened to the webservice running here https://isaranto-test-ci.toolforge.org/
[15:32:21] ~15 minutes ago
[15:36:38] so far it looks like it's an issue with some of the newer toolforge worker nodes we added earlier today, still trying to find the exact problem
[15:37:38] aa so when I do a restart on a kubernetes webservice it spawned a pod in the new nodes?
[15:37:47] anyway thanks for looking into this
[15:39:17] each tool with a webservice runs in a kubernetes 'pod', which is essentially a docker container running on one of the many (~50) worker nodes in the cluster. when a new pod is created it picks a node which is usually the one with the least amount of load at that time
[15:39:28] when you restart the webservice, it deletes the current pod and creates a new one
[15:41:15] ack!
[15:45:55] !log tools reboot tools-k8s-worker-82 to troubleshoot network issues
[15:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[15:46:44] taavi: is that the new node?
[15:46:50] yeah, one of them
[15:47:13] the pod now re-scheduled to -81, which seems to have the same issue
[15:49:56] also new?
[15:50:29] yes
[15:50:46] I kind of remember worker nodes requiring a reboot before putting into service
[15:51:07] did you reboot them already?
(when you put them into service earlier today)
[15:51:38] I think the cookbook does that
[15:52:00] uhm, ok
[15:52:05] let me know if I can be of any help
[15:55:20] if you can figure out what's wrong with these nodes, that'd be great
[15:56:19] ok, you mentioned some network issue
[15:56:34] is there any specific symptom that you are seeing?
[15:57:27] yeah. as far as I can tell, the ingress nodes are unable to connect to service ips on the new nodes
[15:57:50] but a manual curl from `webservice shell` to the service ip of these tools work
[15:59:19] ok
[16:03:58] it works now.. 🎉
[16:36:10] thanks for the help folks o/
[16:36:15] <3
[16:39:46] hi cloud folks. I just discovered the 'X-Clacks-Overhead: GNU Terry Pratchett' http header and it made smile :)
[16:39:54] me*
[18:30:57] jgleeson: :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/196893
[18:35:22] love it
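
Note on the 15:39 explanation: the restart behaviour described there (delete the current pod, create a new one, place it on the node with the least load) can be illustrated with a toy model. This is only a rough sketch, not how the real Kubernetes scheduler scores nodes. The `Node` class, the `schedule` and `restart` helpers, the pod names, the node name `tools-k8s-worker-40`, and the use of pod count as "load" are all assumptions made for illustration; only `tools-k8s-worker-81` and `tools-k8s-worker-82` come from the log.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    pods: list = field(default_factory=list)

    @property
    def load(self) -> int:
        # Toy load metric (assumption): the number of pods on the node.
        return len(self.pods)


def schedule(pod: str, nodes: list) -> Node:
    """Place the pod on the node with the lowest toy load (ties: first in list)."""
    target = min(nodes, key=lambda n: n.load)
    target.pods.append(pod)
    return target


def restart(pod: str, nodes: list) -> Node:
    """Model a webservice restart: delete the current pod, then create a new one."""
    for node in nodes:
        if pod in node.pods:
            node.pods.remove(pod)
    return schedule(pod, nodes)


def demo() -> str:
    # Hypothetical cluster: one busy older node plus two freshly added, empty nodes.
    nodes = [
        Node("tools-k8s-worker-40", ["tool-a", "tool-b", "wiki-gpt"]),
        Node("tools-k8s-worker-81"),
        Node("tools-k8s-worker-82"),
    ]
    return restart("wiki-gpt", nodes).name


if __name__ == "__main__":
    print(demo())
```

In this model the restarted pod leaves the busy node and lands on one of the empty, newly added nodes, which fits the observation in the log that restarted webservices ended up on the new workers while untouched ones kept working.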