[10:23:38] !log tools tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand
[10:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[10:24:00] arturo: this is the task T302702
[10:24:01] T302702: tools-sgeexec-0916: ToolsGridQueueProblem, Grid queue continuous/task is in state auE - https://phabricator.wikimedia.org/T302702
[10:24:11] ok
[10:24:49] dcaro: ok, so I can stop!
[10:25:30] you can continue if you want, or at least tell me what the process to debug is (/me started writing the runbook for the alert)
[10:30:20] dcaro: there isn't a lot of information on how to debug. I usually check `aborrero@tools-sgegrid-master:~$ sudo grep sge_qmaster /var/log/syslog`
[10:30:39] other than that, the node health is the principal source of problems for the grid
[10:30:57] taavi mentioned root disk storage space, and that's likely what happened here
[10:31:15] my plan was to reboot the nodes, clean up space, clean up the queue states by hand, repool the nodes
[10:31:20] how do you get the jobs' logs/output?
[10:32:01] in each tool home, perhaps, IF the tool enabled storing the output
[10:33:37] I also detected this T302783
[10:33:37] T302783: exec-manage fails because the grid master is not a submit host - https://phabricator.wikimedia.org/T302783
[10:37:07] another question, where are the alerts from https://prometheus-alerts.wmcloud.org/?q= coming from? (which repo)
[10:39:43] I wish I knew :-)
[10:40:01] puppet ?
[10:40:16] I have no idea, honestly
[10:42:07] * dcaro checking out this https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Adding_new_projects
[10:48:53] Hey, I have problems with the usage of floating IPs on a VM in WMCS. I created a floating IP and assigned it to the VM. Traffic over the floating IP is reaching the VM. However, to the VM it looks like the traffic is coming to the private IP and not the floating IP. So services which bind to the floating IP only won't get any traffic.
[10:48:56] I was wondering if there is some NATing and if it is possible to use floating IPs as a dedicated address without NAT. The project I'm working in is devtools. I also followed https://wikitech.wikimedia.org/wiki/Help:Manage_floating_IP_addresses_assigned_to_Cloud_VPS_instances
[10:53:10] jelto: the current floating IP implementation in our cloud deployment is that the NAT (public IPv4 : private IPv4) is done in a cloudnet node, i.e., a router. There is no way your VM can see (or be aware of) such a public IPv4 address
[10:53:25] you should not bind your services to the floating IP address
[10:55:21] arturo: ok thanks for the quick help and clarification
[11:15:30] no problem!
[11:37:51] !log metricsinfra Adding runbook url annotation to GridQueueProblem alert on DB at metricsinfra-crontroller-1 (T302702)
[11:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Metricsinfra/SAL
[11:37:55] T302702: tools-sgeexec-0916: ToolsGridQueueProblem, Grid queue continuous/task is in state auE - https://phabricator.wikimedia.org/T302702
[11:38:17] !log metricsinfra Reloading alertmanager to refresh new config (T302702)
[11:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Metricsinfra/SAL
[11:39:27] and now the runbook link appears in the alert, nice
[12:11:10] !log tools Cleared error state queues for sgeexec-0916 (T302702)
[12:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[12:11:14] T302702: tools-sgeexec-0916: ToolsGridQueueProblem, Grid queue continuous/task is in state auE - https://phabricator.wikimedia.org/T302702
[12:12:15] * dcaro lunch
[13:08:30] I have problems with creating a network port for the private (lan-flat-cloudinstances2b) network. I tried different combinations, with and without attaching it to a DeviceID, with disabled admin state. I get the following generic message, which doesn't help for further troubleshooting: Error: Failed to create port "gitlab-prod-1001-eth1". I'm working in project devtools.
[13:08:35] Is there some documentation on how to create additional ports in WMCS?
[13:29:10] jelto: The official procedure is to create a phabricator task to ask for a cloud vps admin to create it for you. Horizon apparently doesn't check for the required access (can you file a task about that too?) but you're seeing that error because on the backend we've limited most network management to global cloud vps admins only
[13:38:31] !log paws deploying pyaudio fix 978fb648dbd1d1f351bba74b6e2aff4023137087
[13:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[13:41:25] !log tools rebooting tools-sgeexec-0916 to clear any state (T302702)
[13:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:41:28] T302702: tools-sgeexec-0916: ToolsGridQueueProblem, Grid queue continuous/task is in state auE - https://phabricator.wikimedia.org/T302702
[13:43:40] taavi: thanks I opened T302803
[13:43:42] T302803: Create additional network port in project devtools - https://phabricator.wikimedia.org/T302803
[13:47:42] thanks! I'll try to get that done at some point today
[13:47:50] I assume you also want the floating IP mapped to it?
[13:56:17] taavi: yes right, if possible please map the floating ip for gitlab (185.15.56.79) to the new port
[15:06:48] taavi: horizon limitation is mostly described here T255670
[15:06:49] T255670: horizon: enable neutron port management - https://phabricator.wikimedia.org/T255670
[17:37:32] taavi or anyone else LMK if you have 5-10m today to go over https://phabricator.wikimedia.org/T302732 , I have perms to do it but want to make sure I do it safely for my first time
[17:43:09] inflatador: please do not abuse your global root access to resolve your own WMCS quota requests. We have processes and folks for taking care of these things.
[17:45:43] bd808 precisely why I'm asking. I also used to be an operator on one of the largest openstack public clouds, so I'd like to get more involved with the WMCS openstack stuff in general, as time permits
[17:46:40] Anyway, I'm not taking any actions without explicit permission from someone who owns the infra
[17:54:02] ( I also designed processes and tools to standardize and speed up quota increase requests, so maybe I'm a bit too close ;p )
[17:56:21] In the long ago we let lots and lots of folks approve quota and project requests. At the time this led to unmonitored growth of usage of the limited pool of compute and storage that we had. Things started getting better when we added some checks into the system.
[17:57:25] We are much better provisioned today as well which might make it ok to loosen things up more.
[17:59:32] Understood. Again, I want to make it clear that I'm not going to take any action without permission, I'm still learning the ropes and I don't know what footguns are hidden out there. I'm just blocked on some testing and for reasons cited above, I felt like I could handle it with some supervision. None of this is high priority or anything, so I'm happy to wait if necessary
[18:16:17] !log devtools allocated secondary IP for gitlab-prod-1001 per request on T302803
[18:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Devtools/SAL
[18:16:20] T302803: Create additional network port in project devtools - https://phabricator.wikimedia.org/T302803
[21:19:28] Hi
[21:20:20] Uhh
[21:20:22] Help
[21:20:56] Guys
[21:20:59] how can we help?
[21:21:15] I want to get started doing stuff on Wiki Tech
[21:21:22] But I don't know how
[21:22:03] what are you wanting to do?
[21:22:11] Make bots
[21:22:27] And also get some badges in Phabricator
[21:22:32] I also want to know this
[21:22:50] When I put a SSH key into Wikitech
[21:23:00] And then make an account on toolforge
[21:23:15] Toolforge will say it's invalid
[21:23:22] While Wiki Tech will not
[21:23:31] I don't understand
[21:23:45] Oi
[21:24:41] haven’t we seen you a few days ago
[21:24:53] Oi
[21:24:58] Anybody home
[21:25:43] ANYBODY
[21:26:11] O
[21:26:15] you are qwertx, yes?
[21:26:24] Yes
[21:26:31] but I still need some help
[21:27:20] Guys
[21:27:33] bd808 or taavi, could you provide the necessary assistance?
[21:27:48] OH SHIT
[21:27:56] SHIT SHIT SHOI IT TTORLSOSS
[21:28:18] NOT BD808
[21:28:35] !kb lollipop
[21:28:53] thank you for your assistance ^^
[21:29:32] AntiComposite: do you have any desire to have the ops bit here?
[21:31:35] I'd take it if you think I should have it
[21:34:25] AntiComposite: bit granted. Don't feel obligated, but you should now be able to use wmopbot's !kb magic here if needed.