[14:17:25] Hi, Cloud! I am having trouble SSHing into a deployment-prep instance. I've updated my keys on IDM and my SSH config seems to be in order.
[14:20:21] apine: can you run ssh -vvv and paste the output into a pastebin
[14:27:43] https://pastebin.com/D7eLgm7b
[14:27:48] At this point it just hangs indefinitely.
[14:31:29] Ah! I see I didn't have the right ProxyJump configured. Here is the new SSH output:
[14:34:08] https://pastebin.com/TrwXdX8V
[14:34:34] One thing I notice here is that it isn't trying my key, which is configured as `.ssh/prod.key`
[14:53:18] apine: I see "Authenticating to bastion.wmcloud.org:22 as 'corybant'" in those logs, but I cannot find a Developer account with the shell name 'corybant'. I think your shell name is 'apine'
[14:56:06] Ah, I had a commented-out `User` in my .ssh/config. Gah. Fixed. Thank you!
[14:57:43] You might also want to pin an identity in your ssh config so that you don't present every key in your agent to the Cloud VPS bastions. If nothing else it will make the traces easier to read in the future.
[14:57:53] Thanks! Will do :)
[16:35:09] Another question, and the reason I was trying to get into a `deployment-prep` machine. The Wikifunctions Beta Cluster instance is running some backend services, and we're no longer able to make HTTP requests to them inside Beta Cluster. I see that the ports are mapped correctly (see pastebin below). We have old versions of the services running at ports `6928` and `6929`, and new versions running at `6938` and `6939`. The 692* ports are working, but the 693* ports aren't. It's been suggested that this is due to the security mesh. Can anyone offer pointers on why the services at 693* are unresponsive?
[16:35:09] https://pastebin.com/jbLm3886
[16:43:50] apine: check the security groups under https://horizon.wikimedia.org/project/security_groups/
[16:44:01] and see if it allows network traffic on those ports
[18:45:39] Hmm. It doesn't, but it also doesn't open the ports corresponding to the old services (which ARE working). And the service was working up until a recent code change, so I'd be surprised if this were the problem. I did see that we allow port `6927`, so I added a rule allowing `6937` as well; no change.
[18:48:30] apine: what host are you trying to reach?
[18:48:39] on port 6938?
[18:51:54] well anyway, it seems likely that that host is running ferm and you'll need to make a puppet change to adjust it. Just a guess, but you can look in /etc/ferm to see what's happening
[18:54:09] apine: you can also run "iptables -L" to check the actual firewall rules
[18:56:45] on whatever host has 172.16.1.154 (no DNS?)
[18:57:51] ah, ignore that last comment. on deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud
[18:59:04] since that's docker, it's also possible there is yet another level of firewalling between the parent host and the docker container
[19:08:27] It's all in docker, so it's all the same host (172.16.1.154). Would ferm affect that?
[19:09:15] Mmm. `/etc/ferm` doesn't exist.
[19:09:29] But yes, I am trying to reach `6938` and `6939`.
[19:10:41] https://www.irccloud.com/pastebin/P3oF2X0x/
[19:11:32] how about "netstat -tulpen" to check if it's actually listening on both of the ports
[19:11:52] also, since you say it's all been working until a recent code change.. do you already know which one that was?
[19:19:20] Changing the port
[19:19:35] I think
[19:19:38] Based on earlier
[19:37:22] Yes, we know which code change it was. It's a little tough to roll back (our Beta Cluster config just reads from head), but worth trying.
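For reference, the identity-pinning advice from 14:57 translates into a `~/.ssh/config` stanza roughly like the one below. This is a minimal sketch assuming the `apine` shell name, the `.ssh/prod.key` path, and the `bastion.wmcloud.org` jump host mentioned above; the exact `Host` patterns are illustrative.

```
# ~/.ssh/config -- sketch based on the details in this conversation
Host bastion.wmcloud.org
    User apine
    IdentityFile ~/.ssh/prod.key
    IdentitiesOnly yes    # offer only this key, not every key in the agent

Host *.wikimedia.cloud
    User apine
    IdentityFile ~/.ssh/prod.key
    IdentitiesOnly yes
    ProxyJump bastion.wmcloud.org    # hop through the Cloud VPS bastion
```

Likewise, the firewall and listener checks suggested between 18:51 and 19:11 amount to something like the following, run on the instance itself (the container name is a placeholder):

```
ls /etc/ferm                   # present only if the host runs ferm
sudo iptables -L -n            # the firewall rules actually in effect
sudo netstat -tulpen           # which ports are listening, and which process owns them
docker ps                      # running containers
docker port <container-name>   # host-port -> container-port mappings for one container
```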
[19:37:37] Hmm, which port do I change? And where do I change it?
[19:37:43] Where would I run `netstat -tulpen`?
[19:38:20] apine: where the services run that should have the port open
[19:38:38] On the box itself, `6928`/`6929` and `6938`/`6939` are all treated identically:
[19:38:41] https://www.irccloud.com/pastebin/AtnAK3KM/
[19:43:16] apine: it's not listening on 6937 though. was that the one that didn't work for you?
[19:45:03] `6937` is only internal to the container; it's mapped to `6938` and `6939` externally.
[19:46:09] since/if we can exclude horizon and ferm, I think it's probably in the docker networking settings itself then
[19:46:32] you could try to get a shell in a docker container and test it with telnet from there
[19:47:18] something like "docker exec -t -i container_name /bin/bash"
[19:47:24] to get inside the container
[19:47:42] Yeah, I've been testing from inside the containers with `docker exec`. From inside the docker container of the "server" service, I am able to make fetch requests normally. From inside the "client" service, I'm not able to, and I still get ECONNREFUSED
[19:48:01] So the server is definitely running, and definitely accepting requests, but it can't be reached from the other container.
[19:50:47] So from the "server"'s perspective, the URL is `http://0.0.0.0:6927/`, and I'm able to fetch normally. From the "client", it should be `http://172.16.1.154:6939/`, and I get `ECONNREFUSED`.
[20:43:04] When I create a VM in the deployment-prep project I notice that the only disk configuration offered is 20GB, which is inadequate for my needs. Has there been any discussion about adding larger root disk choices (e.g. 40GB and 80GB)?
[20:44:12] dancy: https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances is the recommended way to get additional storage.
[20:45:11] It is possible for WMCS admins to create special instance flavors that would change the root disk size, but we mostly hope that folks can work with volumes instead
[20:45:18] Nod. I know how I can add extra storage, however from a user experience perspective, I just want a larger root disk.
[20:46:01] is that because you don't know where the disk will be needed, or just to simplify workflows?
[20:46:29] Simplicity. Any other cloud provider allows you to choose the root disk size.
[20:46:53] any other cloud provider also gets revenue for the disk you consume...
[20:47:10] apples to bananas comparison
[20:48:39] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_flavors has more gory details about the history of our flavors and what the current practices are on the admin side
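A sketch of the container-to-container test described between 19:46 and 19:50, assuming hypothetical container names `wikifunctions-client` and `wikifunctions-server`; the IP and ports are the ones from the log, and `curl` may need to be installed inside the container:

```
# Get a shell inside the "client" container:
docker exec -t -i wikifunctions-client /bin/bash

# From inside it, request the same URL the client service uses:
curl -v http://172.16.1.154:6938/

# Back on the host, compare the published port mappings against what
# the server container is actually bound to:
docker inspect -f '{{json .NetworkSettings.Ports}}' wikifunctions-server
docker network inspect bridge    # container IPs on the default bridge network
```

One thing worth isolating here: when a container dials the host's own IP on a published port, the connection hairpins back through Docker's NAT, so requesting the server container directly by its bridge IP and internal port (e.g. `http://<container-ip>:6937/`, both values hypothetical) shows whether the publish/NAT layer is what's refusing the connection.

For the disk-space question at the end, the linked Help page documents the recommended procedure; once a volume has been attached through Horizon, making it usable looks roughly like this (the device name `/dev/sdb` and the mount point are assumptions):

```
lsblk                       # find the newly attached device (assumed /dev/sdb here)
sudo mkfs.ext4 /dev/sdb     # one-time format of the new volume
sudo mkdir -p /srv/extra
sudo mount /dev/sdb /srv/extra
# Persist the mount across reboots:
echo '/dev/sdb /srv/extra ext4 defaults 0 2' | sudo tee -a /etc/fstab
```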