[08:51:56] hi all, I have some problems with a couple of VPS instances [08:53:48] 1. I cannot SSH into an instance and I don't know why [08:54:10] 2. I am not able to mount disks after I have attached them to a VPS [08:54:52] could you be more specific on what's going on? what exact commands are you using? what error messages are you getting? [08:58:22] majavah: from within bastion [08:58:23] $ ssh backend.wikicommunityhealth.eqiad1.wikimedia.cloud [08:58:24] ssh: connect to host backend.wikicommunityhealth.eqiad1.wikimedia.cloud port 22: Connection refused [09:01:28] needless to say, I should have an SSH server listening on port 22 and I am forwarding my agent with the correct SSH key [09:03:44] 2. about the volume: I have a volume called "frontdata" attached at /dev/sdb on a server called frontend, but if I try to mount it I get this error: [09:03:47] sudo mount -t auto /dev/sdb /mnt/frontdata/ [09:03:47] mount: /mnt/frontdata: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error. [09:05:34] CristianCantoro: to me that error message sounds like you're missing the required ProxyJump/ProxyCommand rules from your ssh configuration, https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances#Accessing_Cloud_VPS_instances [09:06:01] majavah: I shouldn't need that if I am already in bastion [09:06:28] that sounds like a firewall problem, then [09:07:05] but just to be sure, are you familiar with the security implications of using ssh agent forwarding?
[09:07:16] yes, I am [09:08:16] my problem is that I have 2 instances that should be configured in the same way (one called "frontend" and one called "backend"), I have no problem logging into "frontend", but cannot log in anymore into "backend" [09:08:26] and I shouldn't have changed anything [09:08:28] I can't see the security group firewall rules on your project (project members and admins see them on horizon), but it's possible that the VM is missing from a security group that allows ssh access [09:11:04] the two instances have the same security groups [09:11:47] hmm, weird [09:11:59] also, as far as I know you shouldn't need a security group for SSH access if you jump through bastion [09:12:14] no, security groups apply to all traffic, including that [09:12:53] ok, the default security group has that [09:12:57] the default security groups allow SSH traffic, but we've seen a few cases where that failed to apply for new projects, although I think that was fixed by now [09:13:08] but it doesn't show up in the security groups of the machine [09:13:48] I am wrong, it shows up [09:14:54] ok, I confirm that both machines are in the default security group and this group has a rule like this: [09:15:00] if possible, can you try restarting the VM? maybe sshd has crashed or something [09:15:11] Ingress IPv4 TCP 22 (SSH) - default [09:16:02] Ingress IPv4 TCP 22 (SSH) 172.16.0.0/21 [09:20:08] majavah: I have tried hard rebooting the machine [09:20:11] I can ping it [09:20:38] from bastion [09:20:57] I don't know, either I have changed some config and completely forgot about it [09:21:13] there's no way to get "console access" to an instance, right?
[09:23:51] wmcs staff can connect to the console directly from the underlying hypervisor, but otherwise not [09:26:06] majavah: thanks, so I need somebody from staff [09:29:00] majavah: thanks for your help [09:39:41] also, another user and member of the project is not able to SSH into the "frontend" VPS, the only thing that is different in his case is that he is using an ECDSA key [10:39:42] ok, never mind, this last issue is fixed, it was a misunderstanding [10:54:27] CristianCantoro: hi, staff person here, still having issues sshing to the instance? [11:00:47] dcaro: yes! [11:01:14] dcaro: the instance is backend.wikicommunityhealth.eqiad1.wikimedia.cloud [11:01:28] ack, I'll have a look [11:02:55] created T288069 to keep track [11:02:55] T288069: Unable to ssh to VM backend.wikicommunityhealth.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T288069 [11:08:19] CristianCantoro: were you able to ssh to it before? [11:10:49] it seems it failed to set up the local disk: Dependency failed for Local File Systems. [11:11:03] (from the cloud-init logs) [11:12:34] also, can you confirm that I can reboot/stop-start the VM if needed to troubleshoot? [11:15:43] got to go for lunch, will continue debugging later, CristianCantoro do you mind replying on the task itself? thanks!
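[Editor's note] The "Dependency failed for Local File Systems" line from the cloud-init logs above usually points at an /etc/fstab entry for a device that was not available at boot. A hedged sketch, not confirmed as the actual entry on this VM (the device and mount point are the example names from this conversation): adding the `nofail` option lets systemd continue booting when the volume is absent instead of failing local-fs.target.

```shell
# Sketch only: an fstab entry for an attached Cinder volume.
# Without "nofail", systemd treats a missing device as a hard
# dependency failure for local-fs.target and boot can get stuck
# before sshd ever starts.
fstab_line="/dev/sdb /mnt/frontdata ext4 defaults,nofail 0 2"
echo "$fstab_line"
```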
[12:00:31] dcaro: sorry for the delay, I was having lunch (it's 2.00 PM here) [12:00:47] dcaro: yes, reboot at will [12:19:17] !log wikicommunityhealth rebooting backend instance (T288069) [12:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikicommunityhealth/SAL [12:19:21] T288069: Unable to ssh to VM backend.wikicommunityhealth.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T288069 [12:38:03] !log wikicommunityhealth stopping backend instance (T288069) [12:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikicommunityhealth/SAL [12:38:07] T288069: Unable to ssh to VM backend.wikicommunityhealth.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T288069 [12:42:11] !log wikicommunityhealth the server seems to run cloud-init every time it boots (T288069) [12:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikicommunityhealth/SAL [12:42:43] CristianCantoro: is there any data that shouldn't be lost on the backdata volume? [13:16:07] !log wikicommunityhelp migrated the backend vm to cloudvirt1040, same host as frontend, still getting stuck at boot (T288069) [13:16:08] dcaro: Unknown project "wikicommunityhelp" [13:16:09] T288069: Unable to ssh to VM backend.wikicommunityhealth.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T288069 [13:16:31] !log wikicommunityhealth migrated the backend vm to cloudvirt1040, same host as frontend, still getting stuck at boot (T288069) [13:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikicommunityhealth/SAL [13:32:35] hey cloud team, I'm giving the database as a service a shot on cloud vps! and unable to get into the instance. I'm not able to connect and pinging it times out even though status says healthy. [13:33:18] wondering if there's something I'm missing 🧐 [13:44:44] nikkinikk_: hi \o, awesome. Did you enable the root access?
(https://wikitech.wikimedia.org/wiki/Help:Adding_a_Database_to_a_Cloud_VPS_Project#Managing_Trove_Databases) [13:45:38] wait, do you want to ssh to it? or mysql to it? [13:47:53] mysql! just wanted to poke around. i had not enabled root access, and just did so, but still seems to just hang? [13:48:05] `mysql -h 4mvdjrtnqsg.svc.trove.eqiad1.wikimedia.cloud -u root -p` [13:50:13] but also was trying to ping it with `telnet` on port 3306 and that also seemed to just hang [13:51:33] nikkinikk_: what is the name of the VPS project? do you mind if I connect to the VMs to debug? [13:51:51] also, where are you trying to connect from? [13:51:56] for sure! I'd appreciate it [13:52:02] project is image-suggestion-api [13:52:20] where, as in, geographically? [13:52:29] nono, as in from which host/VM [13:53:46] just from my local machine [13:54:03] is that the problem...hah [13:55:08] yep xd, by default (and afaik) the trove instances are reachable only from within the VPS project, that is, the other VM instances inside the project [13:56:21] I see that it's not clear from the docs though, I'll add/improve the wording [13:56:46] ahhhh ok ok! [13:57:19] I'll try connecting from inside one of my project's VMs! thanks David! [13:58:06] well ok no it does technically say `Accessing Trove Databases From Your VM` haha but yes maybe a line explicitly calling that out would be great [13:58:35] 👍 let me know how it goes :), any (more) feedback is welcome too! [14:02:10] will do! and this db service is great and going to be REALLY useful for us right now, much appreciation for it :) [14:18:14] big thanks to andrewbogott, that made it possible! 🎉 [14:24:02] hi folks, FYI I'll be upgrading prometheus on cloudmetrics hosts, small/no impact expected - T222113 [14:24:02] T222113: prometheus: upgrade to >= 2.12 - https://phabricator.wikimedia.org/T222113 [14:33:06] {{done}} [15:15:56] godog: thanks!
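[Editor's note] Since Trove instances are reachable only from inside the project, a quick way to distinguish "refused" from "silently filtered/hanging" when testing from a project VM is a TCP probe with a timeout. A minimal sketch — `check_port` is a hypothetical helper written for this note, not a Cloud VPS tool, and the hostname in the comment is the one pasted above:

```shell
# check_port: TCP reachability probe using bash's /dev/tcp redirection,
# wrapped in a timeout so a filtered port fails fast instead of hanging
# the way telnet did in the conversation above.
check_port() {
  local host=$1 port=$2
  if timeout 5 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed-or-filtered"
  fi
}

# From inside a project VM this should report "open" for the Trove host:
#   check_port 4mvdjrtnqsg.svc.trove.eqiad1.wikimedia.cloud 3306
check_port 127.0.0.1 3306
```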
[15:20:40] dcaro: the two instances are quite empty at the moment [15:20:48] the volumes as well [15:21:44] the volumes are new, in the instances I have just installed docker and a few other things [15:23:01] hi dcaro :-) [15:29:43] dcaro: I am rebuilding the instance [15:33:51] dcaro: it seems that I am getting the same error [15:33:57] even after the rebuild [15:49:42] CristianCantoro: ack, looking [16:00:32] !log wikicommunityhealth rebuilding backend instance to debug initialization process (T288069) [16:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikicommunityhealth/SAL [16:00:35] T288069: Unable to ssh to VM backend.wikicommunityhealth.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T288069 [16:06:56] !log wikicommunityhealth rebuilt backend instance without the attached volume, and the instance is up and reachable, will try with the volume (T288069) [16:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikicommunityhealth/SAL [16:07:01] T288069: Unable to ssh to VM backend.wikicommunityhealth.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T288069 [16:11:41] !log wikicommunityhealth rebooted the VM and it's back up, with prompt on virsh, and reachable through ssh, CristianCantoro can you try and confirm? (T288069) [16:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikicommunityhealth/SAL [16:13:07] gtg, let me know if it does or does not work on the ticket, I'll close tomorrow if everything is ok, cya! [16:15:26] dcaro: I confirm I can log into the machine with SSH, thank you [16:15:39] and the volume is now mounted as well [16:17:16] dcaro: I still have the issue with the frontdata volume on the frontend machine [16:17:55] $ sudo mount -t auto /dev/sdb /mnt/frontdata/ [16:17:55] mount: /mnt/frontdata: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
[16:20:08] CristianCantoro: I remember there was some issue where we provisioned the volumes as vfat... maybe that's the issue? [16:20:33] (instead of ext4), let me know if you get anywhere in the ticket, I'll pick it up in the morning [16:20:50] I can delete the volume and recreate it, should I do it? [16:21:27] CristianCantoro: seems worth a shot. It will either get past the problem or show that it is reproducible. [16:22:27] dcaro: which one is the ticket for the volume error? [16:23:08] T287666 [16:23:09] T287666: toolsbeta-sgeexec-1001/2: buster sgeexec apt fails to write to /tmp - https://phabricator.wikimedia.org/T287666 [16:23:23] That was the vfat thing [16:24:14] bstorm: thanks! [16:28:01] lsblk thinks it's ext4 [16:30:13] Ohhh, that's backdata, not the front one [16:30:54] bstorm: I have just deleted and re-created the frontdata volume, attached it to the frontend machine and I'm still getting the same error [16:31:16] $ lsblk /dev/sdb [16:31:16] NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT [16:31:17] sdb 8:16 0 20G 0 disk [16:31:18] CristianCantoro: you have no filesystem [16:31:22] You need to create one [16:31:26] There's a script to run [16:31:32] https://www.irccloud.com/pastebin/efYgvDgz/ [16:31:55] `/usr/local/sbin/prepare_cinder_volume` [16:32:02] Run that, and it should set you up [16:32:14] When the volume is created, it has no filesystem at first [16:32:18] So this isn't really an error [16:32:23] at least for this volume [16:32:52] You only need to run that on a new volume [16:33:16] But it looks like it was never run because there's no entry in fstab [16:33:49] It's an interactive script [16:34:06] You can probably just accept defaults [16:34:13] bstorm: great, thank you.
It seems that I visited the page on wikitech with the instructions https://wikitech.wikimedia.org/wiki/Help:Adding_Disk_Space_to_Cloud_VPS_instances [16:34:19] but I completely forgot I needed to run that [16:34:38] No worries. It's easy to miss the step [16:36:51] done, it works and I see it automatically adds a line to `/etc/fstab`, so I should be all set [17:13:23] Great! [19:20:40] !log admin Running deleteBatch.php on cloudweb2001-dev to remove legacy Hiera: pages from labtestwiki [19:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL [21:12:48] hey cloud team - my team and I are writing up best practices around API development at WMF and on the topic of deployments, wanted to get an opinion [21:13:14] Is there any kind of "rule of thumb" for when to use cloud services like Cloud VPS/Toolforge versus taking the time to deploy in production K8s? Would it be generally correct to assume hosting in Toolforge/Cloud VPS is the preferred method for non-prod services, and when the time comes, then to take the time to deploy in K8s? [21:13:51] basically wondering what the "ideal" path would be for a new API to take as far as deploying it goes. [21:16:04] ^ FYI cc'd from -releng where I said you lot would probably have an opinion but production stuff running in Cloud VPS probably poses a few risks [21:16:26] * RhinosF1 drops a link to https://wikitech.wikimedia.org/wiki/Terms_of_use which makes many of them clear when it comes to privacy and reliability [21:34:22] nikkinikk_: WMCS is a place to test and demo things for sure. It is not recommended as a place to host critical external infrastructure for the wikis. [21:38:06] ok thanks bd808!
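[Editor's note] Looping back to the volume thread: the steps that `/usr/local/sbin/prepare_cinder_volume` automates can be sketched roughly as below. This is an assumed reconstruction for illustration, not the script's actual source; the device and mount point are the example names from the conversation, and on a real Cloud VPS instance the interactive script itself should be used.

```shell
# prepare_volume: hypothetical sketch of preparing a fresh Cinder volume.
prepare_volume() {
  local dev=$1 mnt=$2
  if [ ! -b "$dev" ]; then
    echo "no block device at $dev" >&2
    return 1
  fi
  # A brand-new volume carries no filesystem, which is why a bare mount
  # fails with "wrong fs type, bad option, bad superblock".
  if [ -z "$(lsblk -no FSTYPE "$dev")" ]; then
    mkfs.ext4 "$dev"
  fi
  mkdir -p "$mnt"
  mount "$dev" "$mnt"
  # Persist the mount across reboots; "nofail" keeps boot from getting
  # stuck if the volume is ever detached.
  grep -q " $mnt " /etc/fstab ||
    echo "$dev $mnt ext4 defaults,nofail 0 2" >> /etc/fstab
}

# Usage (on the VM, as root): prepare_volume /dev/sdb /mnt/frontdata
```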