[02:43:40] !log tools manual restart of apache2 on toolserver-proxy-1 to completely pick up renewed TLS cert (alert was flapping)
[02:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[11:13:36] !log paws robots.txt to nbserve 1fbc7865aad1ea592d9a852dda3bb3386fc1f29c
[11:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[12:08:12] !log baserow deleting VMs and project as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2022_Purge
[12:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Baserow/SAL
[12:17:17] !log data-engineering deleting project as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2022_Purge#SHUTDOWN_data-engineering
[12:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Data-engineering/SAL
[12:25:10] !log mobile deleting project as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2022_Purge
[12:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mobile/SAL
[12:32:15] what's happening with the gridengine please? currently receiving a bunch of error mails about it
[12:32:21] this is one example https://www.irccloud.com/pastebin/FxWVmBQZ/
[12:43:19] ftr, I received ~40 messages like this in the last ten minutes or so
[12:48:43] * dcaro looking
[12:50:08] I'm fretful that that's somehow related to the project I just deleted but... I can't see how.
[12:50:23] andrewbogott: what was it?
[12:50:37] oh, I see the logs xd
[12:51:06] yeah, nothing like 'tool-grid-monitoring-project' or similar :)
[13:10:33] !log tools rebooting tools-sgegrid-master node (T334847)
[13:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:10:37] T334847: [wmcs.gridengine] syncron message receive errors - https://phabricator.wikimedia.org/T334847
[13:34:28] tgr_: I just emailed you about the 'mwstake' project... can you tell me what's up with the big 'awaiting transfer...' cinder volume in that project?
[13:37:06] urbanecm: things should have stabilized now, let me know if you still see any issues
[13:38:15] dcaro: thanks for the info. deleting mails i received, will let you know if i see any additional ones.
[13:38:37] topranks: can you catch me up about the netops-clab project? It shows as abandoned on https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2022_Purge but it looks like someone restarted the VM after the 'scream test' without updating the purge page.
[13:39:53] received one at Mon, 17 Apr 2023 13:15:21 +0000, this time saying "unable to contact qmaster using port 6444 ". probably from the reboot.
[13:42:27] There was an outage on Toolforge this morning and the following jobs got stuck and need to be deleted so they can be restarted.
[13:42:29]  196752 0.25896 en.pgcount tools.botwik dr    04/11/2023 15:44:30 continuous@tools-sgeexec-10-17     1
[13:42:29]  274753 0.25531 sv.arcstat tools.botwik dr    04/14/2023 01:38:16 continuous@tools-sgeexec-10-17     1
[13:42:30]  339509 0.25228 it.arcstat tools.botwik dr    04/16/2023 01:38:16 continuous@tools-sgeexec-10-21     1
[13:42:37] !log sciencesource deleting project as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2022_Purge
[13:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Sciencesource/SAL
[13:43:33] GreenC: deleted, sorry about that
[13:44:10] no problem, it's designed to pick up where it left off after an outage. thanks for the deletion
[13:45:23] !log services deleting project as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2022_Purge#SHUTDOWN_services
[13:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Services/SAL
[13:48:47] andrewbogott: my apologies, seems I need to do a little bit of a refresher on policies for cloud
[13:49:05] to clarify, the project is very much live and I use the VM quite a bit for labbing up network stuff
[13:49:13] topranks: no worries, mostly you just need to read https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/ and/or komla's emails :)
[13:49:22] I updated the page to mark the project as used.
[13:49:45] ok thanks, yep I will read through the docs and try to take care of anything else that needs doing
[13:49:47] cheers!
[14:25:24] !log ml-collab-2022 deleting project as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2022_Purge
[14:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ml-collab-2022/SAL
[15:05:00] !log cloudinfra add firewall rules allowing public (authenticated) access to the enc api - T317478
[15:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Cloudinfra/SAL
[15:05:06] T317478: Go library and terraform provider to work with the puppet enc API - https://phabricator.wikimedia.org/T317478
[15:08:08] !log terraform tag terraform-provider-clouvps version 0.2.0 T317478
[15:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Terraform/SAL
[17:09:43] taavi: would you mind removing "profile::simplelamp2::database_datadir" from either all or 6 out of 8 projects? So I made an attempt to contact all of the projects, talked with them individually etc. Everyone I talked to has confirmed they don't use it or it _should_ be removed again. Just leaves 2 projects that did not respond. Any opinion if it's better to just remove it from all or leave the
[17:09:49] 2 special cases? Either way I would call it closed, tried my best to clean it all up without breaking stuff for anyone. (https://phabricator.wikimedia.org/T329571#8760492)
[17:10:48] meanwhile one of them has already asked for it to be removed, wikispeech, they are going to reimage and want to use default
[17:50:15] "tgr: I just emailed you about..." <- hi, saw the email but I wasn't involved with whatever that was
[17:50:47] do you know what the target project is?
[17:54:47] So far I haven't been able to tell. Let me dig a bit deeper...
[17:56:11] tgr_: all cinder will tell me is
[17:56:12] https://www.irccloud.com/pastebin/imajfMGg/
[17:57:16] wikiapiary says it can't accept a transfer because of exceeded quota
[17:57:31] Do we have any way of reaching Mark to ask about this?
[17:58:53] I think wikiapiary was eventually moved out of WMCS, but better double-check with Mark. He is on Matrix, I can ping him.
[18:00:18] thanks
[20:27:37] Hi all, would this channel be an appropriate place to ask about why I am getting errors when I try to scp local files to the project folder of a tool that I maintain?
[20:31:16] what error are you getting?
[20:31:43] "scp: stat remote: No such file or directory"
[20:31:55] "scp: failed to upload directory dist/ to /data/project/wlh/dist/"
[20:32:44] can you ssh to the same place and check if /data/project/wlh/dist/ exists?
[20:32:45] I have working SSH credentials, and I can "become" the tool in question. But attempting to copy a local "dist" folder to the project workspace via scp is not working
[20:33:00] Yeah, that's probably a permissions problem.
[20:33:31] EricGardner: that directory is owned by kindrobot:tools.wlh with no group write bit set
[20:33:47] this is on toolforge?
[20:33:59] i.e. dev.toolforge.org?
[20:34:10] yes – the live site is at wlh.toolforge.org
[20:34:18] roy649: yes. see my note about permissions above
[20:34:38] try copying it to ~/dist/. then ssh, become kindrobot and move it to the right place
[20:35:03] kindrobot is a human, so EricGardner no can become :)
[20:35:19] ok – kindrobot and I collaborated on this project. In the past I was able to deploy
[20:35:50] This sounds exactly like https://phabricator.wikimedia.org/T214966
[20:35:52] it looks like it just needs the perms fixed. this sort of thing over scp is way too easy to break
[20:35:53] I had hoped that by running "become wlh" (which I can do), I could delete these files and re-upload
[20:36:31] EricGardner: you can try `become wlh; take dist`. That should fix the permissions I think
[20:37:15] `take` is our home grown chmod+sudo wrapper
[20:37:30] ah ok, I didn't recognize that command
[20:37:50] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Tool_Accounts#Take_ownership_of_files
[20:38:08] I can run it w/o any error messages, is there a unix command I can use to check the dir permissions?
[20:38:24] ls -lhd $HOME/dist
[20:38:28] ls -l
[20:39:07] (Sorry, this somehow feels like it's my doing. D:)
[20:40:01] Hey kindrobot! I had just noticed that the site was down and wanted to re-deploy; I was able to restart the server via ssh but then realized that deployment wasn't working. Now I can see the permissions of that dir are set as "drwxr-xr-x 3 tools.wlh tools.wlh 4.0K Mar 18 04:08 /data/project/wlh/dist"
[20:40:53] And the site's back up! Good work. :)
[20:41:32] I wonder why it went down to begin with. 🤔
[20:42:35] I was poking around for logs but didn't see much. The API was still up when the front-end was down. But the front-end is just static files on a webserver...
[20:43:06] Also, after "taking" that folder I can SCP successfully, thanks bd808 for the suggestion
[20:45:41] bd808: how many times did I used to break permissions and have to take?
[20:45:45] actually, nvm – still getting the same error
[20:45:50] That became my favourite command
[20:46:20] EricGardner: https://wlh.toolforge.org/pages/NOFX works for me?
[20:47:17] EricGardner: are you using https://gitlab.wikimedia.org/repos/abstract-wiki/wlh/-/blob/main/scripts/deploy_to_toolforge.sh. It should handle the taking for you.
[20:48:03] bd808: yes, was able to use the "webservice" command to stop and re-start. But I wanted to test out the deployment process too – without SCP I can't push any updates here. kindrobot: I was using that script (with my username as the argument), but kept getting those SCP errors.
[20:49:00] ah. carry on then EricGardner :)
[20:49:10] it looks like after running "take" I could delete the old dist/ folder, and then I could scp up a new one from local
[20:49:13] EricGardner: you might have to delete `dist` first. It sounds like it got into a wedge state.
[20:49:19] yeah exactly
[20:49:35] do we have docs on wiki about how to put `become ` into the scp command so that permissions are harder to mess up?
[20:49:40] * bd808 looks around
[20:50:08] You can do that?
[20:51:28] I know we used to have it documented for a Windows GUI tool. I'm pretty sure I've seen others show how to do it with normal ssh/scp too.
[20:52:11] My normal ssh command is:
[20:52:12] alias dyk-tools='ssh -t dev.toolforge.org become dyk-tools'
[20:53:03] I'm sure you could hack together something similar with scp
[20:53:58] You can modify SCP to use a custom ssh command
[20:54:10] To add parameters
[20:54:15] ok, now when running "take dist", I get this error: "dist: You need to share a group with the file"
[20:54:27] for WinSCP we have the doc at https://wikitech.wikimedia.org/wiki/Help:Access_to_Toolforge_instances_with_PuTTY_and_WinSCP#How_to_set_up_WinSCP_for_direct_access_to_your_Toolforge_account to use `sudo -u tools.PROJECT-NAME /usr/lib/sftp-server` as the sftp-server command
[20:55:14] rsync -e “ssh ”
[20:55:45] `become $TOOL` is roughly `sudo -u tools.$TOOL`
[20:55:50] Not sure if that would help
[20:56:21] bd808: doesn’t become just have a few sanity checks
[20:56:54] yeah. `cat /usr/bin/become` on a Toolforge bastion for the details.
[20:57:41] I don’t think I’ve ever ssh’d to toolforge on my new laptop
[20:57:47] I should actually try
[21:02:17] 16:47:17 EricGardner: are you using https://gitlab.wikimedia.org/repos/abstract-wiki/wlh/-/blob/main/scripts/deploy_to_toolforge.sh. It should handle the taking for you. <-- why not do the build on Toolforge itself as the tool user? then you don't have to worry about scp and taking and permissions at all?
[21:03:13] I forget exactly why, I think maybe it was the version of node...?
[21:03:13] legoktm: are there any restrictions/concerns about running git or npm commands here? Otherwise that would work fine
[21:03:55] only that you probably need to run your commands inside a Kubernetes pod to get non-ancient versions of node
[21:04:12] than you just get old versions of node ;)
[21:04:51] Yeah, it was the node version.
[21:05:01] that and you shouldn't run computationally expensive stuff on the bastion itself (rather use a k8s pod or jsub). Generally no issue with e.g. `pip install` but if `npm install` ends up compiling C code it might start hitting limits
[21:05:42] ah yes, we still have node v10 here; we need like v16
[21:06:15] you can use node16 via `toolforge-jobs`
[21:06:38] `webservice node16 shell -- npm --version`
[21:06:54] npm 8.15.0 in a node 16 container
[21:07:35] oh cool – I am a toolforge newb and did not know you could just shell into containers here
[21:07:46] all of the software versions on the bastions are ancient.
[21:07:51] Ah, didn't know about webservice X shell, is that replacing jsub?
[21:08:10] kindrobot: no, toolforge-jobs would be the jsub replacement
[21:08:20] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework
[21:09:18] Ah, OK.
[21:09:37] `webservice --backend=kubernetes $TYPE shell` creates a new pod and attaches your stdin/stdout to it.
[21:10:20] it is also possible to do things like https://wikitech.wikimedia.org/wiki/User:BryanDavis/Kubernetes#Attach_to_a_running_pod
[21:10:49] I'd recommend toolforge-jobs in this case just because it's usually easier to script
[21:11:26] Yeah. Very cool. Good stuff!
[21:11:45] no I take that back, I forgot `webservice shell` now lets you pass args to it
[21:12:25] yeah, I think we got that mostly fixed. I'm sure there are ways to break it, but the intent is to be able to run a full command from the bastion
[21:13:32] when deploying to toolforge, is it better to use git (or scp/rsync/whatever) to get the project source files on to the host and then perform any necessary work there (in our case, "npm install" and "npm run build" to generate some static files)? Or is it better to do that work locally and just push the final files? I just assumed the latter approach was preferred
[21:33:49] EricGardner: Whatever works for you is best, but I tend to do code changes in git, fetch from Toolforge, and "build" steps on Toolforge using either interactive shells in Kubernetes or toolforge-jobs tasks.
[21:41:41] ^^ same, and mostly automated the deployment part with https://wikitech.wikimedia.org/wiki/User:Legoktm/update-rust-web-tool
[21:43:37] !log tools.poty-stuff Updated from 992e383 to 1c6be41
[21:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.poty-stuff/SAL
[21:50:08] !log tools.poty-stuff Updated from 1c6be41 to cef3d40
[21:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.poty-stuff/SAL
[22:05:39] Thanks to everyone for all the toolforge troubleshooting help here, I learned a few things today
[22:41:53] https://wikitech.wikimedia.org/wiki/User:BryanDavis/Toolforge#Copy_files_from_local_to_tool's_$HOME_as_the_tool_user seems to work pretty well in limited testing to copy files from local to Toolforge as a tool's user.
[23:18:38] !log mwstake deleting project and all resources
[23:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Mwstake/SAL
[23:45:03] !log tools.poty-stuff Updated from cef3d40 to 6b633e6
[23:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.poty-stuff/SAL
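
Note on the permissions check discussed between 20:33 and 20:40: the scp failure was traced to a `dist` directory with no group write bit set, and `ls -lhd` was suggested to inspect it. The Python sketch below reads the same two facts (the mode string and whether the group write bit is set) using only the standard library. It was not part of the conversation; the function name `describe_dir` and the throwaway demo directory are assumptions made for illustration, and it does not touch any real tool directory.

```python
import os
import stat
import tempfile


def describe_dir(path):
    """Return (mode_string, group_writable) for path.

    mode_string matches the first column of `ls -ld`;
    group_writable is True when the group write bit is set.
    describe_dir is a hypothetical helper, not a Toolforge command.
    """
    mode = os.stat(path).st_mode
    return stat.filemode(mode), bool(mode & stat.S_IWGRP)


if __name__ == "__main__":
    # Demonstrate on a temporary directory instead of /data/project/...
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "dist")
        os.mkdir(demo)

        os.chmod(demo, 0o755)  # same mode as reported in the log at 20:40:01
        print(describe_dir(demo))

        os.chmod(demo, 0o775)  # add group write
        print(describe_dir(demo))
```

With mode 755 the group write flag comes back false, which is consistent with other members of the tool's group being unable to write into the directory over scp.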