[00:14:52] !log bd808@tools-sgebastion-11 tools.gitlab-account-approval Rebuilding image to pick up T356350 fix
[00:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.gitlab-account-approval/SAL
[00:18:16] @Danny: You should be all set to use GitLab now. https://phabricator.wikimedia.org/T356350 explains the problem (overly aggressive API filtering)
[00:24:02] Sounds great. Let me take a look at the ticket.
[00:25:45] Happy to see the root cause addressed, and I appreciate the quick response time.
[00:26:26] thanks for reporting! This wasn't one I would have noticed until somebody complained :)
[00:27:32] No problem. At last, I am signed in \o/
[00:31:17] I'll sign off for now. Wishing everyone a nice rest of their day.
[09:05:17] a new day and a new challenge for T319953
[09:05:18] T319953: Migrate panoviewer from Toolforge GridEngine to Toolforge Kubernetes - https://phabricator.wikimedia.org/T319953
[09:05:37] this tool queues jobs from the web, not from the login server
[09:06:11] (like several other tools)
[09:07:07] the client cert is available via NFS, and I can successfully contact the TJF API from the k8s web service container
[09:08:06] but I don't want to include any implementation detail in the tool source, which I think rules out using curl to queue the job
[09:08:54] so I think I want the TJF CLI to be available in k8s webservice jobs
[09:09:32] I tried just adding it to Aptfile in the buildpack, but it's not in a configured repo
[09:10:41] yep, we have not yet tackled accessing the toolforge APIs from within the running services, so it's all new
[09:10:42] it would be non-obvious for users anyway -- better if it can magically work in the base image
[09:12:07] supposedly the grid engine will be shut down any day now, that's why I'm working on this
[09:12:36] I'll open a task to follow up and discuss how to make it work
[09:12:57] thanks
[09:14:17] are you using python by any chance? we ship pypi packages
[09:14:32] no
[09:14:37] well...
[09:15:17] the tool is a PHP web frontend which queues a job which runs a python script on the command line; we are installing a few python packages for that to work
[09:16:08] but the tool author did not write the python script, there's no custom python source here
[09:16:52] I have to warn though that the libs are still young, and they might change soon-ish, so probably the clis are more stable
[09:19:22] hmm, I do lean towards specifying that you require the clis somewhere though, as opposed to just expecting them to be there (that lets us know what is actually needed for the tool to run)
[09:20:55] that would mean requiring the use of a buildpack for tools that need to queue jobs
[09:21:34] @Danny you are right about needing whitenoise. I somehow skipped over adding that part to the tutorial :/ I'll update it now
[09:22:27] Good morning blancadesal, are you online?
[09:22:43] yes
[09:22:45] Wow, that's a coincidence.
[09:22:54] Great timing.
[09:23:08] :)
[09:23:56] Apparently the universe also works in mysterious ways across continents.
[09:24:00] Anyway ...
[09:24:20] Do you have a link to help me understand why whitenoise specifically is needed?
[09:24:21] TimStarling: well, only if you use a buildservice image
[09:24:42] for other pre-built images (ex. python3.11/php7.4) it would be there
[09:25:30] imdeni: https://docs.djangoproject.com/en/4.2/howto/static-files/
[09:26:13] blancadesal: I was more thinking why whitenoise specifically. Why can't I use vanilla Django?
[09:26:22] https://devcenter.heroku.com/articles/django-assets
[09:26:47] Django doesn't support serving static files in production out of the box
[09:26:52] That is exactly what I needed. All is clear now.
[09:27:30] this might be helpful too: https://whitenoise.readthedocs.io/en/stable/django.html
[09:27:43] Do you know when you'll be able to have this in the tutorial? It's the middle of the night here, so ideally I'd go to bed and the documentation would be ready for me to do this in the morning.
[09:27:48] If that's not too much to ask.
[09:28:09] sure, I'll update it right now
[09:28:12] Uhh, all the good links. I appreciate you sharing these.
[09:28:23] That's great. Thank you.
[09:28:55] you're welcome. do reach out again if you hit a snag when you try it!
[09:29:06] And then, are you able to point me anywhere regarding the database? I assume that relying on the file system in production would not work in this case.
[09:29:32] Thank you for offering. I'll remember that.
[09:31:20] Since, if the cloud service works the same way as Heroku, it can kill a worker any time and spawn another, or even spawn multiple to handle more traffic. That's obviously not gonna work if this is a single sqlite database on the filesystem.
[09:31:34] imdeni: if you want to connect to toolsdb/wikireplicas you have https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Connecting_to_ToolsDB_and_the_Wiki_Replicas
[09:32:29] what dcaro said :) do not use sqlite in production
[09:32:32] toolsdb is a mariadb instance with your database to use as you like (note that there are public and private ones https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#User_databases)
[09:32:51] dcaro: Thanks, that looks like what I need. If I give those credentials to Django, would the user account have permission to create a new database/table?
[09:34:38] yes, as long as it's named as specified there
[09:35:06] (the DB; the table can be anything)
[09:35:28] TimStarling: created T356377 for the toolforge packages
[09:35:28] T356377: [toolforge] allow calling the different toolforge apis from within the containers - https://phabricator.wikimedia.org/T356377
[09:35:53] Gotcha, that's perfect. Thanks for the info.
[09:35:54] feel free to add your specific usecase (php tool, built using buildservice, needs to trigger one-off jobs for example)
[09:36:05] yw
[09:37:48] blancadesal: Kind "feature" request if you ever get the time: adding use of the database connection to the example app/tutorial.
[09:38:20] noted :)
[09:39:41] blancadesal and dcaro: Thank you both for the swift help 💫 Goodnight from San Francisco and I wish you both a great day!
[09:39:57] good night!
[09:41:44] nict!
[09:41:48] *night
[10:54:59] !log admin invite aborrero to /repos/cloud, /toolforge-repos, /cloudvps-repos on gitlab
[10:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[12:54:02] it works fine locally. Quickstatements runs in Cloud VPS and accepts very long urls.
[12:54:02] I can configure nginx so it works in Cloud VPS. :) (re @wmtelegram_bot: wasn't the issue in the request between your tool and mix-n-match? how would moving to cloud vps solve that?)
[12:57:02] thanks, I'm not willing to wait for a ticket to be solved. The cause you describe sounds very likely (default nginx does not allow very long urls/large headers). It defaults to 8KB (https://amalgjose.com/2020/05/15/how-to-set-the-allowed-url-length-for-a-nginx-request-error-code-414-uri-too-large/).
[12:57:03] My tool needs more.
(re @wmtelegram_bot: @dpriskorn: I think folks have tried to figure that out before without finding a clear answer. As I recall any limit in ...)
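[Editor's note] A minimal sketch pulling together the whitenoise and ToolsDB advice from the 09:26-09:38 exchange above. It is an illustration only, not the tutorial's actual code: the credential user "s12345", the database name "s12345__mydb", and the use of the TOOL_DATA_DIR environment variable to locate replica.my.cnf are assumptions to adapt per https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#User_databases.

```python
# settings.py -- sketch only; "s12345" / "s12345__mydb" are placeholders for
# your actual ToolsDB credential user and database name.
import os
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent

# Serve static files from the Django process itself via WhiteNoise.
MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "whitenoise.middleware.WhiteNoiseMiddleware",  # directly after SecurityMiddleware
    # ... keep the rest of the default middleware here ...
]
STATIC_URL = "static/"
STATIC_ROOT = BASE_DIR / "staticfiles"  # `manage.py collectstatic` writes here
STORAGES = {
    "default": {"BACKEND": "django.core.files.storage.FileSystemStorage"},
    "staticfiles": {"BACKEND": "whitenoise.storage.CompressedManifestStaticFilesStorage"},
}

# Use ToolsDB (MariaDB) instead of sqlite; needs the mysqlclient package.
# Assumption: TOOL_DATA_DIR points at the tool's home, where replica.my.cnf
# holds the [client] user/password.
TOOL_DATA_DIR = os.environ.get("TOOL_DATA_DIR", str(Path.home()))
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "HOST": "tools.db.svc.wikimedia.cloud",  # ToolsDB hostname per the docs page above
        "NAME": "s12345__mydb",  # must be prefixed with the credential user plus "__"
        "OPTIONS": {"read_default_file": os.path.join(TOOL_DATA_DIR, "replica.my.cnf")},
    }
}
```

With read_default_file the password never appears in the settings file; the MySQL client library reads it straight from replica.my.cnf.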
[13:47:37] Ugh, I managed to lose my password for wikitech (lastpass overwrote it with a different set of credentials for a different WMF wiki).
[13:50:48] I tried doing a reset with idp, but I'm not making any progress
[14:01:32] is my toolsadmin password what I'm supposed to be using on wikitech?
[14:03:16] it... should be?
[14:04:41] it's not working.
[14:05:30] I can definitely log in there with my "wikitech" (LDAP) password
[14:06:35] for all I know, in my multiple attempts to get idp to work, I may have changed it and now I don't know what it is.
[14:07:47] the root cause is that I was just given an account on ombuds.wikimedia.org and that apparently confused lastpass since they're both in the same wikimedia.org domain.
[14:08:08] Doesn't lastpass keep a history of "previous passwords"?
[14:08:23] >Locate your desired site entry in your Vault, then click Edit icon. Click the History icon (a round arrow with a clock inside) next to the field name for Username, Site password, or Notes. Next to the date, click the Show Text icon (eye icon) to display the stored data.
[14:08:32] https://community.logmein.com/t5/LastPass-Support-Discussions/How-to-view-password-history-in-LastPass/td-p/261347
[14:10:25] hmmm, it says no history available.
[14:10:36] I've got a bunch of scratch codes, but I don't know how to use them.
[14:11:22] They're not for your password, they're for if you don't have your 2FA
[14:11:44] If you've tried https://idm.wikimedia.org/wikimedia/password/ and it's not sending you anything, it might be worth filing a bug to start
[14:14:50] idp does send me a reset link.
[14:15:09] Part of the problem is that it wants me to enter my username, and I'm not even sure what that is.
[14:15:13] Aha
[14:15:25] I do know my email address, which is usually what these things key off, but it doesn't want that.
[14:15:44] It really should accept both
[14:15:49] I've got multiple different usernames on different WMF wikis and such.
[14:16:00] If you want to PM me your email, I can have a look
[14:16:07] the WMF auth system is maddeningly obtuse.
[17:05:11] Hi. I'm getting the following error when opening the shell
[17:05:12]
[17:05:14] Error from server (Forbidden): pods "shell-1706806903" is forbidden: exceeded quota: tool-ashbot, requested: limits.cpu=500m,limits.memory=512Mi,pods=1,requests.memory=256Mi, used: limits.cpu=8,limits.memory=8Gi,pods=16,requests.memory=4Gi, limited: limits.cpu=8,limits.memory=8Gi,pods=16,requests.memory=4Gi
[17:05:15]
[17:05:17] I expect jobs to be lightweight and fast (most of them are simple sql queries or bot edits). I have 25 different jobs that should run at most 20 min each. Is there any way to see which job specifically is using a lot of cpu/memory so I can terminate it and fix the issue?
[17:07:08] when I do toolforge jobs list, nothing is currently running, so how do I check who is exceeding the quota?
[17:07:43] `kubectl top pod` might help
[17:09:49] shell-1706370360 0m 9Mi
[17:09:50] shell-1706456222 0m 6Mi
[17:09:51] shell-1706459218 0m 2Mi
[17:09:53] shell-1706484849 0m 1Mi
[17:09:54] shell-1706486575 0m 3Mi
[17:09:56] shell-1706494544 0m 7Mi
[17:09:57] shell-1706540314 0m 7Mi
[17:09:59] shell-1706542918 0m 2Mi
[17:10:00] shell-1706615537 0m 1Mi
[17:10:02] shell-1706630151 0m 7Mi
[17:10:03] shell-1706633653 0m 4Mi
[17:10:05] shell-1706637935 0m 8Mi
[17:10:06] shell-1706647382 0m 3Mi
[17:10:08] shell-1706659380 0m 3Mi
[17:10:09] shell-1706702361 0m 2Mi
[17:10:11] shell-1706705180 0m 3Mi
[17:10:12]
[17:10:14] all cpus are 0 and memory is always less than 10Mi. (re @chicocvenancio: kubectl top pod might help)
[17:10:27] yeah, top will show usage, but you're hitting the limit on requests and limits
[17:10:43] ashotjanibekyan: those are old shells, I think you are hitting the pod quota
[17:10:48] (number of pods)
[17:10:51] but, why are they all named `shell`? did you name your jobs that way?
[17:11:00] you can try removing them: `kubectl pod delete <pod-name>`
[17:11:16] or are they from several attempts at starting a shell through webservice?
[17:12:04] might be, but it for sure will not let you try again if there are too many pods xd
[17:13:02] no, none of my jobs have "shell" in their name, the list looks like this now
[17:13:03]
[17:13:04] they are quite old though, so probably leftovers from unclosed connections
[17:13:05]
[17:13:06] Job name: Job type: Status:
[17:13:08] ---------------------------- ------------------ -----------------------------------------------------------
[17:13:09] admin-stats schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:11] almost-uncat-articles schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:12] archivebot-own schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:14] col-2-ver schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:15] dead-people-without-image schedule: @monthly Waiting for scheduled time
[17:13:17] del-draft schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:18] empty-pages schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:20] large-images schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:21] main-talk-redirect-mismatch schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:23] math-project-adder schedule: @monthly Waiting for scheduled time
[17:13:24] math-project-views schedule: @monthly Waiting for scheduled time
[17:13:26] most-linked-disambig-pages schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:27] most-linked-missing-articles schedule: @daily Last schedule time: 2024-01-31T07:52:00Z
[17:13:29] non-free-images schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:30] number-of schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:32] old-unsource schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:33] only-red-categories schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:35] orphaned-talk schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:36] potd-move schedule: @weekly Last schedule time: 2024-01-30T05:37:00Z
[17:13:38] potd-move-m schedule: @monthly Waiting for scheduled time
[17:13:39] probably-dead-people schedule: @daily Unable to start, out of quota for cpu, memory, memory, pods
[17:13:41] short-articles schedule: @monthly Waiting for scheduled time
[17:13:42] translated-from-unsource schedule: @daily Last schedule time: 2024-01-31T12:22:00Z
[17:13:44] userpage-cats schedule: @daily Last schedule time: 2024-01-31T07:25:00Z
[17:13:45] vital-subject schedule: @monthly Waiting for scheduled time (re @chicocvenancio: but, why are they all named shell did you name your jobs that way?)
[17:13:59] that's from webservice shell
[17:14:03] I used sshfs to open/edit the files in vscode (from WSL), can it be from that? (re @wmtelegram_bot: they are quite old though, so probably leftovers from unclosed connections)
[17:14:05] it creates pods named `shell-*`
[17:14:38] kill all shell pods and it should leave you enough room to start
[17:14:41] I don't think so, unless it's running `webservice shell` somehow
[17:14:47] yep that ^
[17:15:05] how do I do that? (re @chicocvenancio: kill all shell pods and it should leave you enough room to start)
[17:15:24] kubectl pod delete <pod-id>, where pod-id is each of `shell-.....`
[17:16:23] did you mean top? (re @wmtelegram_bot: kubectl pod delete <pod-id>, where pod-id is each of `shell-.....`)
[17:16:48] `kubectl pod delete POD_NAME` I'd do some bash-fu with `kubectl pod delete $(kubect get pod| grep -v READY| cut -f1 -d' ')`
[17:17:30] it's `pod` yes, not `top`, another foo (kubectl this time): `kubectl pod delete -l app.kubernetes.io/component=webservice-interactive`
[17:17:43] we could have a command to clear those up :/
[17:18:04] `kubectl delete pod $(kubectl get pod| grep -v READY|grep shell| cut -f1 -d' ')`
[17:18:31] thanks, this one worked. pod delete didn't, delete pod did (re @chicocvenancio: kubectl delete pod $(kubectl get pod| grep -v READY|grep shell| cut -f1 -d' '))
[17:18:55] hahaha, yep, got the wrong order
[17:18:57] yeah, we both mistyped
[17:20:29] https://hy.wikipedia.org/w/index.php?diff=prev&oldid=9098175 yey, my bot is working again. thank you all ^_^
[17:23:26] but in general, is there a way to get some statistics about jobs (like how long they took to complete or how many resources they used)?
[17:28:44] k8s-status and the linked Grafana dashboard may help (e.g., https://k8s-status.toolforge.org/namespaces/tool-jjmc89-bot/ for my jjmc89-bot tool)
[17:36:06] thanks. this is great (re @wmtelegram_bot: k8s-status and the linked Grafana dashboard may help (e.g., https://k8s-status.toolforge.org/namespaces/tool-jjmc89-bot...)
[18:08:39] Is our redis instance flaky in some way?
[18:09:41] I'm running a celery instance for link-dispenser and it keeps failing with a timeout error `redis.exceptions.TimeoutError: Timeout reading from redis.svc.tools.eqiad1.wikimedia.cloud:6379`
[18:11:46] https://phabricator.wikimedia.org/P56080 for the full logs
[20:17:03] !log paws prometheus and kube-state-metrics internal to cluster T355179
[20:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[20:17:09] T355179: Move prometheus inside of the cluster - https://phabricator.wikimedia.org/T355179
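[Editor's note] On the Celery/Redis timeouts reported for link-dispenser at 18:09 above: a hedged sketch of the knobs Celery exposes for a slow or shared Redis broker. The app name, the database number 0, and the specific timeout values are assumptions for illustration; whether the shared tools Redis itself is flaky is a separate question for the admins, which the Phabricator paste above should help answer.

```python
# celery_app.py -- illustration only, not link-dispenser's actual configuration.
# The broker host matches the error message above; everything else is assumed.
from celery import Celery

app = Celery(
    "link_dispenser",  # hypothetical app name
    broker="redis://redis.svc.tools.eqiad1.wikimedia.cloud:6379/0",
)

app.conf.update(
    # Keep retrying the broker connection at worker startup instead of exiting.
    broker_connection_retry_on_startup=True,
    # Options handed to the underlying redis-py connection by kombu.
    broker_transport_options={
        "socket_timeout": 30,          # allow slow reads before raising TimeoutError
        "socket_connect_timeout": 10,  # seconds to establish the TCP connection
        "socket_keepalive": True,      # keep idle broker connections alive
    },
)
```

If the worker still times out with generous socket settings, that points back at the shared Redis instance itself rather than the client configuration.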