[12:39:20] !log admin removed some logs from the cloudmetrics1003:/var/log/carbon/ directory and stopped the carbon processes (they were crashing and filling up the disk with logs)
[12:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[14:08:39] !log paws T343116 update helm chart, jupyterlab, jupyterhub, notebook. Dropping sparql to allow for update.
[14:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[14:08:43] T343116: Upgrade paws helm chart - https://phabricator.wikimedia.org/T343116
[14:27:10] Just read that paper. I wonder how these numbers will look if the gene subtree gets cleaned out.
[14:44:35] Nevermind, wrong channel. What I wanted to ask here: Can a toolforge root kill grid job 1496252? Seems to be stuck and I'm unable to kill it myself on the compute node (re @MaartenDammers: Just read that paper. I wonder how these numbers will look if the gene subtree gets cleaned out.)
[14:46:34] done
[16:01:37] !log admin rebooting cloudvirt2001-dev in an attempt to figure out what's happening with bastions
[16:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[16:19:18] !log wikistats delete wikistats-bullseye VM
[16:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikistats/SAL
[19:27:32] Thanks Taavi, it's running again https://commons.wikimedia.org/wiki/Special:ListFiles/GeographBot (and looks like it passed the 5M uploads) (re @wmtelegram_bot: done)
[19:27:42] Hello. I seem to be unable to execute shell commands and access the wikiquantos tool in toolforge. I tried deleting the venv, but that didn't work, as it seems to be some kubernetes problem. This is my error:
[19:27:43]
[19:27:45] yaml.scanner.ScannerError: while scanning a simple key
[19:27:46] in "<unicode string>", line 20, column 1:
[19:27:48] ata/project/wikiquantos/.toolsku ...
[19:27:49] ^
[19:27:51] could not find expected ':' in "<unicode string>", line 21, column 1
[19:27:52]
[19:27:54] Can anyone help me on this? My other tools are fine, and I didn't change anything for a while in this tool :/
[19:29:33] That suggests you're trying to read a yaml file, but it's not formatted correctly. You can try putting it in a tool like https://www.yamllint.com/ to see what is wrong with it. (re @ederporto: Hello. I seem to be unable to execute shell commands and access the wikiquantos tool in toolforge. I tried deleting the venv, bu...)
[19:30:25] Already did that, My config file is valid, and it was working until at least July 18 (re @MaartenDammers: That suggests you're trying to read a yaml file, but it's not formatted correctly. You can try putting it in a tool like https:/...)
[19:35:52] You should probably pastebin a bit more info. What is the full output? Where is the yaml file you're trying to load? Etc. (re @ederporto: Already did that, My config file is valid, and it was working until at least July 18)
[19:37:08] File "/usr/local/bin/webservice", line 11, in <module>
[19:37:09] load_entry_point('toolforge-webservice==0.1', 'console_scripts', 'webservice')()
[19:37:10] File "/usr/lib/python3/dist-packages/toolsws/cli/webservice.py", line 191, in main
[19:37:12] KubernetesBackend.get_types(),
[19:37:13] File "/usr/lib/python3/dist-packages/toolsws/backends/kubernetes.py", line 271, in get_types
[19:37:15] kubeconfig=Kubeconfig.load(), user_agent="webservice"
[19:37:16] File "/usr/lib/python3/dist-packages/toolforge_weld/kubernetes_config.py", line 62, in load
[19:37:18] return cls.from_path(path=path)
[19:37:19] File "/usr/lib/python3/dist-packages/toolforge_weld/kubernetes_config.py", line 38, in from_path
[19:37:21] data = yaml.safe_load(path.read_text())
[19:37:22] File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 94, in safe_load
[19:37:24] return load(stream, SafeLoader)
[19:37:25] File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 72, in
load
[19:37:27] return loader.get_single_data()
[19:37:28] File "/usr/lib/python3/dist-packages/yaml/constructor.py", line 35, in get_single_data
[19:37:30] node = self.get_single_node()
[19:37:31] File "/usr/lib/python3/dist-packages/yaml/composer.py", line 36, in get_single_node
[19:37:33] document = self.compose_document()
[19:37:34] File "/usr/lib/python3/dist-packages/yaml/composer.py", line 55, in compose_document
[19:37:36] node = self.compose_node(None, None)
[19:37:37] File "/usr/lib/python3/dist-packages/yaml/composer.py", line 84, in compose_node
[19:37:39] node = self.compose_mapping_node(anchor)
[19:37:40] File "/usr/lib/python3/dist-packages/yaml/composer.py", line 133, in compose_mapping_node
[19:37:42] item_value = self.compose_node(node, item_key)
[19:37:43] File "/usr/lib/python3/dist-packages/yaml/composer.py", line 82, in compose_node
[19:37:45] node = self.compose_sequence_node(anchor)
[19:37:46] File "/usr/lib/python3/dist-packages/yaml/composer.py", line 111, in compose_sequence_node
[19:37:48] node.value.append(self.compose_node(node, index))
[19:37:49] File "/usr/lib/python3/dist-packages/yaml/composer.py", line 84, in compose_node
[19:37:51] node = self.compose_mapping_node(anchor)
[19:37:53] File "/usr/lib/python3/dist-packages/yaml/composer.py", line 133, in compose_mapping_node
[19:37:55] Which part of use a pastebin didn't you get?
[19:39:22] And I assume you used the manual at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Loading_jobs_from_a_YAML_file when you created it?
[19:41:05] Sorry, I'm not used to pastebins, didn't catch the instruction. I'll look into this problem a bit further and seek help in Phabricator if I confirm that is a Kubernetes problem, as I suspect. Cheers
[19:41:55] could not find expected ':' in "<unicode string>" <- key set, but no variable (re @ederporto: Sorry, I'm not used to pastebins, didn't catch the instruction. I'll look into this problem a bit further and seek help in Phabr...)
[19:42:33] If I have a python venv and I want to set up a crontab to execute a script within that venv, what is the latest way of doing that?
[19:42:39] paste in https://paste.toolforge.org/ and send the output link here (re @ederporto: Sorry, I'm not used to pastebins, didn't catch the instruction. I'll look into this problem a bit further and seek help in Phabr...)
[20:00:49] hmm, LiftWing endpoints working on Toolforge? I am trying to create a wrapper in PHP to query the goodfaith/damaging status of a particular edit and the HTTP request does not return anything (in more than a few minutes)
[20:06:49] External endpoint works fine, by the way
[20:12:24] @harej: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework is the job scheduler I would suggest. I generally find using a venv from a script to be easiest by using the full path to the python (or support script) inside the venv. Something like `$HOME/my_venv/bin/python3 ...`
[20:13:58] @Yetkin: what endpoints are you trying that are failing? I would expect Toolforge/Cloud VPS to call public endpoints for any Wikimedia service.
[20:14:56] Thank you for the link, Chico. I went to Phabricator, the bites are usually gentler there :) (re @chicocvenancio: paste in https://paste.toolforge.org/ and send the output link here)
[20:17:23] https://inference.discovery.wmnet:30443/v1/models/enwiki-goodfaith:predict fails (re @wmtelegram_bot: @Yetkin: what endpoints are you trying that are failing? I would expect Toolforge/Cloud VPS to call public endpoints for...)
[20:18:29] @Yetkin: that will definitely fail from outside of the Wikimedia production network and all of Cloud VPS is outside of the Wikimedia production network
[20:22:26] okay, thanks for the info. I may try to use the external endpoint
[20:24:09] @Yetkin: https://wikitech.wikimedia.org/wiki/Cross-Realm_traffic_guidelines broadly describes the supported traffic flows between the cloud and "prod" networks if you are interested in more details.
[20:24:10] We're trying to help you. Can you share the config.yaml? (re @ederporto: Thank you for the link, Chico. I went to Phabricator, the bites are usually gentler there :))
[20:59:42] @ederporto: see T344289 for details about how I have corrected the corrupt config files you found manually. I'll do a bit more checking to see if this is an ongoing issue or if I can find other tools that ended up with broken config in the same way from some past maintain-kubeusers bug.
[20:59:43] T344289: Corrupt $HOME/.kube/config preventing use of Kubernetes for wikiquantos, wikiroupas, and possibly more tools - https://phabricator.wikimedia.org/T344289
[21:50:32] bd808: do I understand correctly that the environment is generated anew on each run?
[21:51:40] @harej: I am not sure I understand your question. Can you rephrase?
[21:52:30] In https://wikitech.wikimedia.org/wiki/Help:Toolforge/Python#Kubernetes_python_jobs is the work to build the environment a one time task (like building docker containers), or does it happen every time the job is run?
[21:54:33] @harej: ah, thanks for the context. It would be a one time need similar to when using a python venv for a webservice to create a venv in the tool's $HOME.
[21:55:49] I think those instructions are trying pretty hard to point out that the venv needs to be built from inside the same runtime container that will be used later when executing the periodic job.
[21:56:01] in trying so hard maybe things became a bit muddy
[21:58:10] That does remind me that I need to make sure that I have my venv built correctly
[22:04:30] @harej: I tried to clarify on the help page, but additional documentation updates are very welcome if the wording is still ambiguous. https://wikitech.wikimedia.org/w/index.php?title=Help:Toolforge/Python&diff=prev&oldid=2099971
[22:05:03] That is definitely clearer, thank you
[22:26:09] Thank you for your help!
(re @wmtelegram_bot: @ederporto: see T344289 for details about how I have corrected the corrupt config files you found manually. I'll do a bi...)
[22:40:17] Is it possible for jobs to run for longer than 5 minutes?
[22:42:04] @harej: they should be able to run indefinitely. If you are seeing a job killed quickly it is most likely running out of RAM
[22:42:33] I got this error: "ERROR: timed out 300 seconds waiting for job 'credbot' to complete:" is that specific to --wait?
[22:43:41] I think that yes that is the `--wait` timing out. I'm not sure if there is a way to make that timeout longer.
[22:44:06] Could I replace the --wait with an email parameter?
[22:44:18] You do not technically need the --wait option. I think that's in the tutorial mostly to make it less complicated
[22:44:43] Mainly I want some indication if the script succeeds or (especially if) it errors
[22:44:48] yeah you could have it email you on completion or just poll for the end state
[22:46:28] I'm not going to update the wiki page with this because too many options tend to confuse folks, but you can also use `webservice python3.11 shell` or similar for an interactive kubernetes environment where you can build your venv too.
[22:46:38] doesn't seem like it's configurable https://github.com/wikimedia/cloud-toolforge-jobs-framework-cli/blob/90805dc0f53f38389a17d3014f4bc7fb8573f761/tjf_cli/cli.py#L36
[22:47:25] And how would --emails know where to send the email? I assume the email address associated with my LDAP account?
[22:48:09] @harej: I think it emails to $TOOL.maintainers which then expands to the LDAP email addresses for all maintainers
[22:48:28] more at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Email#Mail_to_a_Tool
[22:48:53] This seems like an improvement over how it worked before
[22:49:16] I think booting into the pod shell, and doing stuff within that, was too many layers of abstraction to deal with
[22:49:21] on top of the venv
[23:01:53] So, tried again, now I have been OOM killed. Is it possible to allocate more RAM?
[23:04:47] @harej: yes. Use `--mem 2G` or similar. The default is 512M. `toolforge-jobs run --help` for more options.
[23:20:20] Something going on with toolsdb? at 23:09 I got "Lost connection to MySQL server during query" then "Can't connect to MySQL server on 'tools.db.svc.wikimedia.cloud' ([Errno 111] Connection refused)" when trying to reconnect. Seeing it from the plagiabot tool.
[23:23:00] JJMC89: I can reproduce and I see an alert for "ToolsToolsDBWritableState" at https://prometheus-alerts.wmcloud.org/?q=project%3Dtools
[23:25:59] Aug 15 23:09:12 tools-db-1 kernel: [15825831.804009] Out of memory: Killed process 55492 (mysqld) total-vm:64824560kB, anon-rss:63601672kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:125676kB oom_score_adj:-600
[23:26:05] that will do it :/
[23:26:55] oh, I just arrived with a theory about the db outage but it sounds like the server was OOM?
[23:27:04] So probably not related to the backup job that may or may not have just started.
[23:27:27] andrewbogott: yeah. OOMkiller took out the mysqld process.
[23:27:41] I'm grabbing some logs for a bug report and then will reboot it.
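The OOM-killer line quoted at 23:25:59 is easier to read once its counters are converted to GiB. The following is a minimal sketch, not something anyone in the channel ran: the helper names `parse_oom_line` and `kb_to_gib` are hypothetical, and it assumes the kernel's "kB" figures are units of 1024 bytes.

```python
import re

# Kernel OOM-killer line, copied from the 23:25:59 message in this log.
OOM_LINE = (
    "Aug 15 23:09:12 tools-db-1 kernel: [15825831.804009] Out of memory: "
    "Killed process 55492 (mysqld) total-vm:64824560kB, anon-rss:63601672kB, "
    "file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:125676kB oom_score_adj:-600"
)

KIB_PER_GIB = 1024 * 1024  # assumption: kernel "kB" means 1024-byte units


def parse_oom_line(line):
    """Extract the killed process and its memory counters (in kB).

    Hypothetical helper for illustration only.
    """
    proc = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if proc is None:
        raise ValueError("not an OOM-killer 'Killed process' line")
    # Only fields of the form name:<digits>kB are memory counters;
    # UID and oom_score_adj carry no kB suffix and are skipped.
    counters = {
        name: int(value)
        for name, value in re.findall(r"([\w-]+):(\d+)kB", line)
    }
    return {"pid": int(proc.group(1)), "name": proc.group(2), "kb": counters}


def kb_to_gib(kb):
    """Convert a kB counter to GiB."""
    return kb / KIB_PER_GIB


if __name__ == "__main__":
    info = parse_oom_line(OOM_LINE)
    print(f"killed: {info['name']} (pid {info['pid']})")
    for name, kb in info["kb"].items():
        print(f"  {name}: {kb_to_gib(kb):.2f} GiB")
```

Under that assumption the resident anonymous memory of mysqld comes out at a little over 60 GiB, which matches the reading in the channel that the process itself exhausted the host's RAM.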
[23:27:55] ok :) That's all i would do but lmk if I can be supportive
[23:28:50] As it happens that db volume /did/ start to be backed up 30 minutes ago
[23:29:09] which seems suspicious but I can't think how it would relate since it's backing up a snapshot
[23:29:32] !log tools Rebooted tools-db-1.tools.eqiad1.wikimedia.cloud for T344298
[23:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[23:29:36] T344298: mysqld killed by oomkiller on tools-db-1.tools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T344298
[23:32:34] andrewbogott: do you know how to fail over to the replica toolsdb instance?
[23:33:00] hm... I'd have to hunt for docs. Is -1 not coming up happily?
[23:33:21] instance came up, but mariadb is angry
[23:34:50] "Table 'mysql.db' doesn't exist" that seems bad, and odd
[23:34:57] like it's not pointed at the right db files
[23:35:04] oh I bet the volume isn't mounted?
[23:35:09] https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Toolsdb#Failing_over_Toolsdb looks old and scary...
[23:35:13] it isn't
[23:35:39] * bd808 lets andrewbogott do the needful
[23:37:37] hm, so it has an entry in fstab, and when I 'mount /srv' it claims success but actually does nothing
[23:42:11] hm, prepare-cinder-volume sure has been rewritten since I saw it last and now does something different
[23:42:18] andrewbogott: I'm not seeing an entry in /etc/fstab for a /srv volume
[23:42:36] Yep, I removed it in hopes of coaxing prepare-cinder-volume to recreate with the right uuid
[23:42:41] but it doesn't do that anymore I guess
[23:42:49] So -- this should be very simple
[23:42:53] we just want to mount the volume on /srv
[23:43:05] this one: /dev/sdb1: UUID="3c81e678-dc67-45e2-ba8c-4999dffe441f" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="e1ebd472-89ba-4e8f-bec0-42fcda79126f"
[23:43:43] Which should be the fstab line:
[23:44:02] UUID=e1ebd472-89ba-4e8f-bec0-42fcda79126f /srv ext4 discard,nofail,x-systemd.device-timeout=2s 0 2
[23:44:37] so why does mount totally ignore me when I type 'mount /srv'?
[23:44:42] bd808: any idea?
[23:45:48] * bd808 tries
[23:46:08] mount /dev/sdb1 /srv also seems to do nothing
[23:46:24] hmmm.. no error to console. Is there anything in syslog?
[23:46:38] "srv.mount: Succeeded."
[23:47:35] I propose rebooting again now that fstab is correct, maybe there's locking contention for the mount point?
[23:47:47] worth a shot
[23:48:00] btw, do you think fstab should get the partuuid or the uuid?
[23:48:08] I assume the partition since that's what we actually mount...
[23:48:29] !status toolsdb outage (T344298)
[23:48:30] T344298: mysqld killed by oomkiller on tools-db-1.tools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T344298
[23:48:47] * andrewbogott reboots it, has a 50/50 chance
[23:49:12] thanks JJMC89
[23:49:38] hm, that reboot didn't do anything
[23:49:44] andrewbogott: yeah, no idea.
I guess you could compare to the fstab on -2
[23:50:41] nah, it's wrong there too, I assume because of devices being swapped and fstab not getting updated?
[23:50:53] or at least the uuids in fstab don't match anything coming from blkid
[23:52:02] trying one more reboot
[23:52:12] third time lucky!
[23:52:25] yep
[23:52:52] logs seem happy now...
[23:53:04] data seems to be there in /srv
[23:53:33] and openstack-browser is working which is my default 'is toolforge broken?' test
[23:53:40] although possibly you're going to tell me it doesn't use toolsdb
[23:53:50] it does not
[23:53:55] well then
[23:54:01] have a test case at your fingertips?
[23:54:19] I bet JJMC89 does
[23:54:23] mariadb needs to be started still. I'll do it
[23:54:37] ok! And make sure it's read/write
[23:54:44] some weird default behaviors in those puppet classes
[23:54:50] "Job for mariadb.service failed because the control process exited with error code." -- grrr
[23:55:51] andrewbogott: the config seems to be that volume mounted to /srv/labsdb
[23:56:03] *seems to want
[23:56:05] my jobs are still trying to connect - I can check logs if it is up again
[23:56:10] JJMC89: not yet, sorry
[23:56:44] bd808: I'll move it
[23:56:45] Rook: FYI, toolsdb issue is being worked on here
[23:57:16] bd808: try now?
[23:57:44] (meanwhile I am fixing /etc/fstab on the replica as well, in hopes of avoiding this next time)
[23:57:54] it did something different :)
[23:58:07] I can connect now
[23:58:08] it says 'Started mariadb database server.'
[23:58:11] is it read/write?
[23:58:12] or ro?
[23:58:36] good morning/evening(?), what's up? can I help?
[23:58:40] good question. JJMC89 can you do a write test easily?
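On the UUID versus PARTUUID question raised at 23:48:00: in /etc/fstab, a `UUID=` key is matched against the filesystem UUID and a `PARTUUID=` key against the partition UUID, so the value has to come from the blkid tag of the same name. The fstab line quoted at 23:44:02 pairs the `UUID=` key with the PARTUUID value from the blkid output. The sketch below only illustrates that pairing; `parse_blkid` and `fstab_line` are hypothetical helpers, the mount point `/srv/labsdb` is taken from the later messages, and nothing here is claimed to be what was actually written on the host.

```python
import re

# blkid output for the volume, copied from the 23:43:05 message in this log.
BLKID_LINE = (
    '/dev/sdb1: UUID="3c81e678-dc67-45e2-ba8c-4999dffe441f" BLOCK_SIZE="4096" '
    'TYPE="ext4" PARTUUID="e1ebd472-89ba-4e8f-bec0-42fcda79126f"'
)


def parse_blkid(line):
    """Split one blkid line into (device, {TAG: value}). Hypothetical helper."""
    device, _, rest = line.partition(": ")
    tags = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return device, tags


def fstab_line(tags, mountpoint, options, key="UUID"):
    """Build an fstab entry whose key and value come from the same blkid tag.

    key may be "UUID" (filesystem UUID) or "PARTUUID" (partition UUID).
    """
    if key not in ("UUID", "PARTUUID"):
        raise ValueError("key must be 'UUID' or 'PARTUUID'")
    return f"{key}={tags[key]} {mountpoint} {tags['TYPE']} {options} 0 2"


if __name__ == "__main__":
    device, tags = parse_blkid(BLKID_LINE)
    options = "discard,nofail,x-systemd.device-timeout=2s"  # from the log
    print(f"# entries for {device}")
    print(fstab_line(tags, "/srv/labsdb", options, key="UUID"))
    print(fstab_line(tags, "/srv/labsdb", options, key="PARTUUID"))
```

Either printed form identifies the same partition; the point is only that the key name and the value must be taken from the same tag.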
[23:59:00] taavi: T344298 -- we just got the db running again
[23:59:01] T344298: mysqld killed by oomkiller on tools-db-1.tools.eqiad1.wikimedia.cloud - https://phabricator.wikimedia.org/T344298
[23:59:41] toolsdb is r-o by default when it starts, to ensure no writes are done before someone checks replication works fine