[07:47:40] morning
[08:03:11] hmm, nfs seems to be working again (slow though) on tools-k8s-worker-1
[08:04:28] there's still some D processes though
[08:13:54] morning
[08:14:16] is it working again because you did something special?
[08:14:31] nope, but now I can't reproduce the issues xd
[08:14:43] (or not that I know)
[08:14:55] :-S
[08:16:09] I'll play for a few minutes to try to see if it gets stuck again, but if not I'll reboot and wait for the next problem (I'm writing what I tried here T362690)
[08:16:09] T362690: [infra] NFS hangs in some workers until the worker is rebooted - https://phabricator.wikimedia.org/T362690
[08:20:00] 👍 sounds good
[08:20:10] hmm, if I ls -l directly the cwd of one of the stuck processes (ex. ls -la /proc/2701910/cwd/) then it gets stuck
[08:20:37] lrwxrwxrwx 1 dcaro wikidev 0 Apr 16 10:24 /proc/2165614/cwd -> /mnt/nfs/labstore-secondary-tools-home/dcaro
[08:22:02] not only directly, okok, so ls-ing that specific directory gets stuck, ls -l of the parent does not
[08:32:43] so, I assume somehow the NFS connection is just borked
[08:35:06] only for some directories
[08:35:23] mounting as `soft` instead of `hard` works right away, and faster
[08:36:45] ```
[08:36:48] https://www.irccloud.com/pastebin/hAVmKXTH/
[08:37:06] I wonder if `intr` is just not supported on our nfs stack
[08:40:11] it's also interesting that we add the `timeo` option, that only has an effect for `soft` mounts
[08:47:20] `intr` is not supported xd
[08:47:25] `intr / nointr This option is provided for backward compatibility. It is ignored after kernel 2.6.25.`
[08:56:29] hmm, there's a lot of this traffic going on on the worker that gets stuck:
[08:56:37] https://www.irccloud.com/pastebin/qBNoeNgU/
[08:56:52] but not so much on one of the workers that does not get stuck
[08:57:16] both harbors are stuck again
[08:57:27] (as in, 100 pkts/sec vs tens)
[08:57:33] blancadesal: dammit xd
[08:58:12] let's get an lsof
[08:58:19] dcaro: I'm not sure what that means
[08:58:31] also, I'm curious about `cksum 0x6054 (incorrect -> 0x74c6)`
[08:59:51] blancadesal: found `nginx 2245529 www-data 775u REG 0,38 1073741824 9268222 /var/lib/nginx/proxy/6/38/0000044386 (deleted)`
[09:01:21] the file has been deleted while it was still open by a process?
[09:02:26] yep, and the process seems to still be using it (I see syscalls to file descriptor 775)
[09:02:36] storing an strace for a few seconds to read later
[09:03:42] or better, copied the file over xd `root@proxy-03:~# cp /proc/2245529/fd/775 /tmp/badfile`
[09:03:45] okok, we can restart
[09:03:58] should be back online
[09:04:00] if we increased the tmpfs size, would it just keep filling up, or would the file eventually be flushed?
[09:05:00] I don't know, I guess it might depend on the actual size of whatever is being buffered
[09:05:03] https://www.irccloud.com/pastebin/M8iUL7gL/
[09:05:08] not very useful xd
[09:07:47] nothing pops up from `strings /tmp/badfile`
[09:09:47] if this was a beginner-level wargame challenge, it would have given something xd
[09:12:05] hahaha
[09:12:25] the first time that it failed writing to the tmpfs was `2024/04/17 06:12:14`
[09:16:48] hmm... if the request was still ongoing (it was still using the temp file), it would not be in the access log, right? that gets written after the request finishes (has the status code)
[09:17:14] amazonbot is doing weird requests :/
[09:18:36] is that a toolforge tool?
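For context on the NFS side of the hunt above, a minimal sketch of the kind of checks involved; the commands are illustrative rather than the exact procedure used in the log:

```
# Show how the NFS shares are actually mounted (hard/soft, timeo, retrans, vers, ...):
findmnt -t nfs,nfs4 -o TARGET,SOURCE,OPTIONS

# List processes stuck in uninterruptible sleep (D state), typical of hung NFS I/O,
# together with the kernel function they are waiting in:
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Note: on a `hard` mount (and with `intr` ignored on modern kernels) a hung ls cannot
# be killed by signals, so probing a suspect path is best done from a throwaway shell.
```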
[09:19:05] sus name
[09:20:24] no, it's amazon, I'm reading the user agent `(Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)`
[09:20:44] so, it is an incoming request to the front proxy?
[09:20:48] the ip is also from amazon
[09:21:15] yep, I'm reading the access logs of the proxy, but I don't think that the request that makes the tmpfs fill up is going to be there
[09:22:33] probably not, I think the log entry only happens when the request has been responded to
[09:24:20] wouldn't a 'stuck' request time out after a while or something, with tmp files etc being closed/deleted?
[09:25:08] we set a long timeout, to allow streaming for example
[09:26:31] I think I'll try https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path , with a `min_free` set to something like 100M
[09:26:46] that should also encompass the proxy temp directory (that by default is in the cache path it seems)
[09:26:51] wdyt?
[09:29:11] we can also try to put the nginx temp directory in the filesystem and make it bigger (though might slow down big requests, as it would need to write to disk instead of RAM)
[09:30:01] there's also the `inactive` parameter that seems to control how long an item can remain in the cache without being accessed, maybe that would help?
[09:30:01] I think we should experiment a bit maybe too, create a simple webserver that just streams random stuff continuously and see if that affects the proxy
[09:30:15] blancadesal: it was being accessed, and it had been deleted already :/
[09:30:22] argh
[09:30:34] (it was not the cache either, but proxy buffering, under `proxy/../..`)
[09:34:46] where can I see the current settings? (not that I understand much of this, but I'm curious)
[09:39:15] on the host (you get the actual settings, under `/etc/nginx/`), or in puppet (under `modules/dynamicproxy`)
[09:39:42] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/dynamicproxy
[09:40:05] mostly templates for the base config https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/dynamicproxy/templates/
[09:48:16] oh, we got one again
[09:48:17] nginx 2268187 www-data 797u REG 0,38 103800832 9280705 /var/lib/nginx/proxy/2/34/0000001342 (deleted)
[09:48:34] wait, let me count my bytes
[09:48:35] xd
[09:50:21] that's 100M, not 1G xb
[09:51:11] did you tweak the settings?
[09:51:17] I did not
[09:55:31] * dcaro looking at big http requests
[09:58:20] there's several over 200MB for harbor, but I did not see the temp file being created
[10:08:34] I think we are buffering responses from the backends
[10:09:50] this should disable all buffering https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020767
[10:11:21] hmm, maybe I can set proxy_max_temp_file_size to 0 instead, so there's some buffering, but it will not use temp files?
[10:14:41] so the patch forces nginx to pass data directly to the client without writing it to disk?
[10:15:31] that patch also forces it to not buffer any data at all, it just forwards everything it gets from the backend to the client
[10:15:36] connections might then remain open longer, but that's maybe not a problem?
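As a rough way to keep an eye on what the proxy buffering is doing while testing: the lsof/df pair matches what was done by hand above, while the throwaway streaming backend is only the crude version of the experiment floated earlier, not anything that was actually deployed (port and nc flavour are assumptions):

```
# Deleted-but-still-open files held by nginx (unlinked proxy temp buffers show up here),
# plus how full the tmpfs backing /var/lib/nginx is:
watch -n 10 'lsof -nP +L1 -c nginx; df -h /var/lib/nginx'

# Crudest possible "backend that streams random stuff forever", assuming OpenBSD nc
# and an arbitrary port; point a webservice/proxy route at it and watch the temp dir:
{ printf 'HTTP/1.1 200 OK\r\nConnection: close\r\n\r\n'; cat /dev/urandom; } | nc -l 8000
```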
[10:16:05] yep, I can use the `proxy_max_temp_file_size: 0` to allow small responses to be buffered in memory, and only pass the ones that are big
[10:16:39] that might be a good in-between solution
[10:19:04] thanks for the review arturo, just updated it to use something less aggressive https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020767
[10:19:38] dcaro: also looks good to me
[10:28:52] deployed, let's see if that helped
[10:35:06] xd, I triggered the D processes alert on tools-worker-nfs-1 with my tests
[11:02:16] okok, the node is back online
[11:30:13] * dcaro lunch
[12:46:46] * blancadesal cautiously refreshing the harbor ui from time to time checking it's not ded
[12:53:32] I'm checking the proxy with `df` and `lsof` too every few minutes xd
[13:04:35] * arturo food time
[13:29:01] blancadesal: are you using this branch? https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/tree/oapi-codegen-changes?ref_type=heads (seems superseded by https://gitlab.wikimedia.org/repos/cloud/toolforge/envvars-api/-/merge_requests/28), if not can I delete it?
[13:33:01] dcaro: you can delete it
[13:48:02] after a long investigation, I found that the issue I was having yesterday with "bump_version.sh" and file permissions is a bug in Docker for Mac :(
[13:48:09] https://github.com/docker/for-mac/issues/6243#issuecomment-1930959547
[13:50:20] using the legacy "osxfs" instead of virtiofs, everything works fine
[14:00:41] quick review? https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/30
[14:02:36] O.o great debugging!
[14:03:34] thanks for reviewing! question: is the .deb package created by the CI published somewhere?
[14:03:48] not automatically, you can get it from the pipeline artifacts
[14:04:12] I see
[14:05:13] what about the git tag? do you usually push it with "git push origin {tag}"?
[14:05:20] yep
[14:05:49] after merge (so when the tagged commit is in main, though might not matter)
[14:06:04] I also create a release manually (just me), so the badge in the repo gets updated
[14:06:47] I want to add some automation so when the bump version MR is merged, it will create the tag + release + add deb packages to the release automatically (not sure if the last is possible though, you can't add any files manually, just links)
[14:08:42] maybe we could push the .deb to object storage? (just an idea)
[14:08:46] or directly to a deb repo?
[14:09:53] yep :)
[14:10:30] taavi wanted to add some small server in the tools-services VM to pull the package into the debian repos (iirc)
[14:11:49] automating publishing from gitlab-ci to the toolforge apt repo is on my todo list. i think at this point that basically means migrating from aptly to reprepro and then adopting the main sre team's in-progress automation for doing the same
[14:12:06] btw what's the situation now that we have bookworm bastions? do we publish debs for buster+bullseye+bookworm?
[14:13:20] I think so yes, until we consolidate the bastion OSes
[14:13:35] I've been only publishing for toolforge
[14:13:41] only for bookworm*
[14:16:02] ha, indeed we have slightly older versions of packages in tools-sgebastion10 (buster)
[14:17:02] is it something we care about?
[14:17:59] I think we do as long as sgebastion-10 is login.toolforge.org
[14:18:21] yep, we do, hopefully not for long
[14:18:33] but we could swap login. to be a bookworm bastion, keep the login-buster. alias as is to -10, and then care significantly less
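For reference, the manual tag-and-release flow described a bit further up looks roughly like this; the version number is made up, and the GitLab release/badge step itself is done through the web UI:

```
# after the bump-version MR is merged into main:
git checkout main && git pull
git tag v1.2.3            # hypothetical version matching the bump
git push origin v1.2.3
# the .deb built by CI is not published automatically; grab it from the pipeline artifacts
```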
[14:18:54] that might be a good step yes :)
[14:19:14] I'll send an email to cloud-announce, any preference for the notice period?
[14:19:26] notice period for what?
[14:19:48] you want to swap and then notify?
[14:21:27] I think that waiting a week would be ok
[14:21:50] I don't see why we could not do that, especially as when I swapped dev I said that login. is likely to swap soon too
[14:21:59] yeah but that's more cloud-announce email spam :(
[14:22:45] * dcaro was rereading the announce
[14:23:22] yep, you said it, I think it's ok too, I will just reply to the same email then
[14:23:50] just make sure to update https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/login.toolforge.org
[14:24:13] thanks! I would have forgotten about that
[14:24:40] tools-bastion-12 is currently dev., looks like I already created -13 which should become the new login. I think
[14:25:43] ack
[14:28:00] -13 does not have a floating ip yet, I'll add one (can't reuse the old one)
[14:28:43] wait, can I use one of the ones that are free there?
[14:28:45] (I guess)
[14:29:00] * taavi looks
[14:29:43] should be fine, I guess? I wonder why those floating IPs exist in the first place
[14:30:01] yep, I'll ask everyone in the meeting, and release them if they are not parked for a reason
[14:53:54] dcaro: hopefully quick question. With the stack we use today is it possible for both the python and golang buildpacks to run for the same build? Or is it only nodejs that can run alongside another language buildpack?
[14:54:35] bd808: only specific ones yes, like nodejs/apt/procfile
[14:55:06] it should not be very hard though to let power users specify which of the available ones they want
[14:55:21] (as we generate the list of "stacks" on the fly already)
[14:57:23] that could be neat. I was trying to make a thing for the bridgebot tool with python + golang. Mostly because I had code I could copy-n-paste for most of the python bit.
[14:57:34] I think that there's no task for it yet though, feel free to open one, it's not in the current roadmap, but I might give it a go if I have an idle day/want to switch topics
[14:58:36] For this particular project I need a thing in the container to generate a config file at runtime because the upstream code doesn't do 12-factor envvar stuff.
[14:59:29] xd
[14:59:36] * bd808 will write a feature request later
[14:59:41] 👍
[15:01:06] quick review here https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/76
[15:05:58] arturo: lgtm, but the pipeline failed with a mypy error
[15:06:10] yeah, sending a different patch for that
[15:07:38] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/77
[15:07:41] interesting, I wonder why mypy complains now but not before
[15:08:30] oh, I see, new release of toolforge-weld
[15:12:03] hmm, in horizon, in the tools project -> dns -> toolforge.org zone, the SOA record says 'pending'
[15:12:05] https://usercontent.irccloud-cdn.com/file/MmIvULIb/image.png
[15:12:15] any ideas why?
[15:12:59] oh, now it's ok, maybe it was pending because I changed the bastion.toolforge.org? (not sure why it would need to change SOA)
[15:13:28] bump the serial?
[15:14:18] yeah, when you change a record it should automatically bump the serial in the SOA. that should fix itself in a few moments
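A quick way to double-check that from outside Horizon is to look at the zone's SOA serial directly; the zone name is from the discussion above, and querying the local resolver (rather than the authoritative designate servers) is just the lazy version:

```
# the third field of the SOA answer is the serial; it should increase shortly after a record change
dig +short SOA toolforge.org
```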
[15:16:13] ahh, okok
[15:16:14] yep
[15:17:47] swap done :), login.toolforge.org now points to tools-bastion-13
[15:21:04] Rook (or anyone) was there a 'paws-dev' project in codfw that has since been replaced with 'pawsdev'?
[15:21:31] There are some dangling resources associated with paws-dev which I'm going to clean up but I want to make sure I know what's happening first
[15:29:45] Please remove them. Everything is replaceable so it shouldn't matter
[15:34:09] ok!
[15:36:41] * dcaro off
[15:36:43] cya tomorrow
[15:42:24] I need a quick +1 here https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/77
[15:42:51] lgtm
[15:42:58] thanks
[15:45:41] I'm not deploying a new jobs-api version to kubernetes because it's late for me and I won't stay in front of the laptop for much longer today
[15:53:27] * arturo offline
[16:05:01] Getting the Toolforge shared redis out of wikibugs was apparently the right thing to do. It has been running for basically a week without failing to heal itself from the various other interruptions that happen normally.
[16:33:11] I made a bunch of changes to https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Packaging -- if you recently upgraded a toolforge .deb package I would welcome your review :)
[16:40:11] unrelated: there's a message in the moderation queue for cloud-announce, I'm not sure if we should approve it or if it should go to a different list
[16:40:22] (subject is "WDQS Scaling update")
[16:42:13] It went out on wikitech-l. I'm not sure that cloud-announce is the right forum since it really has nothing to do with Cloud VPS or Toolforge. https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/AHOKYOHFMHHDVOSVTFON3PGB5EAUUPX2/
[16:42:59] Someone could nudge Guillaume to send it just to cloud@ instead
[16:43:55] I mostly worry about breaking the social contract that cloud-announce@ is low-volume and important/urgent information.
[16:46:29] i rejected it with a message to that effect
[16:54:11] thanks :)
[17:11:48] dcaro: I see you updated https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/login.toolforge.org. Was there a cloud-announce message that I missed to go along with that too?
[17:13:03] I assume this also means that the new bastion there is like dev.toolforge.org and missing the former grid engine packages. Is there still a "fat" bastion for folks who are managing tools that expect things like perl to use with their deployment tooling?
[17:14:18] There's a message yes, a reply to taavi's one, the old bastion is still at login-buster.t.o (in the message too)
[17:14:44] awesome. thanks!
[17:15:41] Verify that you got it though, I think I remember checking in the archive of the list, but might be wrong
[17:16:51] I just found it in the archive. The threaded view confused me at first apparently.
[20:25:56] tools-k8s-worker-nfs-50 is apparently very sad per -cloud-feed
[20:30:36] I'm trying to log in and it is slooooow
[20:45:48] ok, now the host itself seems OK -- it probably lost NFS for a bit and then recovered. https://prometheus-alerts.wmcloud.org/?q=alertname%3DToolforgeKubernetesWorkerTooManyDProcesses doesn't show anything as stuck now...
[20:46:24] but the other dash shows quite a lot
[20:51:58] whoah, I rebooted it and it somehow still shows stuck processes?
[20:52:19] ah, there we go
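In case it's useful next time, a small post-reboot sanity check of the kind implied above; the host name is the one from the alert (with the usual Cloud VPS FQDN suffix), the share path is the one discussed earlier today, and neither the commands nor any threshold are the official alert logic:

```
# count processes still in D state and make sure the NFS share answers at all
ssh tools-k8s-worker-nfs-50.tools.eqiad1.wikimedia.cloud \
  'ps -eo stat= | grep -c "^D"; timeout 30 ls /mnt/nfs/labstore-secondary-tools-home >/dev/null && echo nfs-ok'
```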