[07:18:33] !log arthurtaylor@tools-bastion-13 tools.phpunit-results-cache deployed 3fb13097895 (build notification support) [07:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.phpunit-results-cache/SAL [07:33:11] !log tools add AAAA record on *.toolforge.org T211575 [07:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [07:33:16] T211575: Enable IPv6 on toolforge.org - https://phabricator.wikimedia.org/T211575 [13:53:05] !log dcaro@tools-bastion-13 tools.wm-lol testing [13:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wm-lol/SAL [14:00:57] !log dcaro@tools-k8s-worker-111 wm-lol test [14:00:59] wmbot~dcaro@tools-k8s-worker-111: Unknown project "wm-lol" [14:00:59] wmbot~dcaro@tools-k8s-worker-111: Did you mean to say "tools.wm-lol" instead? [14:42:48] !log tools.cluebotng-staging cleanup 200G+ of old log files per T395006 [14:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.cluebotng-staging/SAL [14:42:51] T395006: cluebotng-staging tool uses ~560G of disk space - https://phabricator.wikimedia.org/T395006 [14:58:19] Is there a way i can take snapshot of toolsdb? like `mysqldump`? [14:58:46] from toolforge? [15:00:22] mysqldump used to be installed but isn't anymore šŸ˜” T378882 [15:00:36] so, am I out of luck? 
[15:01:05] well, you can go to that task and look at the workaround mentioned there [15:01:32] it amounts to putting the dump in NFS, because we all love NFS [15:01:36] We just switched to toolforge from our droplet [15:02:07] there's a pre-built image that has mysql on it [15:02:11] * dcaro looking [15:03:47] I was already dumping snapshot, compressing, then uploading it on `backup-bot` for the last month [15:20:52] `toolforge jobs run --command "umask o-r; ( mariadb-dump --defaults-file=~/replica.my.cnf --host=tools-readonly.db.svc.wikimedia.cloud credentialUser__DBName > ~/DBname-$(date -I).sql )" --image mariadb backup` [15:21:36] Seems like it would do the job, but it comes with a big warning, *Note that we don't recommend storing backups permanently on NFS (/data/project, **/home**, or /data/scratch on Toolforge) or on any other Cloud VPS hosted drive* [15:24:43] Suddenly, the tools became slower [15:26:49] We are keeping transition period of 30 days for monitoring this tool to determine whether toolforge is the right decision or not [15:48:51] nokibsarkar: yep, you should not rely on toolforge nfs for backups, currently we don't have a backup solution, so you would have to transfer them somewhere else in the long run [15:48:57] can you elaborate on "Suddenly, the tools became slower" ? [15:49:07] (which tool, what action, etc.) [15:49:14] might be NFS misbehaving [15:49:57] toolforge nfs is still better than nothing, i.e. if for some reason there is any data corruption in toolsdb, or unintentional human error that deletes some data, you would still have the nfs backups [15:51:06] yep, but if the backups are too big they might get truncated by us to free space [15:52:23] you're going to have to have some pretty big and fast-growing backups for that to happen without warning [15:52:37] yep [15:52:56] i am planning to keep only one week of daily compressed snapshot. is it ok? [15:53:17] do you care if it's actually there when you need it? 
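A retention policy like the one discussed above (daily compressed dumps, keep about a week) can be enforced with a single `find` invocation. A sketch, assuming the dumps land in a directory with a `.sql.gz` suffix (the directory and file names here are hypothetical):

```shell
# Keep the last N days of *.sql.gz dumps; anything older is deleted.
prune_backups() {
  # $1 = backup directory, $2 = days to keep
  find "$1" -name '*.sql.gz' -type f -mtime "+$2" -delete
}

# Throwaway demo: one stale dump, one fresh dump.
dir=$(mktemp -d)
touch -d '10 days ago' "$dir/db-old.sql.gz"
touch "$dir/db-new.sql.gz"
prune_backups "$dir" 7
ls "$dir"   # only db-new.sql.gz should remain
```

Running this right after each new dump (e.g. as part of the same scheduled job) keeps the NFS footprint bounded.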
(re @nokibsarkar: i am planning to keep only one week of daily compressed snapshot. is it ok?) [15:53:19] if it's about 100MB per each day, it's ok [15:53:34] (I see the last one is 84MB) [15:54:32] with that size, even 2 or 4 weeks would not be an issue [15:54:59] after compression, it became 6mb [15:55:06] then you can keep 1 year :D [15:55:09] if size became an issue you could use something like bup, rdiff-backup. [15:55:10] but if the NFS disk failed then you're screwed. [15:56:02] I think the comparison was vs. previous hosting not a previous state of wmcloud. (re @wmtelegram_bot: can you elaborate on "Suddenly, the tools became slower" ?) [15:56:23] yep [15:56:34] `-rw-r--r-- 1 nokibsarkar tools.backup-bot 6.9M May 22 12:00 campwiz-backup-2025-05-22_12-00-01.sql.tar.gz` [15:56:55] are those numbers real? actual sql takes 84 mb [15:57:10] that's peanuts yep [15:57:13] do you have an example operation we can try to reproduce on toolforge AND on your previous hosting, to compare the speed? [15:57:17] and here the compressed form takes less than 1/10 [15:57:18] sql is text, text compresses well? [15:57:33] tar.gz is pretty good at compressing text, you could use bz2 to compress even more :P [15:59:13] I think there's some comment in https://phabricator.wikimedia.org/T394730#10848302 [15:59:37] > My current performance issue remains loading thumbnail from wikimedia commons server. [15:59:57] dcaro: thanks, I didn't see that task [16:00:22] nokibsarkar: can you elaborate a bit more on how that loading is done? is it the backend pulling directly from commons and then serving the user? or the user's js pulling directly from commons? [16:00:28] is it a background process? [16:00:43] are they stored somewhere? or served on-demand? [16:01:54] user pulling from commons, backend is also providing the thumbnail url [16:01:59] I'm also logged in now, is there a way I can go to one of those pages?
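The ~84 MB → ~7 MB ratio above is typical: an SQL dump is mostly repeated `INSERT` boilerplate, which gzip handles very well. A quick illustration with fabricated data (the table and values are made up, not campwiz's):

```shell
tmp=$(mktemp -d)
# Fake "dump": thousands of near-identical INSERT lines, like a real SQL dump.
for i in $(seq 1 5000); do
  echo "INSERT INTO images VALUES ($i, 'File:Example.jpg', 'Uploader', '2025-05-22');"
done > "$tmp/dump.sql"
gzip -9 -k "$tmp/dump.sql"   # -k keeps the original so we can compare sizes
orig=$(stat -c %s "$tmp/dump.sql")
comp=$(stat -c %s "$tmp/dump.sql.gz")
echo "original: $orig bytes, gzipped: $comp bytes"
```

As noted in the channel, bzip2 or xz usually squeeze text a bit further than gzip, at the cost of more CPU time.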
[16:02:20] if it's the user's browser connecting to commons directly, it's out of toolforge :/ [16:02:37] it has some kind of access management. Let me create a test campaign for u (re @wmtelegram_bot: I'm also logged in now, is there a way I can go to one of those pages?) [16:02:49] ack [16:02:51] thanks [16:03:07] btw. it's really snappy :) [16:03:13] (so far) [16:03:49] Unfortunately I added the redirection rules from previous host to toolforge. if we want to compare, I have to re configure the nginx [16:04:29] we can check first a bit more in detail what's the issue [16:04:36] (what's what is slow) [16:04:57] can u load this: https://campwiz.toolforge.org/campaign/c2c7piesolaf4 [16:05:44] it is awfully slow now, I cannot even go to my admin panel [16:05:44] done, full refresh is <2s, should I go to some of those? [16:05:57] still fast for me :/ [16:06:01] i think my internet is slow then [16:06:21] your wiki username? [16:06:41] dcaro? [16:06:55] DCaro (WMF) [16:08:02] can u go this >> https://campwiz.toolforge.org/campaign/c2c7piesolaf4 Then hit `Evaluation Area` [16:08:23] 1.94s [16:08:41] then my ISP is the villain, i guess [16:08:54] that image is directly from commons yep, so it does not even pass through toolforge [16:09:23] 6MB [16:09:31] that feels like a lot for a thumbnail [16:10:00] I think it might be downloading the full image twice, then the thumbnail [16:10:04] https://usercontent.irccloud-cdn.com/file/D25JN74v/image.png [16:10:48] wow. Now i see the high bandwidth usage complaint by one of the user [16:11:06] luckily the project lead gave them bandwidth stipend [16:11:40] xd [16:12:09] I found metered networks quite annoying [16:12:20] why did it not show up on my terminal? [16:12:37] what terminal? 
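On the 6 MB "thumbnail" point: Commons can serve a pre-scaled thumbnail instead of the original file via `Special:FilePath`'s `width` parameter, so a frontend never has to pull the full-size upload. A sketch of the URL shape (the file name below is a placeholder, not one of campwiz's images):

```shell
file='Example.jpg'   # placeholder; URL-encode real file names
width=640            # requested thumbnail width in pixels
echo "https://commons.wikimedia.org/wiki/Special:FilePath/${file}?width=${width}"
```

Requesting that URL redirects to a server-side-scaled thumbnail, which is typically a few tens of KB instead of megabytes.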
[16:12:47] that's from firefox 'network' tab of the dev tools [16:12:53] my network tab, previously [16:12:54] (the screenshot) [16:13:05] yep, I am talking about that (re @wmtelegram_bot: that's from firefox 'network' tab of the dev tools) [16:13:21] you had some kind of filter set in network tab? [16:13:25] try again? [16:13:34] on the developer console, I was looking for high bandwidth usage, but with no luck [16:13:54] maybe you had it cached [16:14:22] maybe. But thanks for giving me a lead on an unsolvable case [16:14:22] if I re-enable the browser cache, it does not download the images again [16:14:30] np :), happy to help [16:15:34] my load takes 6.89s [16:16:27] what is the slowest thing showing up in the network tab? [16:17:34] https://campwiz.toolforge.org/?_rsc=flnrj [16:17:55] hmm... for me that takes 1.30s [16:18:13] there's a 1.7MB svg too [16:18:19] (not much) [16:18:43] 1.7 MB is huge [16:18:50] it's the background image I think [16:19:12] https://usercontent.irccloud-cdn.com/file/asSwbLUV/image.png [16:19:29] it is the animated loader [16:19:51] the thing that changes color while navigating and clicking bunch of stuffs [16:20:54] it is so much used that it should be cached already [16:21:08] it's cached yes [16:21:14] for me that url opens this [16:21:17] https://usercontent.irccloud-cdn.com/file/Ubcyo7jr/image.png [16:24:19] oooo [16:28:22] how can upload photo here? [16:29:51] the api that loads images on the front page is the public api https://campwiz-backend.toolforge.org/api/v2/campaign/ [16:30:01] I use irccloud.com client, it does it itself [16:30:04] it takes 1.81s here [16:30:33] that url takes 0.5s for me [16:31:03] is it some kind of proximity thing?
[16:31:14] my droplet was located in india [16:31:37] might be, our hosting is in eqiad datacenter only, no CDN [16:32:06] (that's Ashburn in the US) [16:32:27] we also did not have api, but i think being in my neighboring country was the reason behind performance [16:32:42] I turned off my cache [16:32:50] changed group settings, try again (re @nokibsarkar: how can upload photo here?) [16:33:15] cc bd808 (re @jeremy_b: changed group settings, try again) [16:34:03] https://tools-static.wmflabs.org/bridgebot/73eaf9c3/file_70673.jpg [16:34:10] using the VPN through india jumps the loading time of that page to 3s [16:34:14] this is without cache [16:34:21] are your users all in a particular region? [16:34:33] mostly indian (re @jeremy_b: are your users all in a particular region?) [16:35:03] but a few from Africa as well [16:35:24] 42s is a lot yep [16:35:51] I'm 400km from Ashburn now [16:37:47] from ghana the speed is quite better than india too :/ [16:38:36] picking up any twi while you're there? [16:38:37] ivory coast is fast too [16:38:48] oh this is all VPN? [16:38:53] you tricked me! [16:38:56] did u try disabling cache? [16:39:38] jeremy_b: yep sorry :), all jumping with the VPN [16:39:51] I have the cache disabled yep [16:40:34] bangladesh also jumps to over 2s for the api call :/, I think that might be proximity yes [16:41:17] (hopefully a little bit better as my traffic has to go all the way there before going to the datacenter and back, so i'm doing twice the distance) [16:41:56] But, is it possible that toolforge backend with the nginx we are using is adding some delay of 1s? [16:42:13] I don't think so, or I would be seeing that too [16:42:34] (maybe I just have been lucky so far though) [16:42:57] Toolforge is all served from the Ashburn DC with no edge caching anywhere. So speed of light and network speed to Ashburn is going to be baseline latency. [16:43:13] Hi. Is there a phab task for supporting Toolforge Build Service images for ARM64 platform?
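bd808's "speed of light to Ashburn" point can be sanity-checked with rough numbers. The distance and fiber speed below are ballpark assumptions, not measurements:

```shell
# Light in fiber travels at roughly 200,000 km/s (~2/3 c); the great-circle
# distance South Asia <-> Ashburn is on the order of 13,000 km.
awk 'BEGIN {
  km = 13000; v = 200000
  printf "theoretical RTT floor: %.0f ms\n", 2 * km / v * 1000
}'
```

So every single round trip from that region to a US-only deployment costs at least ~130 ms even on a perfect route, and a page that needs several sequential round trips multiplies that; real routes with detours and queueing explain the observed 400+ ms pings.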
[16:43:29] DaxServer: we don't have any ARM servers at the moment [16:43:44] DaxServer: I don't think so, but yep :), you would not have where to run it in toolforge [16:44:20] because, even with caching (the campaign list is loading 2s slow) [16:45:23] DaxServer: The Toolforge build service is currently imagined for running code on Toolforge and not as a generic container creation and management system. That makes ARM image support relatively difficult to prioritize. [16:46:18] nokibarkar: that might be just network latency, that's the bump I see more or less when connecting through the VPN (difference between connecting directly from switzerland vs india) [16:47:16] nokibarkar: can you ping api.svc.toolforge.org? [16:47:23] I get ~400ms [16:48:30] and even if it was supported then how do we test that it's working properly without a place to run the images? (re @wmtelegram_bot: DaxServer: The Toolforge build service is currently imagined for running code on Toolforge and not as a generic containe...) [16:48:48] (pyvenv) tools.backup-bot@tools-bastion-13:~/backups/CampWiz-NXT$ ping api.svc.toolforge.org -n 5 [16:48:49] PING 5 (0.0.0.5) 56(124) bytes of data. [16:48:51] ^C [16:48:52] --- 5 ping statistics --- [16:48:54] 11 packets transmitted, 0 received, 100% packet loss, time 10239ms [16:48:55] bd808 dcaro I have M1 Macbook. And Portable Antiquities Scheme website has their API behind Cloudflare which is rejecting requests from Toolforge and sending a human challenge (which is of course not possible as I'm not emulating a human there). This led to the scenario that a Commons bot I run had to be run from my computer. I could directly run the command from the project from terminal, but it also means I have to be wary of any changes that I make to the codebase, which is currently in active development. If there was an arm64 heroku builder, I could have built the image on my system and run the image, rather than running directly from active project.
Ref: [16:48:56] https://commons.wikimedia.org/wiki/Commons:Batch_uploading/Portable_Antiquities_Scheme [16:49:10] from within toolforge [16:49:24] from home not toolforge (re @nokibsarkar: (pyvenv) tools.backup-bot@tools-bastion-13:~/backups/CampWiz-NXT$ ping api.svc.toolforge.org -n 5 [16:49:24] PING 5 (0.0.0.5) 56(124) byte...) [16:49:58] nokibsarkar: can you paste one of the intermediate lines? where it says the single-request time [16:49:59] 64 bytes from api.svc.toolforge.org (185.15.56.11): icmp_seq=3 ttl=53 time=415 ms [16:50:21] or the last line with the stats [16:50:25] rtt min/avg/max/mdev = 414.092/414.714/415.432/0.551 ms [16:50:49] ping api.svc.toolforge.org -c 5 [16:50:49] PING api.svc.toolforge.org (185.15.56.11) 56(84) bytes of data. [16:50:51] 64 bytes from instance-tools-proxy-9.tools.wmcloud.org (185.15.56.11): icmp_seq=1 ttl=45 time=496 ms [16:50:52] 64 bytes from api.svc.toolforge.org (185.15.56.11): icmp_seq=2 ttl=45 time=575 ms [16:50:54] 64 bytes from instance-tools-proxy-9.tools.wmcloud.org (185.15.56.11): icmp_seq=3 ttl=45 time=457 ms [16:50:55] 64 bytes from toolforge.org (185.15.56.11): icmp_seq=4 ttl=45 time=455 ms [16:50:57] 64 bytes from api.svc.toolforge.org (185.15.56.11): icmp_seq=5 ttl=45 time=458 ms [16:50:58] --- api.svc.toolforge.org ping statistics --- [16:51:00] 5 packets transmitted, 5 received, 0% packet loss, time 4189ms [16:51:01] rtt min/avg/max/mdev = 454.697/488.143/575.166/46.070 ms [16:51:03] from home [16:51:13] that's quite good actually [16:52:21] this is from toolforge: PING api.svc.toolforge.org (172.16.18.101) 56(84) bytes of data. 
[16:52:22] 64 bytes from tools-proxy-9.tools.eqiad1.wikimedia.cloud (172.16.18.101): icmp_seq=1 ttl=63 time=1.49 ms [16:52:24] 64 bytes from tools-proxy-9.tools.eqiad1.wikimedia.cloud (172.16.18.101): icmp_seq=2 ttl=63 time=1.21 ms [16:52:25] 64 bytes from tools-proxy-9.tools.eqiad1.wikimedia.cloud (172.16.18.101): icmp_seq=3 ttl=63 time=0.511 ms [16:52:27] 64 bytes from tools-proxy-9.tools.eqiad1.wikimedia.cloud (172.16.18.101): icmp_seq=4 ttl=63 time=0.629 ms [16:52:28] 64 bytes from tools-proxy-9.tools.eqiad1.wikimedia.cloud (172.16.18.101): icmp_seq=5 ttl=63 time=0.367 ms [16:52:30] --- api.svc.toolforge.org ping statistics --- [16:52:31] 5 packets transmitted, 5 received, 0% packet loss, time 4018ms [16:52:33] rtt min/avg/max/mdev = 0.367/0.840/1.488/0.431 ms [16:52:51] that should be localhost [16:52:52] yep, that's from the same local network [16:52:59] from toolforge isn't interesting. it's from the same building [16:53:00] (it's not localhost though) [16:53:15] ok [16:53:34] nokibsarkr [16:53:37] oops [16:53:51] nokibsarkar: can you try `curl -k -o /dev/null -w '\n* Response time: %{time_total}s\n' https://campwiz-backend.toolforge.org/api/v2/campaign/` from your machine? [16:54:23] % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed [16:54:24] 100 4117 0 4117 0 0 2822 0 --:--:-- 0:00:01 --:--:-- 2823 [16:54:25] * Response time: 1.458476s [16:54:53] that for me takes ~0.4s [16:54:59] * Response time: 0.315861s [16:55:01] The API call takes 1.4s from my network [16:55:13] DaxServer: Can you just use a Dockerfile to build your local container? Would you be interested in help doing that if it sounds like it would work but you need some help figuring out the specifics? [16:56:00] ill join in, campwiz backend, i get 0.40s - 0.43s ish [16:56:11] I have an M3 macbook rather than an M1, but modern Docker Desktop supposedly can run AMD images on either most of the time. 
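The `rtt min/avg/max/mdev` summary line that ping prints can be recomputed from the per-packet times; mdev is essentially the population standard deviation of the RTT samples. Using the five whole-ms values from the home ping above:

```shell
# Recompute ping's summary from its per-packet RTTs (in ms).
echo '496 575 457 455 458' | awk '{
  n = NF; min = $1; max = $1
  for (i = 1; i <= n; i++) {
    sum += $i; sumsq += $i * $i
    if ($i < min) min = $i
    if ($i > max) max = $i
  }
  avg = sum / n
  mdev = sqrt(sumsq / n - avg * avg)   # population standard deviation
  printf "rtt min/avg/max/mdev = %.0f/%.1f/%.0f/%.1f ms\n", min, avg, max, mdev
}'
```

This lands within rounding of the `454.697/488.143/575.166/46.070` line ping itself printed; the large mdev relative to a clean link is another hint of a noisy path.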
[16:56:50] nokibsarkar: it's interesting though that ping is fast, but curl is not [16:57:12] I turned on previous droplet [16:57:35] still redirecting [16:57:45] you could just run your own cdn in a droplet? varnish? [16:58:02] fronting toolforge [16:58:25] that should help [16:58:44] without cache : https://tools-static.wmflabs.org/bridgebot/2985ead0/file_70674.jpg [16:58:52] from previous droplet [16:59:11] can u try this: https://campwiz.wikilovesfolklore.org/campaign/c2c7piesolaf4 [17:00:19] ~3s [17:00:35] (no vpn, from switzerland) [17:00:43] you could even serve different DNS entries depending on geoip and send some people straight to toolforge. if toolforge allowed custom domain. (re @jeremy_b: you could just run your own cdn in a droplet? varnish?) [17:02:08] previous droplet feels like instant (figuratively) [17:02:22] hmm.. the droplet takes ~7s from california [17:02:49] so I think it's most probably jus the latency between US-India [17:02:54] *just [17:03:17] can u load the home page without cache? https://campwiz.wikilovesfolklore.org/ [17:03:18] is it HTTP 2? [17:03:40] bd808 Thanks. I'm using podman. I'll try to spin up an amd arch machine and will try to rebuild the image and see how it works! [17:03:41] can you maybe shift some things to lazy loading? [17:04:31] bd808: DaxServer: you could also try building locally with "pack", see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Building_container_images#Testing_locally_(optional) [17:04:41] nokibsarkar: from california it took me ~8s [17:04:51] you would need to replace tools-harbor.wmcloud.org/toolforge/heroku-builder:22 with the upstream heroku builder, which also exists for arm [17:04:54] most of the things are lazy loaded, some are server rendered (re @jeremy_b: can you maybe shift some things to lazy loading?) 
[17:05:26] dhinus: unfortunately, that will not build the exact same thing than the build service (we inject some fixes/stuff during build that pack does not) [17:05:43] if everything is client rendered then the initial loading time would've busted [17:06:02] dcaro: yes, depends if the tools requires those extra things or not... you could also run the tools-harbor image with AMD emulation with something like https://lima-vm.io/docs/config/multi-arch/ [17:06:33] DaxServer: https://podman-desktop.io/docs/podman/rosetta would in theory be the config you need to make AMD images work with Podman on an M1 mac. [17:07:26] bd808: yep, that should also work, I'm not sure if that works together with "pack", or if you would need to build a custom image [17:07:35] So, the public API for listing campaigns is running ~500ms : https://tools-static.wmflabs.org/bridgebot/ef673aee/file_70675.jpg [17:07:46] https://campwiz.wikilovesfolklore.org/api/v2/campaign/ [17:09:07] nokibsarkar: that takes 1.6s from california [17:09:27] without cache? [17:09:33] ~0.52 from switzerland [17:09:36] yep [17:10:10] so, what should be the conclusion? Who is the villain? my shitty code? or proximity thing? [17:10:23] bd808: dcaro: what about installing lima-kilo and running the build from there? overkill? :P [17:11:05] I'd say proximity adds the extra ~2s you see from india, then there's the other (not so critical) issue of downloading the original/big images [17:15:19] bd808 Thanks for the link. Looks like I have it enabled - probably correct working would be: it was enabled by default without my interventions [17:15:36] dhinus: maybe a bit yep xd [17:16:17] dhinus lima-kilo seems interesting read on the wikitech. Is someone already using it? [17:17:00] DaxServer: it's used daily by the toolforge developers, for testing toolforge itself [17:17:19] but actually once the image for your tool is built for AMD, and stored in Harbor... you shouldn't need it [17:17:28] yep. 
i am kinda skeptical about the decision now. but first i have to fix the big image thingy (re @wmtelegram_bot: bd808: dcaro: what about installing lima-kilo and running the build from there? overkill? :P) [17:19:02] @nokibsarkar my suggestion of lima-kilo was for DaxServer, not related to campwiz. there are two intertwined conversations, sorry for the confusion. [17:19:30] yeh, i understood, no worries [17:19:44] ghana<->india seems pretty fast too, I think that the main delay is US<->india, unfortunately toolforge is only in the US (for now) [17:20:25] i think one reason might be too much traffic goes through US? [17:20:26] probably if it was hosted in europe/africa, you'd get a better "mean" response time from both US and India [17:22:12] is public wikiusername considered PII? [17:22:44] Okay.. good news is that I figured out what was missing.. "platform: linux/amd64" in compose.yaml and that did the trick. Next to figure out is how to inject podman secrets (pywikibot creds). Thanks y'all :) [17:25:25] DaxServer: nice one! [17:26:39] nokibsarkar: by itself I think it's not [17:27:46] yeah "by itself" I'd say no, but the sensitive information might be that a certain user is using your tool. I'm not a lawyer though :) [17:27:47] After turning on redirection, i was stuck in a loading war with the browser (in toolforge domain), suddenly i noticed cache is disabled and it is taking about 4 minutes and still did not finish [17:28:25] that sounds like a different issue, nothing should take 4min [17:28:38] * dhinus has to go offline [17:29:00] one user asked me, if i store her username and she is from Europe, does it comply with GDPR. (re @wmtelegram_bot: yeah "by itself" I'd say no, but the sensitive information might be that a certain user is using your tool. I'm not a l...) [17:29:41] (different issue as in some kind of network issue, request lost waiting for retry, or something weird like that), what does your network tab say it's waiting for?
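For reference, the compose.yaml fix DaxServer mentions looks roughly like this (the service and image names below are made up):

```yaml
services:
  commons-bot:
    image: example.org/commons-bot:latest
    platform: linux/amd64   # force the amd64 image; runs under emulation on arm64 hosts
```

With `platform: linux/amd64` set, compose pulls and runs the amd64 variant even on an Apple Silicon host, relying on Rosetta/QEMU emulation underneath.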
[17:29:44] i know it's not a serious thing, but just in case, I asked [17:29:59] no, it loaded finally after 5 minutes (re @wmtelegram_bot: (different issue as in some kind of network issue, request lost waiting for retry, or something weird like that), what d...) [17:30:36] the same page that takes about 2s from my droplet (without cache) [17:30:55] if you retry still takes 5min? [17:31:06] yep [17:31:12] would be interesting to see exactly which resource it's stuck on [17:31:24] can you share the page? (so I can try) [17:31:40] without cache (like first time user) [17:32:01] just https://campwiz.toolforge.org/ while logged out then? [17:33:32] I'm getting ~2s, no cache, logged out [17:33:58] sometimes up to 3s, but mostly 2.something [17:34:30] I am stuck at telegram [17:34:43] oh, weird :/ [17:34:54] nokib: so can you share the timing of the network panel? like [17:34:59] https://usercontent.irccloud-cdn.com/file/dSvlHVgC/image.png [17:35:11] that should show which request is the slow one or if it's all of them getting slow [17:35:16] Slow mode is on, u cannot message multiple times, says on telegram [17:35:45] @nokibsarkar: https://wikitech.wikimedia.org/wiki/Wikitech:Cloud_Services_Terms_of_use#7.2_If_this_is_a_Toolforge_Project -- there is a carve out for on-wiki usernames collected because OAuth is in use, but yes usernames are considered PII in the general Wikimedia Privacy Policies. [17:36:42] yep, I was wondering about oauth too [17:36:46] thanks bd808 [17:37:23] The general Wikimedia Privacy Policy is very user-centric privacy conscious in ways that can be confusing to folks who are used to how the "normal" internet treats user data. [17:37:41] https://tools-static.wmflabs.org/bridgebot/667dffb7/file_70676.jpg [17:39:15] the same endpoint that used to take ~3s is somehow taking 1.5 minutes? [17:39:46] like, does the packet ask google map for navigation?
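When a single endpoint jumps from ~3 s to minutes like this, curl's per-phase timers help localize where the time goes (DNS vs TCP connect vs TLS vs server wait vs transfer). A small helper along the lines of the `-w` one-liner used earlier in the channel:

```shell
# Print a phase-by-phase timing breakdown for one request.
time_url() {
  curl -s -o /dev/null -w 'dns:         %{time_namelookup}s
connect:     %{time_connect}s
tls:         %{time_appconnect}s
first byte:  %{time_starttransfer}s
total:       %{time_total}s
' "$1"
}

# e.g.: time_url 'https://campwiz-backend.toolforge.org/api/v2/campaign/'
```

A big gap between connect and first byte points at the server or the path to it; a fast first byte but slow total points at transfer bandwidth or packet loss.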
[17:41:36] xd [17:42:07] nokibsarkar: that screenshot does not show the timings though [17:43:30] https://tools-static.wmflabs.org/bridgebot/932aa184/file_70678.jpg [17:44:04] each thing takes more than a second [17:44:40] something is going on there, it took 2s to download an svg file... [17:45:19] from my vpn it still takes under 6s from india [17:45:26] so might be your internet provider [17:45:27] takes 12.52 seconds for 1kb download [17:45:52] is this a joke? [17:46:22] can you try for example https://prometheus.svc.toolforge.org/tools/graph [17:46:22] that is also a static file probably cached by nginx [17:46:50] (that is not hosted in toolforge, but in the same datacenter, even if the domain seems the same) [17:48:18] still loading [17:49:09] https://tools-static.wmflabs.org/bridgebot/9a5a0091/file_70679.jpg [17:49:40] figures look veeeeeery nice. [17:50:23] 1.09 minutes. At this rate I think I would be the highest scorer even if I do not try [17:51:04] xd [18:00:33] nokibsarkar: does horizon.wikimedia.org also load that slow? (or the idp login page it redirects to) [18:02:53] still loading [18:03:21] 2 minutes, still loading [18:03:50] yep, that's in the US too :/, not in the same hardware, but same DC [18:04:10] there's some issue with your ISP and (at least) wikimedia network [18:04:52] (imo, or your local internet connection of some sort) [18:05:08] https://tools-static.wmflabs.org/bridgebot/989288ba/file_70682.jpg [18:08:10] bangla wikipedia loads very fast: : https://tools-static.wmflabs.org/bridgebot/fd423160/file_70683.jpg [18:08:28] and other things, like commons [18:08:42] without cache [18:09:58] must be something with `*.toolforge.org` [18:10:27] horizon.wikimedia.org is not under toolforge [18:10:30] https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue [18:10:56] thanks taavi!
that has a lot of debugging stuff yep [18:13:23] https://tools-static.wmflabs.org/bridgebot/5f2ce2d9/file_70684.jpg [18:15:55] !log tools restart tools-static nginx due to nfs hiccup [18:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [18:17:46] This is from france using vpn: : https://tools-static.wmflabs.org/bridgebot/27dcab02/file_70685.jpg [18:19:10] nokibsarkar: I'm right close to france :) (10km), and I get <3s [18:24:45] u guys really should consider a data center not within US. When I wanted to move to toolforge 2 years ago, everyone from my team was very skeptical. I anyway tried to move there and the performance was disastrous even with more RAM things. So, I retreated (in simple words ran away and begged WLF for a server). Then, through this 2 years of experience everyone in this region (India, Bangladesh, Pakistan, Sri Lanka, Nepal) said they prefer not to use any toolforge hosted tool if other options are available. [18:26:14] But on the other side of the world, people seemed satisfied (at least not annoyed by toolforge performance). [18:27:06] yep, we are currently only in one datacenter, we have been thinking on expanding to a different one, so that's definitely in our mind yes [18:27:49] can you open a task with some details about this? (the latency/network issues and such), that will help us in the future reference it when planning expansions and such [18:29:29] For wiki tools, our first choice should've been toolforge. But, we resort to self-host or other options at first because of poor performance. People do not understand that this is a proximity issue. The users are mostly non-technical, they only understand toolforge tools are very bad to work with. But we have to use it anyway, because, meh, what can we do? They are the bosses. This is a literal quote from one of my friends who first introduced me to WLF.
[18:30:31] that's really good feedback, and we should surface the proximity issue yes [18:31:41] Sorry for getting so frustrated. Since the first trial of the tool on toolforge, I was blaming my shitty code or poor choice of technology. [18:32:39] as a reminder: the page I linked above has the exact instructions on how and where to report this so that things can be improved [18:33:17] +1 for that yep, I'm trying to get some data from vpn but I don't see that issue [18:35:43] from the eqsin datacenter (singapore) things are fast too :/ [18:36:14] so yep, nokibsarkar, would be really useful if you write that task with the tests suggested there, that might allow us to pin-point the ISP/areas that are affected [18:39:44] gtg. might be around later, nokibsarkar thanks a lot for your patience and your efforts! [20:19:33] which is why IMHO it would make perfect sense to install `mysqldump` directly on the bastions so we can stream the dumps right over SSH… (re @wmtelegram_bot: nokibsarkar: yep, you should not rely on toolforge nfs for backups, currently we don't have a backup solution, so you wo...) [20:19:54] (sorry to keep banging on this drum for a bit. probably I’ll soon enough forget about it again and just ignore that my toolsdb backups have been broken for a year now :|) [20:20:46] @lucaswerkmeister: have you tried streaming through `webservice shell -- mysqldump ...` sort of command chaining? [20:21:16] can webservice shell be used non-interactively? o_O [20:21:22] (and is there a type that includes mysqldump?) [20:21:36] I tried to hack together a direct kubectl command and eventually gave up [20:22:50] (if `toolforge jobs run` had something like `mwscript-k8s`’ `--follow` that would probably work pretty well) [20:24:08] yes and yes, at least in theory.
We made a container just for mysql stuff like this and webservice at least used to be able to pass the inner command via the invocation [20:26:41] @lucaswerkmeister: if `webservice shell -- ...` doesn't work, then https://wikitech.wikimedia.org/wiki/User:BryanDavis/Kubernetes#Launch_an_interactive_shell_in_the_cluster should [20:28:13] https://docker-registry.toolforge.org/#!/taglist/toolforge-mariadb-sssd-base is the container to use [20:28:24] (with apologies to the IRC side) [20:28:26] ``` [20:28:27] tools.quickcategories@tools-bastion-13:~$ webservice foo shell [20:28:28] This --mount option is only supported for 'buildservice' type [20:28:30] Review the arguments passed and try again``` [20:28:50] could probably use an improvement in the case where the ā€œargumentsā€ come from the file instead of argv ;) [20:29:49] bd808: I know there’s a container, but I’m not sure there’s a webservice type corresponding to it? `webservice --mount=all foo shell` shows a list and none of them look like mariadb [20:29:58] maybe I’ll just use your kubectl command then [20:31:11] ah, that may be the case.
It should be fixable, but we may have kept punting in the hope that `toolforge shell` would become a thing [21:17:12] `ssh toolforge become quickcategories kubectl run interactive --image=docker-registry.tools.wmflabs.org/toolforge-mariadb-sssd-base:latest --restart=Never --command=true --labels=toolforge=tool --rm=true --attach=true -- mysqldump --defaults-file=~/replica.my.cnf --host=tools.db.svc.wikimedia.cloud --single-transaction 's53976__quickcategories'` [21:17:19] gives me *something* but not a proper SQL dump [21:17:22] it seems to start in the middle [21:17:32] might need to tweak some flags like TTY/no-TTY some more [21:18:12] (it also ends with ā€œpod "interactive" deletedā€ which I don’t really want in the stdout / .sql file ^^) [21:19:11] but it looks like the tool also has unrelated errors I should look into first [21:19:16] KeyError: 'getpwuid(): uid not found: 53976' [21:19:20] OSError: No username set in the environment [21:19:36] (`kubectl logs background-runner-6d4b45f48-52thd`) [21:19:41] (`quickcategories` tool) [21:19:54] SSSD acting up again, maybe? [21:20:16] is that the component responsible for turning UID numbers into user names? (I think that’s what the Python code is trying to do) [21:20:57] sssd mounting into the container would then provide the LDAP backed user lookups, yes [21:21:49] 53976 is tools.quickcategories [21:22:25] technically "uid" is never a number. 
"uidNumber" is, while "uid" is a string [21:22:33] on the LDAP side [21:23:11] "SSSD acting up" does sound like a good guess [21:24:17] `id 53976` just worked in the shell I launched, but that would be exec node specific if sssd is being flaky [21:25:18] @lucaswerkmeister: I'm not sure about your command using `--attach=true` vs `--stdin=true --tty=true` in my template [21:27:27] I thought attach made more sense because I don’t need a real TTY, but I’ll try that too [21:27:43] and yeah I tried to kubectl exec in one of the broken pods but it didn’t work (presumably they’re long dead) [21:27:53] I’m trying to deploy a new version now, maybe that’ll help [21:27:57] yeah, I'm not sure it is wrong, it is just not a thing I've used consciously [21:28:21] !log lucaswerkmeister@tools-bastion-13 tools.quickcategories deployed 1d0fb31941 (upgrade dependencies) [21:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL [21:28:54] After `become bd808-test` this actually seemed to work `kubectl run interactive --image=docker-registry.tools.wmflabs.org/toolforge-mariadb-sssd-base:latest --restart=Never --command=true --labels=toolforge=tool --rm=true --stdin=true --tty=true -- mysqldump --defaults-file=~/replica.my.cnf --host=tools.db.svc.wikimedia.cloud --single-transaction 's53976__quickcategories'` [21:29:25] hm, new version still seems to be having the same OSError issues [21:29:27] It does end with a spurious `pod "interactive" deleted` line from the session ending [21:31:00] looks like `kubectl exec -it` doesn’t work with a pod that’s in CrashLoopBackOff [21:31:16] no, it wouldn't [21:31:34] btw this image was built with `--use-latest-versions` – I wonder if that’s related šŸ¤” [21:31:44] crashloopbackoff means the pod is down and the scheduler is not going to try restarting for a while [21:32:00] I think it’s possible that it was actually broken ever since I used that, and I just didn’t notice / pay attention [21:32:25] 
the running `quickcategories` pod (powering the still-working webservice) is 45d old [21:32:39] /me describes the pod [21:33:08] yup, `kubectl exec -it quickcategories-9dc6df5d9-qrmct -- python3 --version` (the working pod) prints Python 3.10.12 [21:33:35] so I think my reported success at T381923 was premature, dangit :S [21:36:31] I think .python-version worked for me somewhere recently... [21:36:48] * bd808 goes to look at gitlab history [21:37:02] task reopened [21:37:25] yeah, I got it to work in https://gitlab.wikimedia.org/toolforge-repos/containers-bnc/-/commit/a937384ad626eb5902910906e37749348caa3757 [21:37:34] interesting [21:39:19] bd808: maybe your code doesn’t call getpwuid()? [21:39:33] according to the stack trace, the failing call in my code is buried in pymysql [21:40:24] so... yeah. buildservice containers don't have SSSD mounted ever I don't think. And if they do they don't have the /etc config files to make it all work. [21:41:22] I'm not sure that $USER ends up set either [21:41:28] (well, ā€œburiedā€ is an overstatement. `pymysql/connections.py` just tries to get the user name at initialization time for some reason 🤷) [21:41:49] but previously it worked as a build service… [21:41:53] probably so it can use ~/.my.cnf by default [21:41:56] /me looks at connections.py in the old container [21:42:54] aha, in Python 3.10 `getpass.getuser()` raises a `KeyError: 'getpwuid(): uid not found: 53976'` [21:42:58] which pymysql catches [21:43:10] in Python 3.13 that becomes an `OSError: No username set in the environment` [21:43:16] which pymysql doesn’t catch [21:43:27] jesus [21:44:27] ā€œChanged in version 3.13: Previously, various exceptions beyond just OSError were raised.ā€ https://docs.python.org/3/library/getpass.html#getpass.getuser [21:44:29] -.- [21:45:36] ok, so I need https://github.com/PyMySQL/PyMySQL/commit/a1ac8239c8 [21:45:42] which… isn’t in a published release yet?
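[editor's note] The getpass behavior change discussed above can be summarized in a small sketch. This is a hypothetical helper (the name `default_db_user` is ours, not pymysql's), showing the version-agnostic handling that pymysql's fix commit adds: Python ≤ 3.12 let the underlying `KeyError` from the getpwuid() lookup escape, while Python 3.13 normalizes the failure to `OSError`, so a robust caller catches both.

```python
import getpass


def default_db_user(fallback="anonymous"):
    """Best-effort local user name lookup, tolerant of missing passwd entries.

    In a container whose numeric UID has no passwd entry (e.g. SSSD not
    mounted), getpass.getuser() fails: Python <= 3.12 raised KeyError
    from getpwuid(), Python 3.13 raises OSError instead. Catching both
    keeps the code working on either interpreter. (Hypothetical helper
    for illustration; pymysql's own fix lives in connections.py.)
    """
    try:
        return getpass.getuser()
    except (KeyError, OSError):
        return fallback
```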
[21:46:10] yeah last release was a year ago [21:52:35] !log lucaswerkmeister@tools-bastion-13 tools.quickcategories deployed 74cd3dee83 (install PyMySQL from Git for Python 3.13 compatibility; CC T381923) [21:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.quickcategories/SAL [21:56:40] fixed (see https://phabricator.wikimedia.org/T381923#10850835 for details) [22:00:26] /me tries `--stdin=true --tty=true` for mysqldump now [22:03:39] so far I’m always getting various forms of truncated SQL [22:04:07] and also sometimes ā€œIf you don't see a command prompt, try pressing enter.ā€ at the beginning of the file [22:48:51] (left a comment with that information on T378882 so it’s not lost)
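[editor's note] The stray lines mentioned above ("If you don't see a command prompt, try pressing enter." at the top, `pod "interactive" deleted` at the bottom) are kubectl attach/TTY chatter mixed into the captured stream. Avoiding `--tty` (or redirecting inside the container) prevents it at the source; as a fallback, a small post-processing sketch (hypothetical helper, assumed noise strings taken from the log above) can filter a captured dump:

```python
def strip_kubectl_chatter(dump: str) -> str:
    """Remove kubectl attach/TTY noise from a captured mysqldump stream.

    When a dump is captured via `kubectl run --rm --attach`, kubectl may
    prepend its "If you don't see a command prompt ..." hint and append a
    'pod "interactive" deleted' line to stdout. This hypothetical helper
    drops any line starting with those known noise strings and keeps the
    rest of the SQL intact.
    """
    noise_prefixes = (
        "If you don't see a command prompt",
        'pod "interactive" deleted',
    )
    kept = [
        line
        for line in dump.splitlines()
        if not line.startswith(noise_prefixes)
    ]
    return "\n".join(kept) + "\n"
```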