[06:14:52] requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='api.svc.tools.eqiad1.wikimedia.cloud', port=30003): Read timed out. (read timeout=30)
[06:14:52] ERROR: Please report this issue to the Toolforge admins: https://w.wiki/6Zuu
[06:14:59] full traceback: https://phabricator.wikimedia.org/P54253
[06:38:26] seems reproducible if I just run `toolforge jobs restart rusty` on the dbreps tool
[08:50:14] Hello! I am back, prepare for trouble, and make it double! I am here to ask for more advice about replica database access
[08:51:14] So, I made some progress, but now I have a problem finding the best solution (main objectives: low code complexity, low load on the servers)
[08:52:56] Context, again: I am building a tool to search keywords in revision comments. It is indeed an intensive query that can stress the DB, so I already planned to restrict the date range of the search to 3 months max. I will try to implement server/client caching too.
[08:54:31] The only problem is: the editor can choose the linguistic edition (frwiki, dewiki, eswiki, etc.). It is way easier to query each project's API than to handle DB connections to each database/shard. So my question: how would you handle multiple database instances instantiated dynamically?
[08:55:17] I tried to think about it, but since pooling is forbidden, I see no alternative to just opening a connection to the right host (I don't need to work on shards) and closing it when the user request is done.
[08:56:42] So at project build time, I can generate a static JSON file linking each linguistic edition to its host (enwiki <-> enwiki.analytics.db.svc.wikimedia.cloud) and then read this JSON file at each request to open a DB connection?
[08:58:42] My life would be much simpler if one big multi-tenant database existed, but that is not the case, and the shard split is not really helping me since the app allows querying every linguistic edition.
[09:00:39] Meh, the JSON file is even superfluous, I believe; the idea was meaningful when I was thinking about using shards, since a project could move to another shard.
[09:02:04] A secondary problem would be that... if I have 43 users on the app at the same time, it would open 43 connections? :/
[09:02:36] Waste of resources.
[09:03:26] you can try using singletons for the connections, and re-opening them when there's a user request but the connection is closed
[09:04:40] Still a bit new to the Node.js world (I chose it to use Codex, the new Wikimedia UI package), but there is no real singleton in the JS world? Just a global variable?
[09:04:55] It could still work, yeah...
[09:05:41] But I don't really know what would happen if the connection is used at least two times at the same time. Guess it really depends on the framework used for queries.
[09:05:45] I'm not very familiar with nodejs either :)
[09:06:05] We're doomed :-(
[09:06:39] I guess my needs are a bit unorthodox anyway; that's why I am not finding a lot of points of view on the web about this problem.
[09:06:41] nah, just clueless ;)
[09:10:01] I am also a bit insecure about the project. Someone could easily search keywords on 100 linguistic editions and it would cause a mess on the replicas...
[09:10:30] I mean, I can check the number of concurrent requests from one IP, but then they could evade that by proxying, etc.
[09:10:53] To put it in a nutshell: a simple project is becoming complex!
[09:12:30] Meh, I could just count the number of connection instances open at the current time and abort any new request if overloaded...
[09:16:57] or just make them wait.
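A minimal sketch of the open-query-close approach described above, assuming the mysql2 driver (which comes up later in this log). The `replicaHost`/`queryWiki` helpers and the environment variable names are hypothetical; on Toolforge the real credentials come from the tool's replica.my.cnf. As noted at 09:00, the host can be derived directly from the database name, making the static JSON mapping unnecessary:

```javascript
// Sketch: open one connection per request to the right replica host,
// run the query, and always close the connection afterwards.
const mysql = require('mysql2/promise');

// Derive the replica host from the wiki's database name directly.
function replicaHost(dbname) {
  return `${dbname}.analytics.db.svc.wikimedia.cloud`;
}

async function queryWiki(dbname, sql, params) {
  const connection = await mysql.createConnection({
    host: replicaHost(dbname),
    database: `${dbname}_p`, // Wiki Replicas expose each wiki's database with a _p suffix
    user: process.env.TOOL_REPLICA_USER,          // placeholder credential
    password: process.env.TOOL_REPLICA_PASSWORD,  // placeholder credential
  });
  try {
    const [rows] = await connection.execute(sql, params);
    return rows;
  } finally {
    await connection.end(); // close immediately: no idle connection left behind
  }
}
```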
[09:20:35] Reading a bit, you could try using a pool (assuming you are using https://github.com/mysqljs/mysql#pooling-connections) and make sure that after every query you do a `connection.destroy()`; that will make sure you never pass the limit of `connectionLimit` parallel connections. But I don't see an `idle timeout` type of parameter though :/
[09:21:01] We could, but:
[09:21:13] "Usage of connection pools (maintaining open connections without them being in use), persistent connections, or any kind of connection pattern that maintains several connections open even if they are unused is not permitted on shared MariaDB instances (Wiki Replicas and ToolsDB)."
[09:22:34] as long as you kill the connections, no connections will be persistent
[09:23:28] So using a pool, opening a connection, sending the query, and closing the connection just after is just like... letting the garbage collector clean up the instance when we destroy our used-once connection?
[09:23:34] nah?
[09:23:41] the problem is when you have idle connections to the DBs, no matter the language or the libraries you use, so if you make sure to `destroy` the connection every time you create one, no idle connection should be left (if no errors happen... stuff happens)
[09:24:36] kinda, but the logic of not passing X number of parallel connections, creating a new connection when needed, and queueing requests is already done by the pooling object
[09:25:27] Interesting way
[09:25:33] (that solves your issue about many users doing many queries at the same time, as the actual number of parallel DB queries would be controlled by the pool; the web requests will just wait for a slot to run their query)
[09:25:51] Are we sure that a singleton is the right pattern with this approach?
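A sketch of the pattern suggested at 09:20, using the mysqljs/mysql pool linked above: the pool caps parallel connections at `connectionLimit` and queues extra requests, while `connection.destroy()` after each query guarantees no idle connection survives. Destroying instead of releasing trades away connection reuse, but it is what keeps the pattern within the no-idle-connections rule quoted at 09:21. Host, database, and credential values are placeholders:

```javascript
const mysql = require('mysql');

// Module-level pool: Node caches modules, so every part of the app that
// require()s this file shares the same pool and the same connection limit.
const pool = mysql.createPool({
  connectionLimit: 10, // at most 10 parallel DB connections; extra requests queue
  host: 'enwiki.analytics.db.svc.wikimedia.cloud', // placeholder; a real app would keep one pool per wiki host
  database: 'enwiki_p',
  user: process.env.TOOL_REPLICA_USER,          // placeholder credential
  password: process.env.TOOL_REPLICA_PASSWORD,  // placeholder credential
});

function query(sql, params, callback) {
  pool.getConnection((err, connection) => {
    if (err) return callback(err);
    connection.query(sql, params, (err, results) => {
      connection.destroy(); // hard-close the socket instead of returning it to the pool idle
      callback(err, results);
    });
  });
}
```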
[09:26:20] would be really nice if it had an "idle timeout" kind of parameter; that way you could set it to a very low value and kill your idle connections shortly after they become idle
[09:26:34] yep
[09:26:56] the pool itself should be shared to keep track of the parallel connections, otherwise you would create a pool for each web request with its own connection limit
[09:27:55] I guess you could use a different pattern too, but you'd need a way to share the pool between web connections
[09:28:23] (╯°□°)╯︵ ┻━┻
[09:30:21] Since my young ages I've been fed design patterns, the GoF ones from the Java world, but today you come to understand that these patterns are just workarounds to fill semantic gaps within the popular languages; it is always hard to think about the right choices
[09:31:22] Antipatterns are antipatterns themselves too :D
[09:31:26] this one seems to support idle timeout: https://github.com/sidorares/node-mysql2#using-connection-pools
[09:31:51] Cool, it is the driver I am using
[09:33:35] so in theory, if you set an idle timeout quite low (say, a second), it should be OK on the Wiki Replicas side, as any idle connection would be terminated quickly, and that would allow you to reuse connections when they are requested quickly one after the other
[09:34:07] you might want to test it to make sure, just in case, but that's what I understand from the docs
[09:34:16] Seems the best approach, yeah
[09:34:43] Just need to think about the singleton; it will make tests harder to write since it is shared
[09:34:47] thanks for the help
[09:34:52] np
[09:43:20] legoktm: yep, I see the log in the jobs-api, and the reply took 31s (when the timeout is 30s): `[Wed Dec 6 06:12:10 2023] POST /api/v1/jobs/rusty/restart => generated 3 bytes in 31486 msecs (HTTP/1.0 200) 2 headers in 70 bytes (1 switches on core 0)`
[09:43:45] is it still happening? (can I try restarting the job?)
[09:45:19] there have been 8 requests that timed out in the last 3 days or so, it seems
[09:47:43] wait, no, there's more xd
[09:47:55] most of them are `logs --follow`, so those are expected to be long
[10:07:51] !log commtech decom `Commtech-Wiki-002`
[10:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Commtech/SAL
[14:28:32] !log paws pywikibot to 8.6 T352794
[14:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[14:28:37] T352794: New upstream release 8.6 for Pywikibot - https://phabricator.wikimedia.org/T352794
[14:40:29] dcaro: I feel like most of the slowness in this case is coming from the CLIs, not the APIs
[14:41:34] I think that's just the `logs --follow` requests; those are streamed, so they might take as long as they are streaming logs
[14:44:18] I mean, it should not take 3.5s to run `toolforge jobs --help`
[14:45:49] xd agree
[14:46:26] and that number magically turns into 0.5s if I log in as root, which bypasses the systemd user resource controls
[14:47:48] that's interesting
[14:48:25] yep, I think we've known that the resource control is making things slower, but didn't know it did that much
[14:49:34] it seems too much, is the node under stress or something?
[14:49:56] real 0m3.577s
[14:50:01] it took me 3.5s as root
[14:50:55] from sudo, that is; without sudo it takes 0.5
[14:51:04] (sudo -i)
[14:51:04] no, the load factor on the bastions is very low (especially after I killed some nfs-stuck processes on -10). I think we could raise the current CPUQuota=30% by a large factor and things should still be fine; there's unfortunately no burst option, it seems
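Going back to the Wiki Replicas thread from earlier this morning (09:26–09:34), here is a sketch of the approach the conversation settled on: a mysql2 pool shared as a module-level object, with the pool's documented `idleTimeout` option set very low so idle connections are closed quickly but back-to-back requests can still reuse them. Node's module cache makes this a de-facto singleton, answering the earlier question about singletons in JS. The `poolFor` helper and credential variables are hypothetical, and, as said at 09:34, the timeout behaviour is worth testing against the replicas:

```javascript
// pools.js -- a sketch, assuming a mysql2 version whose pool supports
// the idleTimeout option. Everything that require()s this file shares
// the same Map of pools: a de-facto singleton without any global.
const mysql = require('mysql2/promise');

const pools = new Map(); // one pool per wiki, created on demand

function poolFor(dbname) {
  if (!pools.has(dbname)) {
    pools.set(dbname, mysql.createPool({
      host: `${dbname}.analytics.db.svc.wikimedia.cloud`,
      database: `${dbname}_p`,
      user: process.env.TOOL_REPLICA_USER,          // placeholder credential
      password: process.env.TOOL_REPLICA_PASSWORD,  // placeholder credential
      connectionLimit: 5, // cap on parallel queries per wiki; extra requests queue
      idleTimeout: 1000,  // close idle connections after ~1s, per the 09:33 suggestion
    }));
  }
  return pools.get(dbname);
}

module.exports = { poolFor };

// Usage sketch: const [rows] = await poolFor('frwiki').execute(sql, params);
```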
[14:51:29] you need to SSH in as root; logging in as your user and using sudo won't make a difference
[14:55:06] I think we could just try something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/980877/ and see what happens
[14:56:10] sure
[14:57:05] fyi, the straces look very, very similar; no extra path searching or LDAP calls or such, it seems
[14:57:19] so yep, resources might be the bottleneck
[14:59:58] Q: Which tag in Phabricator when there is a crash with PHP 8.2 but PHP 7.3 works fine in the Toolforge cloud?
[15:09:23] #toolforge
[15:29:14] my bot bothasava (which is run in toolforge-run) has died twice in a row. All I have is the word "Killed" in weekly.err. Can we tell what happened? thanks
[15:48:23] Hi! I'm having some issues migrating from the grid to the new mariadb k8s image. I left a summary at https://phabricator.wikimedia.org/T254636#9385420
[15:48:28] cc komla bd808
[15:48:47] I'm probably missing something obvious
[15:50:34] musikanimal: I guess on the bastions we provision a mariadb config file that defaults to --host=tools.db.svc.wikimedia.cloud, but we don't have that in the container image
[15:51:36] ooh, good guess! let me try adding that
[15:55:04] no dice :( and also an additional dependency I guess I missed:
[15:55:19] ERROR 2002 (HY000): Can't connect to local server through socket '/run/mysqld/mysqld.sock' (2)
[15:55:20] ./var/backups/backup.sh: line 30: /usr/sbin/exim: No such file or directory
[15:55:52] the exim stuff is used just for emails. That isn't critical, so I can remove that
[15:58:18] I can try using the Build Service instead, but as this is such a tiny backup script, I thought the vanilla Jobs framework would be better
[16:00:07] I'll have a deeper look at why the existing image does not work later; right now I need to head to the office for breakfast
[16:01:31] ok, thank you!
[16:44:17] @taavi: thanks & is this enough information? https://phabricator.wikimedia.org/T352886
[16:44:42] Wurgl: can you add the XML file you've been using for testing?
[17:23:16] help! my bot bothasava (which is run in toolforge-run) has died three times in a row. All I have is the word "Killed" in weekly.err. Can we tell what happened? thanks
[17:23:52] !help my bot bothasava (which is run in toolforge-run) has died three times in a row. All I have is the word "Killed" in weekly.err. Can we tell what happened? thanks
[17:23:52] If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-kanban
[17:29:49] @Kotz Just a guess: too little memory provided? Try starting with the option --mem 2G (or whatever memory you need)
[17:31:38] Wurgl I wouldn't think that should be the case; the same bot has been running with 4GB for months. I will upgrade to 6GB but I don't have a lot of hope. thanks
[19:40:25] !log tools.wikibugs Updated channels.yaml to: 9e1fe76a8be70ee8b6b90c65e9e14fc1481f877e remove PAWS from #pywikibot
[19:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL
[19:40:51] thanks taavi
[23:54:08] ooh, ty for the extra CPU taavi (and dcaro)
[23:55:22] legoktm: yw, we should probably have done that a while ago :D
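On the mariadb image issue above (15:50–15:55): the ERROR 2002 shows the client still falling back to the local socket, so the host default never took effect. A hedged sketch, assuming standard MariaDB client option files, of a config equivalent to what the bastions provision (whether the container image reads the usual default paths is an assumption to verify):

```ini
# my.cnf -- sketch only, not the actual bastion file: default the client
# to ToolsDB instead of the nonexistent local socket.
[client]
host = tools.db.svc.wikimedia.cloud
# If the image ignores the default config paths, load the file explicitly:
#   mysql --defaults-extra-file=/path/to/my.cnf ...
```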