[05:49:14] GitLab approval queue: I confirm that user '3MindedScholar' seems not to be a bot - part of a hackathon, I think
[05:50:00] (I have no permissions to approve GitLab users - thanks if you can click)
[06:13:25] Can someone take a look at this running job? It hangs, and I have no idea where, and no idea how to prevent such problems in the future. daily-29281120-9jkwh 1/1 Running 0 3h30m 192.168.165.60 tools-k8s-worker-nfs-53
[07:44:00] Is someone around with access to Logstash to verify T402660 is resolved? That would be extremely helpful.
[07:44:00] T402660: TypeError: MediaWiki\Extension\Math\WikiTexVC\MMLnodes\MMLmtext::__construct(): Argument #3 ($text) must be of type string, MediaWiki\Extension\Math\WikiTexVC\MMLnodes\MMLmn given, called in /srv/mediawiki/php-1.45.0-wmf.15/ext - https://phabricator.wikimedia.org/T402660
[08:00:29] wurgl: what's the tool?
[08:00:44] persondata
[08:01:37] It is the kind of hang that happens very seldom, every few months
[08:02:51] is it currently hung?
[08:03:35] @dcaro: It seems to be my fault. I started it again and it hangs at the same article, so this smells like my fault.
[08:05:05] ack 👍, I'll let you give it a look first; if it does not get unstuck, let us know and we can look at the infra level (it would be useful for debugging if you can leave the stuck process stuck if that happens).
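[Editor's note] A hedged sketch of how one might localize where a job like the one above hangs, assuming kubectl access to the cluster and that the container image ships a shell and procps (both assumptions; the pod name is taken from the log):

```shell
# Pod-level view: scheduling events, probe failures, OOM kills, restarts
kubectl describe pod daily-29281120-9jkwh

# Recent output with timestamps, to see where logging stopped
kubectl logs daily-29281120-9jkwh --timestamps --tail=100

# Inside the container: what is the process doing right now?
kubectl exec -it daily-29281120-9jkwh -- ps aux

# Kernel wait channel of PID 1; useful to spot e.g. NFS or I/O waits
kubectl exec -it daily-29281120-9jkwh -- cat /proc/1/wchan
```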
[08:18:53] @dcaro: WITH RECURSIVE Cat AS (SELECT page_title, page_id, 0 AS level FROM page WHERE page_title = 'Burnus' AND page_namespace = 14 UNION SELECT SubCat.page_title, SubCat.page_id, Cat.level + 1 FROM page AS SubCat, categorylinks, Cat WHERE SubCat.page_namespace = 14 AND cl_from = SubCat.page_id AND cl_to = Cat.page_title AND cl_type = 'subcat' AND Cat.level < 3) SELECT DISTINCT img_media_type FROM Cat, categorylinks, page as P, image WHERE cl_to = Cat.page_title AND cl_from = P.page_id AND P.page_namespace = 6 AND img_name = P.page_title
[08:20:20] I am not sure if the installed version of the database understands UNIQUE in the WITH part; it did not when I was developing that code
[08:22:00] this is running against which database?
[08:22:34] commons
[08:27:47] you mean the union?
[08:28:00] (there's no unique in the query you passed)
[08:29:29] Yes, there is no UNIQUE because it caused an error when I developed the code. UNIQUE is to prevent circular loops in this recursive walk through the data
[08:30:13] okok
[08:31:01] according to https://modern-sql.com/caniuse/with_recursive_(union_distinct) , and given that the version of mariadb we are running is 10.6 or higher (10.11 for the one hosting commonswiki), it should be supported
[08:31:31] But "select * from categorylinks limit 10;" <-- this hangs too on commonswiki_p
[08:57:32] dcaro: re the list - likely the same as T401861
[08:57:32] T401861: Enable SSL in Trove MariaDB - Trixie MariaDB client requires SSL but SSL is not enabled in the Trove server - https://phabricator.wikimedia.org/T401861
[08:58:51] yep, I think it's the same, the client seems to try ssl by default
[09:00:53] we might be able to add that option to the replica.my.cnf automatically on creation
[11:39:13] wurgl: that last select takes >10m from quarry (https://quarry.wmcloud.org/query/96896#), so I'm not surprised the previous query hangs
[11:39:39] there's also some replag going on https://replag.toolforge.org/
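[Editor's note] The option alluded to at [09:00:53] is presumably a client-side TLS toggle. A minimal sketch of what an automatic replica.my.cnf addition could look like, assuming the standard MariaDB boolean option negation; the exact option name and the client's default behavior vary by client version, so treat this as illustrative only:

```ini
[client]
# Hypothetical workaround for T401861: newer (Trixie) MariaDB clients
# attempt TLS by default, but the Trove server has SSL disabled.
skip-ssl
```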
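[Editor's note] For readers puzzled by the UNIQUE/UNION exchange above: in a recursive CTE, `UNION` (distinct) discards rows that were already produced, which is what stops infinite recursion when the category graph contains cycles; the `Cat.level < 3` predicate in the quoted query is an additional depth bound. A minimal, self-contained sketch of the mechanism using SQLite and made-up data (table and column names only mirror MediaWiki's categorylinks):

```python
import sqlite3

# Toy category graph with a deliberate cycle A <-> B.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE categorylinks (cl_from TEXT, cl_to TEXT);
    INSERT INTO categorylinks VALUES
        ('B', 'A'),   -- B is a subcategory of A
        ('C', 'B'),
        ('A', 'B');   -- cycle: A is also a subcategory of B
""")

rows = db.execute("""
    WITH RECURSIVE cat(title) AS (
        SELECT 'A'
        UNION                -- distinct: already-seen rows are not re-queued,
                             -- so the A <-> B cycle terminates
        SELECT cl_from FROM categorylinks, cat WHERE cl_to = cat.title
    )
    SELECT title FROM cat
""").fetchall()

print(sorted(r[0] for r in rows))   # -> ['A', 'B', 'C']
```

With `UNION ALL` instead of `UNION`, the same query would recurse forever on this data (SQLite would keep re-deriving A and B), which is why the distinct variant matters here.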
[11:40:32] if you can open a task it will be easier to follow up
[11:47:22] > Is someone around with access to Logstash to verify T402660 is resolved. That would be extremly !help ful.
[12:08:21] physikerwelt: added a note there
[12:08:48] but take it with a grain of salt, as I'm not familiar with mediawiki logs
[14:44:23] dcaro: thank you
[15:16:06] wurgl: dhinus took a look and got the DB unstuck; things should be way faster now, though there's quite some replag, so the DB might take some time to catch up
[15:24:17] Everything is fine
[15:26:43] The variable lock_wait_timeout in mariadb is a little bit large. 86400 is one day; you may consider a smaller value, 1 hour is surely enough
[15:47:09] wurgl: innodb_lock_wait_timeout is 50, and I expected that one to apply
[15:48:23] according to "show processlist" my statement was blocked with "Waiting for table metadata lock" (in column "State")
[15:49:43] And this metadata lock was caused by some change in the index https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance
[15:51:36] yes, I could definitely see many queries in that state, and with waits bigger than 50
[15:51:48] Maybe this metadata lock is not in innodb?
[15:52:04] I'm reading the docs but it's not clear
[15:52:10] I'll open a task
[15:54:34] I tried setting it to 10 on the command line. Then tried the statement "select * from categorylinks limit 10;" and after 10 seconds I got an error.
[15:54:34] the root problem was that some queries were causing the lock... even if the timeout worked correctly, the table would still be locked
[15:55:54] the non-waiting queries were in state "creating sort index", plus there was a big "ALTER TABLE" running (a column was dropped in prod)
[16:14:57] my understanding is that two very long SELECT queries were preventing the ALTER TABLE from completing, and that in turn blocked all the other queries on the categorylinks table
[16:15:30] I'm not 100% sure, but in any case a lock_wait_timeout would not help, I think
[16:18:01] possibly it would have killed the ALTER TABLE after 1 hour, but that was also not great
[16:19:37] the 2 running queries that I manually killed to get the server unstuck would have been killed by the wmf-pt-kill script at some point (after 3 hours)
[16:21:49] the same queries were in fact killed many times over the past few days, so we should try to identify where they come from and stop them :)
[16:22:00] I can only say this: before, "select * from categorylinks limit 10;" did not finish. After "SET lock_wait_timeout = 10;" it finished with an error after ~10 seconds.
[16:22:26] yes, there were many queries that were stuck for hours
[16:22:40] I'm trying to get to the root cause of what caused them to be stuck :)
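[Editor's note] The mechanism discussed above can be illustrated with a toy analogy. The sketch below uses SQLite, not MariaDB, purely to demonstrate the shape of the problem: a long-open read transaction blocks a schema change until the writer's lock timeout expires, just as the long SELECTs blocked the ALTER TABLE on categorylinks. In MariaDB, `innodb_lock_wait_timeout` (default 50) governs InnoDB row locks, while `lock_wait_timeout` (default 86400) governs metadata locks, which is why the former never fired here. All names and values in the snippet are illustrative.

```python
import sqlite3, tempfile, os

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Connection 1 creates the table, then plays the "very long SELECT":
# an open read transaction that is never committed.
reader = sqlite3.connect(path, isolation_level=None)  # autocommit mode
reader.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
reader.execute("INSERT INTO categorylinks VALUES (1, 'Burnus')")
reader.execute("BEGIN")
reader.execute("SELECT * FROM categorylinks").fetchall()  # holds a read lock

# Connection 2 plays the "ALTER TABLE"; its connect timeout (0.5 s) stands
# in for lock_wait_timeout. It waits, then gives up with an error.
writer = sqlite3.connect(path, timeout=0.5)
try:
    writer.execute("ALTER TABLE categorylinks ADD COLUMN cl_type TEXT")
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

print(blocked)  # -> True
```

Note that, as observed in the log, a shorter timeout only makes the waiter fail faster; it does nothing about the query that is actually holding the lock.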