[04:33:36] Going to start with s6 switchover
[04:43:31] elukey: For https://phabricator.wikimedia.org/T371132 I am off next week, do you want to get db1179 (the only one which is in production) done today so you aren't blocked on me for the databases part?
[04:55:35] s6 is done, I am going to do s8 now
[05:55:15] set global binlog_format=ROW;
[05:55:20] sorry, wrong window
[06:06:18] Can I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1058295 - this is for Jaime's new backup hosts. He asked to get them ordered and installed while he's out.
[06:06:54] https://phabricator.wikimedia.org/T371416 here he mentions the recipe we need to use
[06:46:08] marostegui: holaaaa
[06:46:22] elukey: holaaaaaaa guapo!
[06:46:49] definitely yes, the cookbook reboots the host afaics, so once you have time and you want to depool I'll be ready
[06:47:24] Sure, I will do it now
[06:50:09] elukey: you can proceed anytime - I have also downtimed it (1h)
[06:51:30] backup CR done
[06:52:17] thank you volans - I am sending the fix
[06:57:18] marostegui: ack thanks!
[06:57:22] so db1179 right?
[06:57:27] elukey: yeop
[06:57:29] yep
[07:01:31] started!
[07:05:08] elukey: 🤌
[07:07:30] lol
[07:15:19] I see the host is back \o/
[07:15:38] I will wait for the green light though, to start mariadb etc
[07:20:24] marostegui: db1179 done!
[07:20:31] checked and it was rebooted
[07:20:42] thanks a lot!
[07:21:03] I'll start also all the others
[07:36:52] Thanks elukey !
[09:15:34] clouddb1019 is taking a long time to apply the s4 schema change (clouddb1015 did it in 24 hours, clouddb1019 is taking 3+ days)
[09:15:55] I think this is connected to T367778, still not clear if it's due to user load or to something wrong with that server
[09:15:56] T367778: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778
[09:16:10] I'll try depooling clouddb1019 for a while and send all the traffic to clouddb1015
[09:18:50] dhinus: There are lots of variables on why that could be, including the innodb buffer pool size efficiency based on the queries the host gets
[09:19:41] yep, pretty hard to draw a clear conclusion. I will depool for now and see how the other server handles the traffic
[09:20:07] sounds good
[09:20:29] I will temporarily increase the pt-kill threshold on clouddb1015 to match the "analytics" threshold used in clouddb1019
[09:21:43] yeah mean you'll allow queries to run for longer?
[09:22:23] *you mean
[09:22:30] yep, otherwise analytics queries currently going to 1019 will be killed after 5 mins if they run on 1015
[09:22:46] Yep
[09:23:50] I will set the threshold to 10800 in /etc/default/wmf-pt-kill on 1015, so it should behave just like 1019
[09:24:26] the schema change is already applied on 1015 and replag is at 0
[09:31:30] dhinus: sometimes it has to wait for a table lock, make sure "show slave status" is in the "alter table" state
[09:32:01] it's on "copy to tmp table"
[09:32:17] that's good too
[09:32:37] I've seen replication get held up for days in one case (it was clouddb1021 I think) because queries were constantly locking the table
[09:34:39] > 11 | system user | | commonswiki | Slave_SQL | xxx | copy to tmp table | ALTER TABLE revision CHANGE rev_id rev_id
[09:34:54] it just needs time
[09:35:28] yep, but it's taking 3x the time it took on the other server... hopefully now that I depooled it, it should be faster
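A rough sketch of checking that in-flight schema change from the mysql client on the replica (generic commands, not necessarily the exact ones used here; the pasted processlist row above is from the real session):
    SHOW SLAVE STATUS\G      -- per the discussion above, the SQL thread state should show "alter table" or "copy to tmp table"
    SHOW FULL PROCESSLIST;   -- the "system user" replication thread shows the same state plus the running ALTER TABLE statement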
[09:36:34] yeah, it'll help
[12:27:15] here's a "fun" Grafana issue that hopefully is rare enough that I'm the only one who encountered it: T367778
[12:27:16] T367778: [wikireplicas] frequent replag spikes in clouddb hosts - https://phabricator.wikimedia.org/T367778
[12:27:21] sorry wrong task
[12:27:30] T371485
[12:27:34] T371485: Grafana MySQL charts can be inconsistent when zooming out - https://phabricator.wikimedia.org/T371485
[12:27:39] cc godog
[12:28:52] dhinus: ack thx, will take a look
[12:36:35] dhinus: what should be the right tags for this https://phabricator.wikimedia.org/T371486 ?
[12:36:50] And same for https://phabricator.wikimedia.org/T371488
[12:36:55] It is not something we (DBAs) do
[12:37:02] It would be either wmcs or data-engineering
[12:41:21] I'd say #data-services in the "wikireplicas" column
[12:41:58] dhinus: Just that one? No team tag?
[12:42:15] you can add "cloud-services-team" but I'll see them anyway
[12:42:24] ok :)
[12:42:30] we're considering auto-adding the team tag for all tasks related to things that we manage
[14:16:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db2148:9104 has too large replication lag (15m 0s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2148&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[14:17:30] volans: https://phabricator.wikimedia.org/T369654#10032068 all those hosts are bookworm
[14:17:38] Checking db2148
[14:18:12] jeez another index corruption, I will fix it and ping mariadb
[14:19:30] > Last_SQL_Error: Error 'Index for table 'pagelinks' is corrupt; try to repair it' on query. Default database: 'idwiki'. Query:
[14:19:36] I am fixing it yes
[14:19:40] I will upgrade to .18 too
[14:19:48] Is it depooled?
[14:19:50] I can do it
[14:20:04] it is
[14:20:06] I just did it
[14:20:14] thanks
[14:20:19] I was too slow :(
[14:20:46] marostegui: re T369654#10032068: I know, but the issue exists only because we're still supporting the possibility to reimage into buster and puppet 5 :D
[14:20:47] T369654: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654
[14:20:56] volans: ah right right
[14:21:02] You gave me a heart attack
[14:21:14] :D
[14:21:16] sorry
[14:29:58] Hi folks! On beta, I'm getting the following error when querying a table for a new feature: "ERROR 1728 (HY000): Cannot load from mysql.proc. The table is probably corrupted" Would someone be willing to look into this or tell me where to look?
[14:31:02] corrupted tables are fun
[14:31:17] try running "optimize table "
[14:31:19] Daimona: I don't even have access to beta but you can try a mysql_upgrade --force
[14:31:47] FTR, it's reproducible with `sql wikishared` and then: SELECT ceil_id,ceil_name,ceil_status,ceil_created_at,ceil_user_id,COUNT (ceilu_id) AS `ceil_editor_count` FROM `ce_invitation_lists` LEFT JOIN `ce_invitation_list_users` ON ((ceil_id=ceilu_ceil_id) AND (ceilu_score >= 25)) WHERE ceil_wiki = 'metawiki';
[14:32:34] optimizing didn't do anything
[14:36:06] As for mysql_upgrade, I've no idea if I need to specify other options, run it on a specific host, or whatever
[14:36:34] Daimona: On the database host, just as root
[14:38:08] I don't know what that host is, and what the password is (I'm sure someone explained this to me a few years ago, but...)
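A minimal sketch of how the system table could be sanity-checked by someone with root access on the beta database host (generic checks, not the steps actually taken in this conversation):
    CHECK TABLE mysql.proc;          -- reports corruption or an unreadable row format
    SHOW CREATE TABLE mysql.proc;    -- lets you compare the definition against what the running server version expects
    -- if the definition turns out to be stale after a version upgrade, mysql_upgrade --force (run from a shell as root) rebuilds the system tables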
[14:38:34] :( I don't have access to beta so I cannot help much
[14:39:11] I can take a look once I'm done with the checklist runbook
[14:40:13] Wait a sec
[14:40:25] For what it's worth, we don't maintain beta :)
[14:40:40] It seems to be caused by an extra space after `COUNT` in the query???!!!
[14:40:54] woot
[14:41:06] SELECT COUNT (ceilu_id) FROM `ce_invitation_list_users`; <-- triggers the error
[14:41:19] SELECT COUNT(ceilu_id) FROM `ce_invitation_list_users`; <-- works just fine
[14:41:31] that makes no sense :-/
[14:41:40] Daimona: can you run analyze table mysql.proc ?
[14:42:58] Permission denied and I don't know how to fix that per above :O
[14:43:07] right...
[14:43:29] I wonder if this is just part of a general corruption of the system tables after an upgrade, or a half-finished one
[14:56:59] It seems like this might be intended behaviour according to the note in https://dev.mysql.com/doc/refman/8.4/en/functions.html. I didn't find an equivalent page for mariadb but I guess it doesn't matter
[14:59:00] I imagine that with the space, the parser hallucinates and thinks I'm querying some strange table that doesn't exist or something. Because saying "Heeeeelp the table is corrupted!" is not quite on the same level as "bozo, you have an extra space in your query".
[15:00:55] So, I guess I'll add this to my list of ridiculously unhelpful MySQL error messages and update the code. Still, thanks y'all for assisting!
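A minimal reproduction of the parsing behaviour worked out above (table name taken from the session; the last statement only illustrates the IGNORE_SPACE mode described in the linked MySQL docs and is not a recommendation):
    SELECT COUNT(ceilu_id) FROM ce_invitation_list_users;    -- no space: parsed as the built-in aggregate, works
    SELECT COUNT (ceilu_id) FROM ce_invitation_list_users;   -- with the space the name is apparently resolved as a stored function instead,
                                                              -- which is why the server goes looking in mysql.proc and reports it as corrupted
    SET SESSION sql_mode = CONCAT(@@sql_mode, ',IGNORE_SPACE');  -- with IGNORE_SPACE enabled both spellings resolve to the built-in aggregate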