[07:49:48] Amir1: can I run the schema change on s6 *DC master* in codfw?
[09:08:50] federico3: go for it. Just run it with --check first to make sure it's running on master of codfw only
[09:09:18] ok, after s6 can I move on to the next ones? We can schedule the runs for today
[09:10:55] my request would be to prioritize codfw and codfw masters, some might need switchovers
[09:11:28] let's make sure all schema changes we have in flight are done in codfw and codfw masters https://phabricator.wikimedia.org/project/board/1060/?filter=oIMnKhOjALmY
[09:11:44] yes I'm talking about codfw masters: can I run s5 and then s2?
[09:12:33] directly on master? I need to check how big the tables are, one sec
[09:13:01] (with --dc-masters option you mean)
[09:13:24] I just ran check on s5 and s2 and they were quick, while s3 has a lot of dbs
[09:13:43] yes using --dc-masters
[09:14:10] ok s3 is already done, so we need s5 s2 s4...
[09:14:40] s4 definitely needs switchover
[09:14:41] and s1
[09:14:56] and s1 too, but s2 or s5 might be okay with running directly on master
[09:16:07] if you can show what kind of check you run to tell if something needs switchover or not I can add it to the wiki
[09:16:26] basically how it works: the biggest table on that cluster (usually the biggest wiki there, e.g. s5 would be dewiki) shouldn't be so big that the alter would take more than a minute on that table. Otherwise, actual writes start to pile up. That's why we can run on s3 almost all the time: it takes a long time, but it's just 900 small wikis so the writes don't pile up
[09:16:27] also I can do the switchover for s4
[09:16:57] otoh, s4, s8 and s1 should by default be treated as needing a switchover, as their tables are huge
[09:17:10] yeah, doing switchover on s4 would be amazing
[09:17:35] I'm about to do s8 on eqiad
[09:19:13] ok can I start https://phabricator.wikimedia.org/T404050 ? s4 is/was running "Drop rc_new from recentchanges table in wmf production (T402763) (ladsgroup)"
[09:19:14] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[09:20:34] ugh right, let me check that
[09:21:20] that's eqiad only, go for it
[09:23:55] once you're done with s4 codfw master, let me know, there are a couple of schema changes that I need to run on the old master of s4 too (dropping categorylinks columns for example)
[09:31:36] I wanted to run the repool cookbook on db2170 but accidentally ran the upgrade cookbook. It's nice to reboot hosts anyway
[09:36:33] PROBLEM - MariaDB sustained replica lag on s1 on db2170 is CRITICAL: 399 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2170&var-port=9104
[09:37:41] PROBLEM - MariaDB sustained replica lag on s1 on db2170 is CRITICAL: 181 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2170&var-port=9104
[09:42:41] RECOVERY - MariaDB sustained replica lag on s1 on db2170 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2170&var-port=9104
[09:46:03] well, at least it's harmless
[10:11:51] doing s8 switchover
[10:11:53] Amir1: s4 DC master has been flipped.
[10:11:53] in eqiad
[10:12:17] federico3: Awesome. After I'm done with s8 in eqiad, I'll do the schema changes on the old master of s4 in codfw
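A minimal sketch of the kind of size check described at 09:16:26, to gauge whether a direct alter on a section master is safe. It only approximates "how long the alter would take" by looking at on-disk table sizes via information_schema; the host, wiki and threshold below are illustrative assumptions rather than the actual rule, and the real decision also depends on write rate and on the specific ALTER being run.

```python
# Hypothetical helper: look at the biggest tables of the biggest wiki on a
# section master and guess whether a direct ALTER is safe or a switchover is
# needed. Host, wiki and threshold are made-up examples.
import pymysql

SECTION_MASTER = "db2213.codfw.wmnet"   # example host, not necessarily current
BIGGEST_WIKI = "dewiki"                  # e.g. the largest wiki on s5
SIZE_THRESHOLD_GIB = 20                  # arbitrary cut-off used as a proxy for "alter takes > 1 min"

# Credentials are expected to come from ~/.my.cnf
conn = pymysql.connect(host=SECTION_MASTER, read_default_file="~/.my.cnf")
with conn.cursor() as cur:
    cur.execute(
        """SELECT table_name,
                  ROUND((data_length + index_length) / POW(1024, 3), 1) AS size_gib
           FROM information_schema.tables
           WHERE table_schema = %s
           ORDER BY (data_length + index_length) DESC
           LIMIT 5""",
        (BIGGEST_WIKI,),
    )
    rows = cur.fetchall()

for name, size_gib in rows:
    print(f"{name}: {size_gib} GiB")

largest = rows[0][1] if rows else 0
print("direct alter on master looks",
      "risky -> consider switchover" if largest > SIZE_THRESHOLD_GIB else "feasible")
```

This also matches why s3 is the exception in the log: lots of small wikis make the total run long, but no single table holds writes up for more than a moment.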
[10:12:32] ok
[10:13:12] can you check the list of schema changes you have and run the ones that are not done on s4 codfw old master? e.g. the abuse filter one
[10:13:49] can I flip another DC master in codfw? Or do we feel comfortable running the schema change on s2 and s5 without flip?
[10:14:39] I need to check the table sizes there
[10:14:46] but right now switching over s8 in eqiad
[10:16:19] ok, I'll update the status for 2025/change_afl_defaults_T401906.py
[10:24:30] Amir1: I can start the schema change on s4 in codfw *replicas* as they are still to be done
[10:25:00] go for it
[10:25:18] let's focus on getting codfw done as much as possible for all schema changes
[10:25:47] ok started now
[10:41:00] > Last_SQL_Error: Error 'Duplicate entry '32148341' for key 'PRIMARY'' on query. Default database: 'wikidatawiki'. Query:
[10:41:08] on the old master of s8
[10:41:19] I'm just going to clone it from another replica
[10:41:29] in eqiad?
[10:41:32] yup
[10:41:54] that would mean it'll get the schema changes automatically too :P
[10:52:28] to recap on codfw the schema change is needed on: s1 master, s4 master, s4 replicas (ongoing now), s5 master, s7 master, s8 master
[10:54:52] Amir1: can I flip another master in codfw?
[10:56:26] s5 sounds good to me
[11:14:11] taking a tiny break
[11:24:00] ok, flipping s5 codfw master
[11:55:02] the flip is done
[11:57:01] I can run the schema change on the replica db2213 (previously master), Amir1?
[12:04:06] Go for it
[12:06:41] ok starting
[12:28:30] jynus: for when you have time, several hosts in T399540 are backupX-codfw|eqiad clusters. The semi-sync bug is really nasty, so would you mind upgrading the backup hosts?
[12:28:30] T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update - https://phabricator.wikimedia.org/T399540
[12:29:29] Amir1: do we want to flip the masters in s8 and s7?
[12:29:49] I think it's good enough for today
[12:29:50] ah you did s8
[12:32:07] (correction: it was s8 in eqiad, but s8 in codfw still needs the master flip)
[12:38:56] I think those were already upgraded
[12:39:30] the 4 backup1- hosts are on 10.11.13
[12:39:40] isn't that enough=
[12:39:42] ?
[12:40:14] jynus: nope, let me grab you the correct version from debmonitor
[12:40:46] on 10.11 it should be 10.11.13+deb12u2 and not 10.11.13+deb12u1
[12:41:38] db2183 is already on u2
[12:44:44] and so is db1204
[12:45:40] Thanks for checking
[12:48:03] all backup1 hosts have 10.11.13+deb12u2
[12:50:05] and all backup sources too
[12:50:52] wait, wrong command, let me double check
[12:53:18] for backup sources there are 5 missing: db2197,db1225,db1239,db2199,db1245
[12:54:56] and from backup1-* sections, the only one missing is: db2184
[12:55:21] but both masters are upgraded
[12:56:29] I can do db2184 now
[12:58:22] thanks. For the backup sources, they won't become masters so that should be totally fine
[13:14:55] db2184 upgraded
[13:16:22] Thanks!
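For the replication breakage seen at 10:41 on the old s8 master, a minimal sketch of how such an error surfaces on a replica. The hostname below is a placeholder, not a host from the log; on the MariaDB versions discussed here the classic SHOW SLAVE STATUS output (SHOW REPLICA STATUS is an alias) carries the Last_SQL_Error field quoted above.

```python
# Minimal sketch: print the replication health fields of one replica,
# including Last_SQL_Error (e.g. the duplicate-key error seen at 10:41).
import pymysql

HOST = "db1234.eqiad.wmnet"  # placeholder for the old s8 master, not a real host from the log

conn = pymysql.connect(host=HOST, read_default_file="~/.my.cnf")
with conn.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute("SHOW SLAVE STATUS")  # SHOW REPLICA STATUS works as an alias on recent MariaDB
    status = cur.fetchone() or {}

for key in ("Slave_IO_Running", "Slave_SQL_Running",
            "Last_SQL_Errno", "Last_SQL_Error", "Seconds_Behind_Master"):
    print(f"{key}: {status.get(key)}")
```

As in the log, a duplicate-key error on an old master is often easiest to fix by recloning from a healthy replica, which also picks up any schema changes applied in the meantime.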
[13:19:47] ok the replicas in s5 codfw are done
[13:21:57] I'm still handling the overload of backups
[13:22:04] cancelling a lot of jobs now
[13:22:14] which will mean we will get more complaints before recoveries
[13:22:20] the replicas in s4 are still running the schema change
[13:25:56] yeah, s4 is big, schema changes take longer
[13:27:20] PROBLEM - MariaDB sustained replica lag on s7 on db1181 is CRITICAL: 83.75 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[13:28:26] sigh
[13:28:53] should recover now
[13:29:18] RECOVERY - MariaDB sustained replica lag on s7 on db1181 is OK: (C)10 ge (W)5 ge 4.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104
[14:29:37] re: replicas in s4, the alter table got stuck, I un-stuck it by killing a query. details in T404090
[14:29:37] T404090: [wikireplicas] clouddb1015 replication lag when applying ALTER TABLE - https://phabricator.wikimedia.org/T404090
[14:39:58] thanks
[15:10:56] according to puppet in s1 eqiad the candidate master is:
[15:10:57] db1163.yaml-# candidate master for s1
[15:19:31] ...which is flagged as updated on https://phabricator.wikimedia.org/T399540 and is indeed running 10.6.22
[15:20:38] also db1244.yaml-# candidate master for s4 <-- is up to date
[15:22:13] db1193.yaml-# candidate master for s8 <--- this is also up to date but not flagged in the task
[15:46:02] might have found an issue with bacula backups on trixie hosts
[15:46:45] on trixie I got bacula-fd 15.0.3-3.. but before we were only on 9.6.7-7, quite the version jump
[15:46:59] and the version of bacula-fd and director were matching before
[15:47:51] and now seeing errors from bacula-fd that potentially are due to the version mismatch
[15:48:11] ~ "hello.c:191-650228 Bad caps from SD: auth cram-md5"
[15:54:43] jynus: ^^^
[15:56:21] I don't think client should be upgraded before storage
[15:56:53] can the previous version be backported?
[15:58:27] no idea at this point. just barely finding out that it failed and raising because these are among the very first trixie hosts
[15:58:54] also no urgency because the real prod data is still on previous hosts
[15:59:26] if moritzm is around I would like to ping him for awareness ^
[15:59:30] the internet says that newer -fd version should work with older director.. but not the other way around
[15:59:40] but maybe not that many versions
[15:59:58] I get the opposite advice
[16:00:08] don't worry, I can make a ticket so it can be debugged async
[16:00:46] New Catalog format in version 13.0 and greater
[16:00:48] seems tricky to solve if indeed versions are supposed to match but we are mixing distro versions
[16:01:36] maybe they dropped MD5 or something.. based on the error message.. speculation
[16:01:42] gotcha
[16:01:53] yeah, https://docs.baculasystems.com/BEUpgradeAndRemoval/BEUpgrade/BEUpgradeOnLinux/BEUpgradeOnLinuxPreparation/index.html#file-daemon confirms the file daemon should be the last to be upgraded
[16:02:55] it would be nice to have the 9.6 client on trixie while the transition happens
[16:03:09] ack
[16:03:32] is there a host I can do manual testing on first?
[16:06:12] jynus: sure, people1005.eqiad.wmnet
[16:12:56] Amir1: I'm running a change on s8 codfw replicas
[16:14:29] also starting the s2 codfw master flip as discussed
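The stuck alter on clouddb1015 (T404090, 14:29) was resolved by killing a query; below is a generic, hedged sketch of that kind of diagnosis, not the exact steps taken in the task. It lists sessions waiting on a table metadata lock plus long-running candidates that may be blocking them; the 300-second cut-off is an arbitrary example and nothing is killed automatically.

```python
# Generic sketch: when an ALTER sits in "Waiting for table metadata lock",
# some other long-running query usually has to finish or be killed first.
# This only prints candidates; the KILL is left to the operator.
import pymysql

HOST = "clouddb1015.eqiad.wmnet"  # host named in T404090; adjust as needed

conn = pymysql.connect(host=HOST, read_default_file="~/.my.cnf")
with conn.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute(
        "SELECT ID, USER, DB, TIME, STATE, INFO "
        "FROM information_schema.PROCESSLIST "
        "WHERE COMMAND != 'Sleep' ORDER BY TIME DESC"
    )
    # normalize column-name case so the keys below are predictable
    threads = [{k.lower(): v for k, v in row.items()} for row in cur.fetchall()]

waiting = [t for t in threads if t["state"] == "Waiting for table metadata lock"]
blockers = [t for t in threads if t["time"] > 300 and t not in waiting]  # 300s is an arbitrary example

for t in waiting:
    print(f"blocked: id={t['id']} query={(t['info'] or '')[:120]}")
for t in blockers:
    print(f"possible blocker (KILL {t['id']} ?): {t['time']}s, state={t['state']}: {(t['info'] or '')[:120]}")
```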
[16:19:04] I have a second, entirely unrelated topic. This one is not for backup people but a DBA thing. I would like to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180999 and the puppet compiler shows this as a no-op on DB servers. But it is still mildly scary (at least without talking to someone here) because it touches modules/mariadb/templates/default.my.cnf.erb and how
[16:19:10] innodb_buffer_pool_size is calculated. The point would be to make mariadb classes usable on trixie by replacing a puppet legacy fact with a modern syntax. It should not change anything. Afaict this is needed because of newer ruby versions on trixie, while the puppet version stays the same.
[16:19:56] the bacula stuff is a bit of a mess because there is a dependency on the openssl lib
[16:20:24] oof, I see
[16:21:01] so it is not just a simple downgrade, I wonder if the packages have been customized to make it work on debian stable - 2
[16:21:18] as I see a dependency on libssl1.1
[16:21:34] but in theory it should depend on 3
[16:27:47] it is a libssl3 vs libssl3t64 dependency, would need at the very least a recompilation
[16:30:30] s2 codfw master flipped - running schema change on the ex-master
[18:10:46] running s1 codfw rc_new schema change as discussed ( T402763 )
[18:10:47] T402763: Drop rc_new from recentchanges table in wmf production - https://phabricator.wikimedia.org/T402763
[18:16:45] re: bacula. ack! thanks for making the ticket
[18:17:46] merged your change to ignore failures on trixie people* hosts. postponing switching to them. prio: low (on my end)
[18:19:04] (this kind of thing is why we test, knowingly being an early adopter)
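For reference, T402763 is the drop of rc_new from recentchanges mentioned at 09:19 and 18:10. The sketch below is not the production auto_schema run (which handles depooling, replication and the --check/--dc-masters flow mentioned earlier); it only reports, per wiki database on one host, whether the column is still present and what the bare ALTER would be. The host name is an example.

```python
# Illustration only for T402763: list which wiki databases on a host still
# have recentchanges.rc_new and print the bare ALTER that would remove it.
# This is NOT the production schema-change script.
import pymysql

HOST = "db1163.eqiad.wmnet"  # example host; pick the replica/master you actually care about

conn = pymysql.connect(host=HOST, read_default_file="~/.my.cnf")
with conn.cursor() as cur:
    cur.execute(
        "SELECT table_schema FROM information_schema.columns "
        "WHERE table_name = 'recentchanges' AND column_name = 'rc_new' "
        "ORDER BY table_schema"
    )
    pending = [row[0] for row in cur.fetchall()]

if not pending:
    print(f"rc_new already gone from every recentchanges table on {HOST}")
for db in pending:
    print(f"-- {db} still has rc_new; the bare change would be:")
    print(f"ALTER TABLE `{db}`.recentchanges DROP COLUMN rc_new;")
```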