[04:30:28] Going to start s2 primary switch
[05:03:16] I am going to do s3 switchover too
[05:48:33] starting T367055
[05:48:34] T367055: Switchover es6 master (es1038 -> es1037) - https://phabricator.wikimedia.org/T367055
[05:59:41] I depooled db1233 on s2 as it had issues, it's downtimed for 24hrs, I'm continuing on w/ ↑
[05:59:50] I will take care of it, thanks
[07:11:10] marostegui: can I go for s2 old eqiad master T360332 ?
[07:11:11] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[07:11:25] arnaudb: not yet
[07:11:29] I will ping you
[07:11:34] thanks!
[07:20:56] there is a power outage in my building
[07:21:14] if not the neighbourhood
[07:21:36] I hope it takes less time to get that fixed than your internet :)
[07:21:55] omg 😭
[07:22:37] will I have to eat raw pizza from the freezer at lunch?
[07:22:59] _joe_ would love that
[07:31:25] I called the energy provider, they gave me a 2hr estimate for the fix
[07:31:41] the whole neighbourhood is in the dark
[08:15:09] :(
[08:23:57] arnaudb: old s2 eqiad master can be used, db1222, it is pooled
[08:30:46] thanks!
[08:32:28] I am installing 10.11 on db1153, an unused x2 replica
[09:17:18] 🎉 both internet and power came back
[09:17:29] \o/
[09:24:35] will run my schema change and be done with T360332
[09:24:36] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[09:53:10] heads up I'm depooling and restarting clouddb1014 (T366555)
[10:28:09] I am going to switch s5 codfw
[10:37:56] clouddb1014 restarted and repooled. proceeding with clouddb1015
[10:41:45] dhinus: are you also running apt-get upgrade during the maintenance?
[10:41:56] it would be nice to pick up the latest mariadb version
[10:42:14] I'm not, I can do it for the next few oens
[10:42:16] *ones
[10:42:52] yes please
[10:43:09] there's also T365424 but that one takes a bit longer
[10:43:11] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[10:43:17] yeah
[10:43:41] s5 codfw switched
[10:48:27] I noticed I don't get any alert when I stop the services in clouddbs, is that expected? (I didn't set a downtime)
[10:49:20] dhinus: I don't know, I think your team has alerts for that
[10:49:32] We have alerts only for lag/replication
[10:49:41] I thought we had, but maybe we don't :) I'll double check
[10:49:48] but those hosts are downtimed
[10:50:04] all of them?
[10:50:07] We'd have gotten the replication ones on IRC
[10:50:18] dhinus: clouddb1015 is downtimed from what I can see
[10:50:26] yes that's the reboot cookbook that downtimed it
[10:51:06] but before starting the cookbook it was not downtimed I think. I will check again when I do the next one
[10:51:22] dhinus: there's some time between stopping the service and alerting
[10:51:31] re: apt upgrades, would you do all or just mariadb?
[10:51:36] "grub-common grub-pc grub-pc-bin grub2-common libc-bin libc-l10n libc6 linux-perf linux-perf-5.10 locales python3-wmfmariadbpy wmf-mariadb106 wmfmariadbpy-common"
[10:51:40] dhinus: I normally do everything
[10:51:53] ack
[10:53:53] Emperor: What should we do with that recurrent alert: FIRING: [88x] DiskSpace: Disk space thanos-be1001:9100:/srv/swift-storage/sdd1 4.323% free - maybe silence it and track it on a task?
[10:54:38] marostegui: it's related to T351927
[10:54:39] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927
[10:55:07] Emperor: but can we silence it in the meantime? it's firing every 10 minutes
[10:55:23] every 10 minutes?!?
[10:55:37] from what I can see in our -feed channel, yep
[10:56:21] I think godog merged some changes (via T357747 ) today, so hopefully it'll resolve soon. Maybe silence for 24h? If those changes don't resolve it we probably need further action
[10:56:21] T357747: Capacity planning/estimation for Thanos - https://phabricator.wikimedia.org/T357747
[10:58:22] yeah space is going to be freed in the next 48h, I suggest silencing/acking alerts for known issues indeed
[11:00:23] OK, I'll put in a silence for 2 days
[11:01:22] thank you both!
[11:02:10] {{done}}
[11:08:23] sure np marostegui
[11:09:59] clouddb1015 rebooted, apt-upgraded and repooled
[11:35:24] Amir1: https://phabricator.wikimedia.org/T352010 I've done a couple of dc masters today
[11:35:38] I just finished s5 codfw
[11:35:51] Today I did s2 primary, s3 primary and s5 codfw
[11:36:17] You cannot use s5 codfw master
[11:36:27] but s2 is done from my side and so is s3 eqiad
[11:36:38] Thanks. I'll get to it now
[11:37:00] s3 old primary is being repooled, you want me to stop it?
[11:37:02] and you take care of it?
[11:44:49] Amir1: ^
[11:47:14] marostegui: if you depool it that'd be amazing
[11:47:20] ok
[11:47:21] Or stop the repool
[11:47:51] Amir1: stopped and depooled
[11:48:59] Thanks!
[11:50:00] I am going to go afk till the evening, I've got a meeting later and I started working at 6am :)
[11:50:15] 👋
[11:56:15] PROBLEM - MariaDB Replica SQL: s3 on db1240 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Index for table recentchanges is corrupt: try to repair it on query. Default database: cywiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:56:24] it's a backup source, jynus, fyi
[11:56:38] it seems that it caught the index issue that occurred yesterday
[11:57:36] I've downtimed the host to avoid paging
[12:03:42] Jaime is on holidays today, please rebuild that table
[12:04:13] arnaudb: normally it's best to create a task as otherwise it's easy to forget about it
[12:04:29] ack, I forgot
[12:06:51] will run it on all the other databases once cywiki is rebuilt
[12:49:09] * dhinus is depooling and rebooting clouddb1016
[13:47:26] * dhinus is depooling and rebooting clouddb1017
[15:24:17] es1038 is back to normal after topranks maintenance
[18:29:17] PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 31.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
[18:31:17] RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104
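One way to do the rebuild requested at 12:03 is a forced in-place rebuild of the corrupt table on the affected host. A minimal sketch, assuming the statement is kept out of the binlog so it only touches db1240 (the backup source from the 11:56 alert); the table and wiki names come from the alert, but the exact procedure is an assumption rather than a record of what was run:

```sql
-- On db1240 only: the SQL thread is already stopped by the corruption error
-- and the host is downtimed, so the rebuild can run locally.
SET SESSION sql_log_bin = 0;                      -- keep the ALTER out of the binlog
ALTER TABLE cywiki.recentchanges ENGINE=InnoDB;   -- rebuilds the table and all of its indexes
START SLAVE;                                      -- resume replication once the rebuild completes
```

The same statement would then be repeated per wiki database for the "all the other databases" pass mentioned at 12:06.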