[01:08:48] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 23.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:13:04] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:15:12] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 24.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:16:40] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[06:53:45] I'd like to reboot cumin2002; when would be a good time for all the Data Persistence things running from it?
[07:10:49] a few minutes to finish backups on my side, moritzm, and it will be free
[07:18:38] I don't plan to reboot it now :-) usually this needs a few days of heads-up time, so mostly just reaching out for now
[09:48:30] I see no ongoing sessions, so Amir1, I know you are using cumin1, but are you using 2002?
[09:48:40] or planning to?
[09:49:12] I barely use cumin2002. For cumin1001 arnaudb is doing a schema change which will take a couple of days
[09:49:28] oh, there may be some elastic thing, moritzm, ask bking
[09:50:16] moritzm: for me, any time except from 0 to 8:30ish UTC is ok to reboot
[09:50:30] (which is when snapshots run)
[09:51:24] I see some ongoing watches there, possibly related to elastic?
[11:56:16] ack, thanks. I'll check with Brian and otherwise I'll reboot cumin2002 tomorrow after 9:00 UTC
[13:51:20] Emperor: moving here from #mediawiki_security: TBH, I jumped straight to https://wikitech.wikimedia.org/wiki/Swift/How_To#Create_a_new_swift_account_(Thanos_Cluster) (or a variation thereof), not realizing that https://wikitech.wikimedia.org/wiki/Swift/How_To#Rollover_a_Swift_key was meant to be SOP in this scenario
[13:51:51] you're right, having A Way™ is important
[13:52:44] 's not the end of the world in any case :)
[13:53:07] So, one way or another, I'll take this as an opportunity to iterate on the documentation... but I wonder if there isn't merit in including a "new user" process
[13:53:36] I chose that because it seemed like the seamless method
[13:54:30] i.e. you could just deploy and not worry about processes simultaneously accessing the old and new users
[13:54:43] then clean up the old user
[13:55:28] I guess; but that's more fiddling around in multiple repos to add and remove user accounts. And in practice clients have typically cached a session, so we've not seen spikes in errors when e.g. rolling over the mw key
[13:55:56] Ok
[13:56:07] [a consequence of a new user needing to be added in 3 different places, whereas just changing the password is 1 change in one repo]
[13:56:17] fair enough, I don't want to create problems where none exist :)
[15:10:25] Amir1: Re testreduce; I am guessing we only got added because of the potential grant things (also maybe backups just in case), while someone else will take care of the app migration
[15:13:53] got confirmation from Brian that cumin2002 can be rebooted, will do it tomorrow at 9:00 UTC
[15:16:51] I see
[15:16:53] thanks
[15:17:42] actually no backups, will comment that just in case
[19:25:27] thanks for the patches urandom! Just +1'd
[19:26:14] 👍
[20:53:03] PROBLEM - MariaDB sustained replica lag on s5 on db2113 is CRITICAL: 26 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2113&var-port=9104
[20:55:55] RECOVERY - MariaDB sustained replica lag on s5 on db2113 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2113&var-port=9104
[21:41:37] Just a heads up: I'm running a patched querysampler query about every 5 seconds for each clouddb1013-clouddb1020 DB port (via the proxies). If you need to stop it, you'll find it running under my ID on clouddb-wikireplicas-query-1.clouddb-services.eqiad1.wikimedia.cloud.
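The replica-lag alerts above compare a measured lag value against a warning threshold of 1 and a critical threshold of 2 (that is what the "(C)2 ge (W)1 ge 0" output encodes). As a minimal sketch of that comparison only: the production check is driven by the monitoring stack rather than an ad-hoc script, and the host, port, and credentials below are hypothetical placeholders.

```python
import pymysql

WARN, CRIT = 1, 2  # thresholds as shown in the alert text: (C)2 ge (W)1 ge <lag>

def replica_lag(host, port, user, password):
    """Return Seconds_Behind_Master for a MariaDB replica, or None if replication is not running."""
    conn = pymysql.connect(host=host, port=port, user=user, password=password)
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row["Seconds_Behind_Master"] if row else None
    finally:
        conn.close()

# Placeholder connection details, not the real check's configuration.
lag = replica_lag("db1217.example", 3306, "watchdog", "secret")
if lag is None or lag >= CRIT:
    print(f"CRITICAL - sustained replica lag: {lag} ge {CRIT}")
elif lag >= WARN:
    print(f"WARNING - sustained replica lag: {lag} ge {WARN}")
else:
    print(f"OK - replica lag: {lag}")
```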
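On the Swift discussion: the case for rolling over the existing key rather than creating a new account is that a new account must be added in three places, while a key change is a single change in one repo, and clients with cached sessions keep working through the rollover. A quick way to confirm the new key is accepted, assuming python-swiftclient and purely hypothetical auth details, might look like:

```python
from swiftclient.client import Connection

# Hypothetical auth URL, account, and key: substitute the real values after the rollover.
conn = Connection(
    authurl="https://thanos-swift.example.org/auth/v1.0",
    user="AUTH_example:someuser",
    key="new-secret-key",
    auth_version="1",
)
headers, containers = conn.get_account()  # raises ClientException if the new key is rejected
print(f"account reachable, {len(containers)} containers visible")
```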
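The last message describes a patched querysampler polling each clouddb port roughly every 5 seconds via the proxies. A minimal sketch of that kind of per-port polling loop, with hypothetical hosts, ports, credentials, and a stand-in sampling query (the real tool and its query differ), could be:

```python
import time
import pymysql

# Hypothetical proxy endpoints standing in for the clouddb1013-clouddb1020 ports.
TARGETS = [("clouddb-proxy.example", 3311), ("clouddb-proxy.example", 3315)]
INTERVAL = 5  # seconds between samples, per the message above

def sample(host, port):
    """Run one sampling query against a single host:port and return the rows."""
    conn = pymysql.connect(host=host, port=port, user="sampler", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW FULL PROCESSLIST")  # stand-in for the querysampler's real query
            return cur.fetchall()
    finally:
        conn.close()

while True:
    for host, port in TARGETS:
        rows = sample(host, port)
        print(f"{host}:{port} -> {len(rows)} sessions sampled")
    time.sleep(INTERVAL)
```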