[01:08:48] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 23.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:13:04] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:15:12] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 24.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:16:40] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[06:53:45] I'd like to reboot cumin2002; when would be a good time for all the Data Persistence things running from it?
[07:10:49] a few minutes to finish backups on my side, moritzm, and it will be free
[07:18:38] I don't plan to reboot it now :-) usually this needs a few days of heads-up time, so mostly just reaching out for now
[09:48:30] I see no ongoing sessions, so Amir1, I know you are using cumin1, but are you using 2002?
[09:48:40] or planning to?
[09:49:12] I barely use cumin2002. For cumin1001 arnaudb is doing a schema change which will take a couple of days
[09:49:28] oh, there may be some elastic thing, moritzm, ask bking
[09:50:16] moritzm: for me, any time except from 0 to 8:30ish UTC is ok to reboot
[09:50:30] (which is when snapshots run)
[09:51:24] I see some ongoing watches there, possibly related to elastic?
[11:56:16] ack, thanks. I'll check with Brian and otherwise I'll reboot cumin2002 tomorrow after 9:00 UTC
[13:51:20] Emperor: moving here from #mediawiki_security: TBH, I jumped straight to https://wikitech.wikimedia.org/wiki/Swift/How_To#Create_a_new_swift_account_(Thanos_Cluster) (or a variation thereof), not realizing that https://wikitech.wikimedia.org/wiki/Swift/How_To#Rollover_a_Swift_key was meant to be SOP in this scenario
[13:51:51] you're right, having A Way™ is important
[13:52:44] 's not the end of the world in any case :)
[13:53:07] So, one way or another, I'll take this as an opportunity to iterate on the documentation... but I wonder if there isn't merit in including a "new user" process
[13:53:36] I chose that because it seemed like the seamless method
[13:54:30] i.e. you could just deploy and not worry about processes simultaneously accessing the old and new users
[13:54:43] then clean up the old user
[13:55:28] I guess; but that's more fiddling around in multiple repos to add and remove user accounts. And in practice clients have typically cached a session, so we've not seen spikes in errors when e.g. rolling over the mw key
[13:55:56] Ok
[13:56:07] [a consequence of a new user needing to be added in 3 different places, whereas just changing the password is 1 change in one repo]
[13:56:17] fair enough, I don't want to create problems where none exist :)
[15:10:25] Amir1: Re testreduce; I am guessing we only got added because of the potential grant things (also maybe backups just in case), while someone else will take care of the app migration
[15:13:53] got confirmation from Brian that cumin2002 can be rebooted, will do it tomorrow at 9:00 UTC
[15:16:51] I see
[15:16:53] thanks
[15:17:42] actually no backups, will comment that just in case
[19:25:27] thanks for the patches urandom! Just +1'd
[19:26:14] 👍
[20:53:03] PROBLEM - MariaDB sustained replica lag on s5 on db2113 is CRITICAL: 26 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2113&var-port=9104
[20:55:55] RECOVERY - MariaDB sustained replica lag on s5 on db2113 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2113&var-port=9104
[21:41:37] Just a heads up: I'm running a patched querysampler query about every 5 seconds for each clouddb1013-clouddb1020 DB port (via the proxies). If you need to stop it, you'll find it running under my ID on clouddb-wikireplicas-query-1.clouddb-services.eqiad1.wikimedia.cloud.
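The replica-lag alerts above compare a measured lag value against a warning threshold of 1 and a critical threshold of 2 (that is what the "(C)2 ge (W)1 ge 0" output encodes). As a minimal sketch of that comparison only: the production check is driven by the monitoring stack rather than an ad-hoc script, and the host, port, and credentials below are hypothetical placeholders.

```python
import pymysql

WARN, CRIT = 1, 2  # thresholds as shown in the alert text: (C)2 ge (W)1 ge <lag>

def replica_lag(host, port, user, password):
    """Return Seconds_Behind_Master for a MariaDB replica, or None if replication is not running."""
    conn = pymysql.connect(host=host, port=port, user=user, password=password)
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row["Seconds_Behind_Master"] if row else None
    finally:
        conn.close()

# Placeholder connection details, not the real check's configuration.
lag = replica_lag("db1217.example", 3306, "watchdog", "secret")
if lag is None or lag >= CRIT:
    print(f"CRITICAL - sustained replica lag: {lag} ge {CRIT}")
elif lag >= WARN:
    print(f"WARNING - sustained replica lag: {lag} ge {WARN}")
else:
    print(f"OK - replica lag: {lag}")
```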
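On the Swift discussion: the case for rolling over the existing key rather than creating a new account is that a new account must be added in three places, while a key change is a single change in one repo, and clients with cached sessions keep working through the rollover. A quick way to confirm the new key is accepted, assuming python-swiftclient and purely hypothetical auth details, might look like:

```python
from swiftclient.client import Connection

# Hypothetical auth URL, account, and key: substitute the real values after the rollover.
conn = Connection(
    authurl="https://thanos-swift.example.org/auth/v1.0",
    user="AUTH_example:someuser",
    key="new-secret-key",
    auth_version="1",
)
headers, containers = conn.get_account()  # raises ClientException if the new key is rejected
print(f"account reachable, {len(containers)} containers visible")
```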
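The last message describes a patched querysampler polling each clouddb port roughly every 5 seconds via the proxies. A minimal sketch of that kind of per-port polling loop, with hypothetical hosts, ports, credentials, and a stand-in sampling query (the real tool and its query differ), could be:

```python
import time
import pymysql

# Hypothetical proxy endpoints standing in for the clouddb1013-clouddb1020 ports.
TARGETS = [("clouddb-proxy.example", 3311), ("clouddb-proxy.example", 3315)]
INTERVAL = 5  # seconds between samples, per the message above

def sample(host, port):
    """Run one sampling query against a single host:port and return the rows."""
    conn = pymysql.connect(host=host, port=port, user="sampler", password="secret")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW FULL PROCESSLIST")  # stand-in for the querysampler's real query
            return cur.fetchall()
    finally:
        conn.close()

while True:
    for host, port in TARGETS:
        rows = sample(host, port)
        print(f"{host}:{port} -> {len(rows)} sessions sampled")
    time.sleep(INTERVAL)
```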