[01:06:54] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 4.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:08:10] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[04:11:20] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:42:56] marostegui: morning!
[05:43:12] o/
[05:43:24] I'll go grab a coffee before the switchover
[05:43:28] sure
[06:15:33] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:38:50] I have finished all the reboots, only the active masters are pending
[06:50:36] marostegui: thanks!
[06:50:49] what do you want me to do for the 10.6 upgrade?
[06:50:58] T311106
[06:50:58] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
[06:51:14] Amir1: We need to see how we can try to simulate that MW load
[06:51:28] I thought about it and I have a fun idea
[07:07:03] Amir1: do you need to run something on db1136? (old s7 master)
[07:07:06] Or should I repool it?
[07:12:43] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:14:50] marostegui: I don't have anything on drifts
[07:14:54] so it should be fine
[07:14:59] oki, thanks
[07:28:24] re: 809091, will any of those replace backup sources?
[07:35:53] Yes, db2078 (misc)
[07:36:01] But I was planning to ping you beforehand
[07:36:04] They are not even racked yet
[07:36:09] So no need to worry about it for now
[07:36:40] is db2175 skipped for any reason?
[07:36:57] Uh
[07:37:05] It shouldn't be, no
[07:37:31] Good catch, I will amend it
[07:37:34] I don't think it would have been a worrying issue, to be fair
[09:14:03] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:17:46] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:13:39] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:39:26] remember I had 2 main suspects, P_S and the thread pool? one is already showing up on Amir's flamegraphs
[12:40:35] (could be misleading, as I only looked at it for a second, but interesting, I would say)
[13:15:59] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:45:19] Thanks for the graphs Amir1, let's dig in there
[13:48:40] Once I'm back I'll put more pressure on it if you think it's needed
[14:03:23] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 174 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[14:08:57] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[14:20:17] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:12:06] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:48:27] jynus: FYI I'm adding the already mentioned AAAA record to backup2002 now... finally :)
[15:49:01] as long as you do it before 00:00 UTC, that's ok
[15:49:07] the hosts are idle until then
[15:49:38] :)
[15:49:56] heads up on T311526 in case phab is naughty with notifications (but I don't think it is an emergency)
[15:49:57] T311526: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526
[15:53:19] thanks jynus
[15:53:24] I will take care of it
[15:53:53] see also my comment, just posted
[15:54:14] yeah
[18:50:32] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 25.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[18:56:44] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 29 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[18:56:54] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[19:03:47] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
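The "MariaDB sustained replica lag" pages in this log fire on the thresholds shown in the alert text: warning at 1 second and critical at 2 seconds of sustained lag. The production check is Prometheus/heartbeat based, so the Python sketch below is only a minimal, hypothetical stand-in that polls SHOW SLAVE STATUS directly; the host, user, password, and the 3321 port (guessed from the 13321 exporter port in the Grafana links) are placeholders, not the real configuration.

#!/usr/bin/env python3
"""Minimal sketch of a "sustained replica lag" probe (NOT the production
check, which reads heartbeat data via Prometheus).  Thresholds mirror the
alert output: warning >= 1s, critical >= 2s, averaged over a few samples."""
import time

import pymysql

WARN_S, CRIT_S = 1, 2       # (W)1 / (C)2 from the alert text
SAMPLES, INTERVAL_S = 5, 2  # "sustained": average several readings

def replica_lag(conn):
    """Return Seconds_Behind_Master, or None if replication is not running."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        return None if row is None else row["Seconds_Behind_Master"]

def main():
    # Placeholders: host/user/password are illustrative; port 3321 is guessed
    # from the 13321 exporter port in the Grafana links (m1 instance).
    conn = pymysql.connect(host="db1117.eqiad.wmnet", port=3321,
                           user="watchdog", password="change-me",
                           read_timeout=5)
    lags = []
    for _ in range(SAMPLES):
        lag = replica_lag(conn)
        if lag is not None:
            lags.append(lag)
        time.sleep(INTERVAL_S)
    avg = sum(lags) / len(lags) if lags else float("inf")
    state = "CRITICAL" if avg >= CRIT_S else "WARNING" if avg >= WARN_S else "OK"
    print(f"{state}: sustained replica lag {avg:.1f}s over {len(lags)} samples")

if __name__ == "__main__":
    main()

Averaging several samples rather than alerting on a single reading is what keeps short blips, like the flaps above that recover within a couple of minutes, from turning into lasting pages.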
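The flapping "Check unit status of swift_ring_manager" alerts on thanos-fe1001 assert that the timer-driven systemd unit has not ended up in a failed state (see the linked Managing_systemd_timers page). A rough, assumed equivalent of that assertion using only standard systemctl and journalctl calls could look like the sketch below; the ".service" suffix on the unit name is a guess based on the usual timer/service pairing.

#!/usr/bin/env python3
"""Assumed equivalent of the "Check unit status of swift_ring_manager" check:
exit 2 if the unit is in a failed state, 0 otherwise."""
import subprocess
import sys

UNIT = "swift_ring_manager.service"  # assumed unit name

def unit_failed(unit: str) -> bool:
    # `systemctl is-failed` exits 0 only when the unit is in the failed state.
    return subprocess.run(["systemctl", "is-failed", "--quiet", unit]).returncode == 0

def main():
    if unit_failed(UNIT):
        # Pull the last journal lines so the alert explains why it failed.
        log = subprocess.run(
            ["journalctl", "-u", UNIT, "-n", "20", "--no-pager"],
            capture_output=True, text=True).stdout
        print(f"CRITICAL: {UNIT} is in a failed state\n{log}")
        sys.exit(2)
    print(f"OK: {UNIT}")
    sys.exit(0)

if __name__ == "__main__":
    main()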
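On the 10.6 regression hunt (T311106), the two named suspects are P_S (performance_schema) overhead and the thread pool, and the flamegraphs mentioned at 12:39 are the kind of evidence that can separate them. The log does not say how those graphs were produced; the sketch below is one common way to get an on-CPU flamegraph of mariadbd with perf plus Brendan Gregg's FlameGraph scripts, with the install path, sampling duration, and process name all being assumptions.

#!/usr/bin/env python3
"""One possible way to produce an on-CPU flamegraph of mariadbd, similar in
spirit to the graphs discussed for T311106.  Assumes perf and the FlameGraph
scripts are installed; the process name is "mysqld" on older MariaDB builds."""
import subprocess

FLAMEGRAPH_DIR = "/opt/FlameGraph"  # assumed install location of the scripts
DURATION_S = "60"                   # sample for one minute

def main():
    # Oldest matching process = the server itself rather than a forked child.
    pid = subprocess.run(["pgrep", "-o", "mariadbd"], capture_output=True,
                         text=True, check=True).stdout.strip()
    # Sample on-CPU stacks at 99 Hz for the whole process.
    subprocess.run(["perf", "record", "-F", "99", "-g", "-p", pid,
                    "--", "sleep", DURATION_S], check=True)
    # perf.data -> folded stacks -> SVG.
    stacks = subprocess.run(["perf", "script"], capture_output=True,
                            text=True, check=True).stdout
    folded = subprocess.run([f"{FLAMEGRAPH_DIR}/stackcollapse-perf.pl"],
                            input=stacks, capture_output=True, text=True,
                            check=True).stdout
    svg = subprocess.run([f"{FLAMEGRAPH_DIR}/flamegraph.pl"],
                         input=folded, capture_output=True, text=True,
                         check=True).stdout
    with open("mariadbd-oncpu.svg", "w") as fh:
        fh.write(svg)
    print("wrote mariadbd-oncpu.svg")

if __name__ == "__main__":
    main()

In a graph like this, heavy performance_schema instrumentation tends to show up as wide frames in its own functions, while thread pool trouble tends to surface in scheduler and locking frames instead; either would support one of the two suspects named above.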
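For the AAAA record being added to backup2002 ahead of the 00:00 UTC backup window, a quick resolution check is enough to confirm it is live. The FQDN below is an assumption based on the 2xxx = codfw naming convention, not something stated in the log.

#!/usr/bin/env python3
"""Quick check that an AAAA record resolves for the given host."""
import socket

HOST = "backup2002.codfw.wmnet"  # assumed FQDN

def has_aaaa(host: str) -> bool:
    try:
        return bool(socket.getaddrinfo(host, None, family=socket.AF_INET6))
    except socket.gaierror:
        return False

if __name__ == "__main__":
    print(f"{HOST}: AAAA {'present' if has_aaaa(HOST) else 'missing'}")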