[01:06:54] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 4.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:08:10] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[04:11:20] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:42:56] marostegui: morning!
[05:43:12] o/
[05:43:24] I'll go grab a coffee before the switchover
[05:43:28] sure
[06:15:33] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:38:50] I have finished all the reboots, only the active masters are pending
[06:50:36] marostegui: thanks!
[06:50:49] what do you want me to do for the 10.6 upgrade?
[06:50:58] T311106
[06:50:58] T311106: Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106
[06:51:14] Amir1: We need to see how we can try to simulate that MW load
[06:51:28] I thought about it and I have a fun idea
[07:07:03] Amir1: do you need to run something on db1136? (old s7 master)
[07:07:06] Or should I repool it?
[07:12:43] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:14:50] marostegui: I don't have anything on drifts
[07:14:54] so it should be fine
[07:14:59] oki, thanks
[07:28:24] re: 809091, will any of those replace backup sources?
[07:35:53] Yes, db2078 (misc)
[07:36:01] But I was planning to ping you beforehand
[07:36:04] They are not even racked yet
[07:36:09] So no need to worry about it for now
[07:36:40] is db2175 skipped for any reason?
[07:36:57] Uh
[07:37:05] It shouldn't be, no
[07:37:31] Good catch, I will amend it
[07:37:34] I don't think it would have been a worrying issue, to be fair
[09:14:03] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:17:46] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:13:39] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:39:26] remember I had 2 main suspects, P_S and the thread pool? one is already showing up on Amir's flamegraphs
[12:40:35] (could be misleading, as I only looked at it for a second, but interesting, I would say)
[13:15:59] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:45:19] Thanks for the graphs Amir1, let's dig in there
[13:48:40] Once I'm back I'll put more pressure on it if you think it's needed
[14:03:23] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 174 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[14:08:57] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[14:20:17] PROBLEM - Check unit status of swift_ring_manager on thanos-fe1001 is CRITICAL: CRITICAL: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:12:06] RECOVERY - Check unit status of swift_ring_manager on thanos-fe1001 is OK: OK: Status of the systemd unit swift_ring_manager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:48:27] jynus: FYI I'm adding the already mentioned AAAA record to backup2002 now... finally :)
[15:49:01] as long as you do it before 00:00 UTC, that's ok
[15:49:07] the hosts are idle until then
[15:49:38] :)
[15:49:56] heads up on T311526 in case phab is naughty with notifications (but I don't think it is an emergency)
[15:49:57] T311526: es2033 crashed at Jun 28 ~15:34 - https://phabricator.wikimedia.org/T311526
[15:53:19] thanks jynus
[15:53:24] I will take care of it
[15:53:53] see also my comment, just posted
[15:54:14] yeah
[18:50:32] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 25.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[18:56:44] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 29 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[18:56:54] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[19:03:47] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
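The "MariaDB sustained replica lag" pages in this log fire on the thresholds shown in the alert text: warning at 1 second and critical at 2 seconds of sustained lag. The production check is Prometheus/heartbeat based, so the Python sketch below is only a minimal, hypothetical stand-in that polls SHOW SLAVE STATUS directly; the host, user, password, and the 3321 port (guessed from the 13321 exporter port in the Grafana links) are placeholders, not the real configuration.

#!/usr/bin/env python3
"""Minimal sketch of a "sustained replica lag" probe (NOT the production
check, which reads heartbeat data via Prometheus).  Thresholds mirror the
alert output: warning >= 1s, critical >= 2s, averaged over a few samples."""
import time

import pymysql

WARN_S, CRIT_S = 1, 2       # (W)1 / (C)2 from the alert text
SAMPLES, INTERVAL_S = 5, 2  # "sustained": average several readings

def replica_lag(conn):
    """Return Seconds_Behind_Master, or None if replication is not running."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
        return None if row is None else row["Seconds_Behind_Master"]

def main():
    # Placeholders: host/user/password are illustrative; port 3321 is guessed
    # from the 13321 exporter port in the Grafana links (m1 instance).
    conn = pymysql.connect(host="db1117.eqiad.wmnet", port=3321,
                           user="watchdog", password="change-me",
                           read_timeout=5)
    lags = []
    for _ in range(SAMPLES):
        lag = replica_lag(conn)
        if lag is not None:
            lags.append(lag)
        time.sleep(INTERVAL_S)
    avg = sum(lags) / len(lags) if lags else float("inf")
    state = "CRITICAL" if avg >= CRIT_S else "WARNING" if avg >= WARN_S else "OK"
    print(f"{state}: sustained replica lag {avg:.1f}s over {len(lags)} samples")

if __name__ == "__main__":
    main()

Averaging several samples rather than alerting on a single reading is what keeps short blips, like the flaps above that recover within a couple of minutes, from turning into lasting pages.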
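The flapping "Check unit status of swift_ring_manager" alerts on thanos-fe1001 assert that the timer-driven systemd unit has not ended up in a failed state (see the linked Managing_systemd_timers page). A rough, assumed equivalent of that assertion using only standard systemctl and journalctl calls could look like the sketch below; the ".service" suffix on the unit name is a guess based on the usual timer/service pairing.

#!/usr/bin/env python3
"""Assumed equivalent of the "Check unit status of swift_ring_manager" check:
exit 2 if the unit is in a failed state, 0 otherwise."""
import subprocess
import sys

UNIT = "swift_ring_manager.service"  # assumed unit name

def unit_failed(unit: str) -> bool:
    # `systemctl is-failed` exits 0 only when the unit is in the failed state.
    return subprocess.run(["systemctl", "is-failed", "--quiet", unit]).returncode == 0

def main():
    if unit_failed(UNIT):
        # Pull the last journal lines so the alert explains why it failed.
        log = subprocess.run(
            ["journalctl", "-u", UNIT, "-n", "20", "--no-pager"],
            capture_output=True, text=True).stdout
        print(f"CRITICAL: {UNIT} is in a failed state\n{log}")
        sys.exit(2)
    print(f"OK: {UNIT}")
    sys.exit(0)

if __name__ == "__main__":
    main()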
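On the 10.6 regression hunt (T311106), the two named suspects are P_S (performance_schema) overhead and the thread pool, and the flamegraphs mentioned at 12:39 are the kind of evidence that can separate them. The log does not say how those graphs were produced; the sketch below is one common way to get an on-CPU flamegraph of mariadbd with perf plus Brendan Gregg's FlameGraph scripts, with the install path, sampling duration, and process name all being assumptions.

#!/usr/bin/env python3
"""One possible way to produce an on-CPU flamegraph of mariadbd, similar in
spirit to the graphs discussed for T311106.  Assumes perf and the FlameGraph
scripts are installed; the process name is "mysqld" on older MariaDB builds."""
import subprocess

FLAMEGRAPH_DIR = "/opt/FlameGraph"  # assumed install location of the scripts
DURATION_S = "60"                   # sample for one minute

def main():
    # Oldest matching process = the server itself rather than a forked child.
    pid = subprocess.run(["pgrep", "-o", "mariadbd"], capture_output=True,
                         text=True, check=True).stdout.strip()
    # Sample on-CPU stacks at 99 Hz for the whole process.
    subprocess.run(["perf", "record", "-F", "99", "-g", "-p", pid,
                    "--", "sleep", DURATION_S], check=True)
    # perf.data -> folded stacks -> SVG.
    stacks = subprocess.run(["perf", "script"], capture_output=True,
                            text=True, check=True).stdout
    folded = subprocess.run([f"{FLAMEGRAPH_DIR}/stackcollapse-perf.pl"],
                            input=stacks, capture_output=True, text=True,
                            check=True).stdout
    svg = subprocess.run([f"{FLAMEGRAPH_DIR}/flamegraph.pl"],
                         input=folded, capture_output=True, text=True,
                         check=True).stdout
    with open("mariadbd-oncpu.svg", "w") as fh:
        fh.write(svg)
    print("wrote mariadbd-oncpu.svg")

if __name__ == "__main__":
    main()

In a graph like this, heavy performance_schema instrumentation tends to show up as wide frames in its own functions, while thread pool trouble tends to surface in scheduler and locking frames instead; either would support one of the two suspects named above.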
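For the AAAA record being added to backup2002 ahead of the 00:00 UTC backup window, a quick resolution check is enough to confirm it is live. The FQDN below is an assumption based on the 2xxx = codfw naming convention, not something stated in the log.

#!/usr/bin/env python3
"""Quick check that an AAAA record resolves for the given host."""
import socket

HOST = "backup2002.codfw.wmnet"  # assumed FQDN

def has_aaaa(host: str) -> bool:
    try:
        return bool(socket.getaddrinfo(host, None, family=socket.AF_INET6))
    except socket.gaierror:
        return False

if __name__ == "__main__":
    print(f"{HOST}: AAAA {'present' if has_aaaa(HOST) else 'missing'}")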