[01:09:14] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 11.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:09:42] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 11.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:10:50] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:11:18] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[06:08:05] Amir1: I have assigned this to you: https://phabricator.wikimedia.org/T335330
[08:23:50] what's up with the alert (SystemdUnitFailed) firing: systemd-timedated.service Failed on thanos-be1003:9100 that has been there for days? should we silence it?
[08:23:55] or should I create a task for it?
[08:34:01] probably something needs kicking, I'll have a look
[08:34:11] thanks!
[08:34:55] marostegui: you always leave the fun ones to me
[08:35:04] you are welcome
[08:35:05] 73049]: systemd-timedated.service: Failed to set up mount namespacing: /run/sys>
[08:35:08] o.O
[08:50:57] well, Advanced Debugging Strategy 2 worked
[08:51:32] reboot?
[08:51:32] XD
[08:51:47] yep
[08:52:22] interestingly, it's not running systemd-timedated at all now (systemd-timesyncd is running), so maybe nothing should have been trying to run the former.
[08:52:41] [and that might explain why it wouldn't restart OK if it's not meant to be running]
[09:17:40] Could I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/911771 please? It's adding the new backends to swift::storagehosts (which is the thing that needs doing before I can think about adding them to the rings)
[09:18:28] thanks :)
[09:18:35] :)
[10:18:41] oddly, nearly all of these new nodes have a sad fs; so far wipe and recreate seems to be doing the trick
[10:52:26] TFW you update a wikitech page to note a transitional process will likely run until 2028/29 FY...
[10:53:07] It's oddly specific
[10:55:25] 5-year hardware refresh cycle
[10:58:51] At least ATM it's too much hassle to migrate nodes to the newer layout (you have to drain, remove from the rings, reimage, re-add to the rings)
[10:59:33] ah I see
[11:00:22] we might at some point decide to do that, but I don't yet think it's worth the bother (each drain/re-add takes 2 weeks or so)
[16:15:52] Emperor: so with https://gerrit.wikimedia.org/r/c/operations/puppet/+/911779 merged, the 8 new nodes are immediately added to storage, and the 8 old nodes will start a (protracted) process of moving their data to other nodes?
[16:38:54] urandom: yes, the new nodes are gradually added and the old gradually removed
[18:09:30] Emperor: thanks
[18:11:15] Emperor: is the gradual nature of that just some sort of rate-limiting? Do the adds/drains create a new topology, one where the transfers happen over a period of time, or is there something more complicated happening?
[18:12:35] urandom: we have swift_ring_manager that makes the changes gradually
[18:13:13] oh, ok.
[18:13:17] urandom: some notes on this are https://wikitech.wikimedia.org/wiki/Swift/Ring_Management
[18:13:58] roughly, we have code that compares the current rings to the state specified in hosts.yaml, and adjusts the rings a little way towards that desired state
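A minimal sketch of the convergence idea described above, assuming a hosts.yaml that maps hosts to per-device target weights. This is not the real swift_ring_manager code; every name here (MAX_STEP, load_desired_weights, next_weights, the hosts.yaml layout, the example hostnames) is an illustrative assumption. The only point is that each run moves ring weights a bounded step towards the desired state, so new nodes fill up and removed nodes drain out over many runs rather than all at once.

```python
# Illustrative sketch only -- not the actual swift_ring_manager. It shows the
# general idea described above: compare current ring weights with the desired
# weights from hosts.yaml and nudge each device a bounded amount per run.
import yaml  # PyYAML; only needed by load_desired_weights

MAX_STEP = 500  # hypothetical cap on how far a weight may move per run


def load_desired_weights(path="hosts.yaml"):
    """Read target weights from a hosts.yaml-style file.

    Assumed layout: {hostname: {device: weight, ...}, ...}
    """
    with open(path) as fh:
        hosts = yaml.safe_load(fh)
    return {
        (host, dev): weight
        for host, devices in hosts.items()
        for dev, weight in devices.items()
    }


def next_weights(current, desired, max_step=MAX_STEP):
    """Return the weights to apply on this run.

    Each device moves at most max_step towards its target; devices missing
    from the desired state are treated as draining towards zero.
    """
    updated = {}
    for key in set(current) | set(desired):
        cur = current.get(key, 0)
        target = desired.get(key, 0)  # not in hosts.yaml => drain to 0
        delta = max(-max_step, min(max_step, target - cur))
        updated[key] = cur + delta
    return updated


if __name__ == "__main__":
    # Toy example: one old device draining out, one new device being added.
    current = {("ms-be1001", "sdb"): 4000, ("ms-be2001", "sdb"): 0}
    desired = {("ms-be2001", "sdb"): 4000}  # ms-be1001 no longer listed
    print(next_weights(current, desired))
    # Each weight moves at most MAX_STEP towards its target per run:
    # ms-be1001/sdb goes 4000 -> 3500, ms-be2001/sdb goes 0 -> 500.
```

Bounding the per-run step is what makes the process gradual; that would also fit the earlier remark that a full drain/re-add cycle takes a couple of weeks.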