[01:09:14] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 11.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:09:42] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 11.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:10:50] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:11:18] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[06:08:05] Amir1: I have assigned this to you: https://phabricator.wikimedia.org/T335330
[08:23:50] what's up with the alert (SystemdUnitFailed) firing: systemd-timedated.service Failed on thanos-be1003:9100 that has been there for days? should we silence it?
[08:23:55] or should I create a task for it?
[08:34:01] probably something needs kicking, I'll have a look
[08:34:11] thanks!
[08:34:55] marostegui: you always leave the fun ones to me
[08:35:04] you are welcome
[08:35:05] 73049]: systemd-timedated.service: Failed to set up mount namespacing: /run/sys>
[08:35:08] o.O
[08:50:57] well, Advanced Debugging Strategy 2 worked
[08:51:32] reboot?
[08:51:32] XD
[08:51:47] yep
[08:52:22] interestingly, it's not running systemd-timedated at all now (systemd-timesyncd is running), so maybe nothing should have been trying to run the former.
[08:52:41] [and that might explain why it wouldn't restart OK if it's not meant to be running]
[09:17:40] Could I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/911771 please? It's adding the new backends to swift::storagehosts (which is the thing that needs doing before I can think about adding them to the rings)
[09:18:28] thanks :)
[09:18:35] :)
[10:18:41] oddly, nearly all of these new nodes have a sad fs; so far wipe and recreate seems to be doing the trick
[10:52:26] TFW you update a wikitech page to note a transitional process will likely run until 2028/29 FY...
[10:53:07] It's oddly specific
[10:55:25] 5-year hardware refresh cycle
[10:58:51] At least ATM it's too much hassle to migrate nodes to the newer layout (you have to drain, remove from the rings, reimage, re-add to the rings)
[10:59:33] ah I see
[11:00:22] we might at some point decide to do that, but I don't yet think it's worth the bother (each drain/re-add takes 2 weeks or so)
[16:15:52] Emperor: so with https://gerrit.wikimedia.org/r/c/operations/puppet/+/911779 merged, the 8 new nodes are immediately added to storage, and the 8 old nodes will start a (protracted) process of moving their data to other nodes?
[16:38:54] urandom: yes, the new nodes are gradually added and the old gradually removed
[18:09:30] Emperor: thanks
[18:11:15] Emperor: is the gradual nature of that just some sort of rate-limiting? Do the adds/drains create a new topology, one where the transfers happen over a period of time, or is there something more complicated happening?
[18:12:35] urandom: we have swift_ring_manager that makes the changes gradually
[18:13:13] oh, ok.
[18:13:17] urandom: some notes on this are https://wikitech.wikimedia.org/wiki/Swift/Ring_Management
[18:13:58] roughly, we have code that compares the current rings to the state specified in hosts.yaml, and adjusts the rings a little way towards that desired state
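A minimal sketch of the convergence idea described above, assuming a hosts.yaml that maps hosts to per-device target weights. This is not the real swift_ring_manager code; every name here (MAX_STEP, load_desired_weights, next_weights, the hosts.yaml layout, the example hostnames) is an illustrative assumption. The only point is that each run moves ring weights a bounded step towards the desired state, so new nodes fill up and removed nodes drain out over many runs rather than all at once.

```python
# Illustrative sketch only -- not the actual swift_ring_manager. It shows the
# general idea described above: compare current ring weights with the desired
# weights from hosts.yaml and nudge each device a bounded amount per run.
import yaml  # PyYAML; only needed by load_desired_weights

MAX_STEP = 500  # hypothetical cap on how far a weight may move per run


def load_desired_weights(path="hosts.yaml"):
    """Read target weights from a hosts.yaml-style file.

    Assumed layout: {hostname: {device: weight, ...}, ...}
    """
    with open(path) as fh:
        hosts = yaml.safe_load(fh)
    return {
        (host, dev): weight
        for host, devices in hosts.items()
        for dev, weight in devices.items()
    }


def next_weights(current, desired, max_step=MAX_STEP):
    """Return the weights to apply on this run.

    Each device moves at most max_step towards its target; devices missing
    from the desired state are treated as draining towards zero.
    """
    updated = {}
    for key in set(current) | set(desired):
        cur = current.get(key, 0)
        target = desired.get(key, 0)  # not in hosts.yaml => drain to 0
        delta = max(-max_step, min(max_step, target - cur))
        updated[key] = cur + delta
    return updated


if __name__ == "__main__":
    # Toy example: one old device draining out, one new device being added.
    current = {("ms-be1001", "sdb"): 4000, ("ms-be2001", "sdb"): 0}
    desired = {("ms-be2001", "sdb"): 4000}  # ms-be1001 no longer listed
    print(next_weights(current, desired))
    # Each weight moves at most MAX_STEP towards its target per run:
    # ms-be1001/sdb goes 4000 -> 3500, ms-be2001/sdb goes 0 -> 500.
```

Bounding the per-run step is what makes the process gradual; that would also fit the earlier remark that a full drain/re-add cycle takes a couple of weeks.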