[01:09:26] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 10.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:18] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[16:33:08] * urandom > Run puppet on the host twice, reboot, and run puppet again
[16:33:12] LOL
[16:33:35] (via https://wikitech.wikimedia.org/wiki/Swift/How_To#Add_a_proxy_node_to_the_cluster)
[16:35:31] I'm a bit Chesterton's fence about that bit of the instructions :-/
[16:39:11] Step N: light a black candle
[16:40:17] also, TIL: Chesterton's fence
[16:41:26] but yeah, I will absolutely run puppet twice, reboot, and then run it again.... as absurd as it seems, I'm sure there is a reason
[16:42:16] we could ask go.dog about it some time, but that might be unkind at this time on a Friday [Europe] afternoon
[16:42:19] :)
[16:50:10] Emperor: since you seem around still (why are you around still?), I guess thanos-fe2004 is an exception to the process at https://wikitech.wikimedia.org/wiki/Swift/How_To#Add_a_proxy_node_to_the_cluster?
[16:55:41] I'm only here for another 5 minutes :) I've never added thanos nodes before; I'm not sure if go.dog wants to do it since thanos is their service, or if they'd like to add some docs somewhere... I'll drop them a line on Monday
[16:56:39] I think the medium-term plan is to have separate nodes running thanos, and then the current thanos-{fe,be} nodes can be Just Swift and it'll make things a bit simpler
[16:56:46] that host is `role(insetup::observability)`, which had me wondering if they didn't intend to handle it
[16:56:54] quite possibly :)
[16:57:38] anyway, I'll start with the ms-fe nodes and regroup on the thanos one afterward
[16:58:54] cool, thanks.
[16:59:16] I've made some headway on my ghost-removal cookbook, but this week's had quite a lot of distractions
[16:59:26] (==latest euphemism for 🔥 )
[17:39:11] At $DAYJOB-1 adding a new cluster node included the step "Run puppet until things stop changing." because we had managed to build a whole suite of circular dependencies into our manifests.
[17:41:26] bd808: that’s awful
[17:44:38] RhinosF1: it wasn't ideal. I tried to help un-fsck some of it at the time, but I really didn't grok how puppet's resolver worked then, so I didn't make a lot of progress. The SREish folks who had built it seemed mostly happy with their mess, so :shrug:.
[17:45:26] bd808: I mean, we have a system at my current job which has 3 servers, but if one goes down none of the others can work at all
[17:45:37] Also it’s dual-powered, but not redundantly
[17:45:57] Sadly the fans aren’t dual-powered, only the server
[17:47:53] Oh, and all the network and management stuff is only done from 1 server
[17:48:06] So if 1 server fails, you can’t boot or reboot any server
[17:48:12] Because the reboot command doesn’t work
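
A minimal sketch of the "run puppet until things stop changing" idea discussed above, assuming only the standard `puppet agent --test` exit-code semantics (0 = no changes, 2 = changes applied, 4/6 = failures). This is not an actual WMF cookbook or part of the Swift how-to; the script, its MAX_RUNS cap, and the function name are hypothetical and purely illustrative.

#!/usr/bin/env python3
"""Illustrative convergence loop: re-run the Puppet agent until it
reports no changes, which is what "run puppet until things stop
changing" amounts to when manifests take several runs to settle."""
import subprocess
import sys

MAX_RUNS = 5  # arbitrary safety cap, not taken from the discussion above


def run_puppet_until_converged() -> None:
    for attempt in range(1, MAX_RUNS + 1):
        # --test implies --detailed-exitcodes (among other flags), so the
        # return code tells us whether this run changed anything.
        result = subprocess.run(["puppet", "agent", "--test"])
        if result.returncode == 0:
            print(f"converged after {attempt} run(s): no resources changed")
            return
        if result.returncode == 2:
            print(f"run {attempt}: changes applied, running again")
            continue
        sys.exit(f"run {attempt}: puppet failed (exit {result.returncode})")
    sys.exit(f"still changing after {MAX_RUNS} runs; check for ordering or circular-dependency issues")


if __name__ == "__main__":
    run_puppet_until_converged()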