[01:09:26] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 10.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:18] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[16:33:08] * urandom > Run puppet on the host twice, reboot, and run puppet again
[16:33:12] LOL
[16:33:35] (via https://wikitech.wikimedia.org/wiki/Swift/How_To#Add_a_proxy_node_to_the_cluster)
[16:35:31] I'm a bit Chesterton's fence about that bit of the instructions :-/
[16:39:11] Step N: light a black candle
[16:40:17] also, TIL: Chesterton's fence
[16:41:26] but yeah, I will absolutely run puppet twice, reboot, and then run it again.... as absurd as it seems, I'm sure there is a reason
[16:42:16] we could ask go.dog about it some time, but that might be unkind at this time on a Friday [Europe] afternoon
[16:42:19] :)
[16:50:10] Emperor: since you seem around still (why are you around still?), I guess thanos-fe2004 is an exception to the process at https://wikitech.wikimedia.org/wiki/Swift/How_To#Add_a_proxy_node_to_the_cluster?
[16:55:41] I'm only here for another 5 minutes :) I've never added thanos nodes before; I'm not sure if go.dog wants to do it since thanos is their service, or if they'd like to add some docs somewhere... I'll drop them a line on Monday
[16:56:39] I think the medium-term plan is to have separate nodes running thanos, and then the current thanos-{fe,be} nodes can be Just Swift and it'll make things a bit simpler
[16:56:46] that host is `role(insetup::observability)`, which had me wondering if they didn't intend to handle it
[16:56:54] quite possibly :)
[16:57:38] anyway, I'll start with the ms-fe nodes and regroup on the thanos one afterward
[16:58:54] cool, thanks.
[16:59:16] I've made some headway on my ghost-removal cookbook, but this week's had quite a lot of distractions
[16:59:26] (==latest euphemism for 🔥 )
[17:39:11] At $DAYJOB-1 adding a new cluster node included the step "Run puppet until things stop changing." because we had managed to build a whole suite of circular dependencies into our manifests.
[17:41:26] bd808: that’s awful
[17:44:38] RhinosF1: it wasn't ideal. I tried to help un-fsck some of it at the time, but I really didn't grok how puppet's resolver worked then, so I didn't make a lot of progress. The SREish folks who had built it seemed mostly happy with their mess, so :shrug:.
[17:45:26] bd808: I mean, we have a system at my current job which has 3 servers, but if one goes down none of the others can work at all
[17:45:37] Also it’s dual-powered, but not redundantly
[17:45:57] Sadly the fans aren’t dual-powered, only the server
[17:47:53] Oh, and all the network and management stuff is only done from 1 server
[17:48:06] So if 1 server fails, you can’t boot or reboot any server
[17:48:12] Because the reboot command doesn’t work
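
A minimal sketch of the "run puppet until things stop changing" idea discussed above, assuming only the standard `puppet agent --test` exit-code semantics (0 = no changes, 2 = changes applied, 4/6 = failures). This is not an actual WMF cookbook or part of the Swift how-to; the script, its MAX_RUNS cap, and the function name are hypothetical and purely illustrative.

#!/usr/bin/env python3
"""Illustrative convergence loop: re-run the Puppet agent until it
reports no changes, which is what "run puppet until things stop
changing" amounts to when manifests take several runs to settle."""
import subprocess
import sys

MAX_RUNS = 5  # arbitrary safety cap, not taken from the discussion above


def run_puppet_until_converged() -> None:
    for attempt in range(1, MAX_RUNS + 1):
        # --test implies --detailed-exitcodes (among other flags), so the
        # return code tells us whether this run changed anything.
        result = subprocess.run(["puppet", "agent", "--test"])
        if result.returncode == 0:
            print(f"converged after {attempt} run(s): no resources changed")
            return
        if result.returncode == 2:
            print(f"run {attempt}: changes applied, running again")
            continue
        sys.exit(f"run {attempt}: puppet failed (exit {result.returncode})")
    sys.exit(f"still changing after {MAX_RUNS} runs; check for ordering or circular-dependency issues")


if __name__ == "__main__":
    run_puppet_until_converged()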