[06:16:41] <icinga-wm>	 PROBLEM - MariaDB sustained replica lag on s1 on db1186 is CRITICAL: 344 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[06:27:37] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB sustained replica lag on s1 on db1186 is CRITICAL: 944 ge 2 Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[08:25:10] <icinga-wm>	 RECOVERY - MariaDB sustained replica lag on s1 on db1186 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[14:39:51] <urandom>	 Emperor: I guess you saw godo.g's email about SystemdUnitFailed?
[14:40:12] <urandom>	 that's probably can make those rclone failures more...visible
[14:40:23] <urandom>	 err.. s/can/will/
[14:40:29] <urandom>	 or... damn
[14:40:33] <Emperor>	 urandom: no...?
[14:40:45] <urandom>	 ops list
[14:40:58] <urandom>	 they're being made critical (from warning)
[14:41:21] <urandom>	 (the alerts, that is)
[14:41:49] <Emperor>	 Oh, that's a bit of a bore
[14:42:46] <urandom>	 I clear two or three of those a month, I've made a habit of checking on monday's
[14:43:36] <Emperor>	 We need a less race-condition-ful solution, but there aren't any easy wins there :-/
[14:56:05] <urandom>	 Emperor: if you have the time to spare, https://gerrit.wikimedia.org/r/c/operations/puppet/+/997607 & https://gerrit.wikimedia.org/r/c/operations/puppet/+/997546 could mostly just using some sanity checking
[14:56:49] <urandom>	 btw, the former of those gives you some idea of what a Cassandra node config looks like without restbase in the picture 
[14:56:51] <Emperor>	 I'm in the middle of futzing around with redfish, will get to those in a bit 
[14:56:59] <urandom>	 awesome; thanks
[14:57:37] <urandom>	 not in a hurry; just wanted to nudge while I could still assume you were around
[16:27:46] <topranks>	 folks just wanted to check in regarding the planned switch migration in codfw rack A2 tomorrow 
[16:28:16] <topranks>	 it's got ms-fe2009 and ms-fe2013, notes say swift needs to be depooled in codfw before we proceed?
[16:28:36] <topranks>	 thanos-fe2001 also lives in that rack I see 
[16:37:19] <Emperor>	 topranks: yes, we'll depool all of ms swift in codfw ; and just thanos-fe2001 for thanos swift
[16:37:51] <topranks>	 Emperor: that's great, thanks!
[16:38:04] <Emperor>	 I mean, I assume you'd rather I took care of it :-)
[16:38:40] <topranks>	 that would probably be best for my nerves if possible :)
[16:38:51] <Emperor>	 NP
[16:43:53] <jynus>	 I restarted ms-backups2 processing a few minutes ago
[16:50:37] <Emperor>	 urandom: I don't think there is a spare::system role, so I don't think you can do that...
[16:51:25] <urandom>	 hrmm, I didn't see anyone using it and wondered the same.  But it's what wikitech said to use, and I thought the role did exist
[16:51:34] <urandom>	 I guess I could just use an insetup role instead?
[16:51:51] <Emperor>	 wikitech> where, OOI? 
[16:52:04] <Emperor>	 insetup wouldn't be bad if you're not just going to decommission right away?
[16:52:53] <urandom>	 we are going to decomm it right away
[16:53:15] <urandom>	 I mean, I think it gets removed from site.pp immediately after the cookbook is run... so it's only temporary anyway
[16:53:40] <urandom>	 and, my mistake, it's not wikitech its in the template of the decommission phab template
[16:53:54] <Emperor>	 urandom: in which case, don't bother assigning a role at all (like I didn't for the ms-be nodes), it's easier
[16:54:04] <urandom>	 https://phabricator.wikimedia.org/maniphest/task/edit/form/52/
[16:54:23] <urandom>	 Emperor: you mean, remove it from site.pp altogether?
[16:55:00] <Emperor>	 urandom: exactly so
[16:55:22] <urandom>	 Ok, what happens when puppet runs and those nodes?
[16:55:34] <urandom>	 or do we care, and just run the cookbook immediately?
[16:55:44] <Emperor>	 urandom: broadly, who cares? You're about to decommission them :)
[16:55:52] <urandom>	 Ok
[16:58:41] <volans>	 remove it from site.pp *after* the decom cookbook as run
[16:58:43] <volans>	 not before ;)
[16:59:09] <urandom>	 volans: what do you use before then?
[16:59:23] <volans>	 whatever is the current role of the host no?
[16:59:52] <jinxer-wm>	 (SessionStoreOnNonDedicatedHost) resolved: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[16:59:59] <Emperor>	 volans: huh, I've always done the remove first and then run the decom cookbook
[17:00:23] <Emperor>	 (I rather thought the latter would check for nodes still mentioned in puppet and complain)
[17:01:37] <volans>	 that's what https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_ANY_Opsen says :D
[17:02:14] <urandom>	 ooph
[17:03:24] <Emperor>	 the decom checklist has the other ordering
[17:03:38] <Emperor>	 or at least puts "remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below." before running the cookbook
[17:03:50] <urandom>	 and there are contradictions in the dcomm phab template too
[17:04:18] <urandom>	 oh, right
[17:04:21] <Emperor>	 Which I have possibly incorrectly always parsed as "remove site.pp, THEN (replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.)" IYSWIM
[17:04:48] <Emperor>	 i.e. the replacement with role(spare::system) step is optional if you're going to decom immediately
[17:07:31] <volans>	 I have no power on the phab template
[17:07:35] <volans>	 for that 301 to rob :D
[17:08:14] <volans>	 there is no more spare::system role since long time
[17:23:21] <urandom>	 yeah, using the role it always had results in errors (missing hiera values)
[17:23:40] <urandom>	 but... I guess that's ok 🤷‍♂️