[06:16:41] PROBLEM - MariaDB sustained replica lag on s1 on db1186 is CRITICAL: 344 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[06:27:37] ACKNOWLEDGEMENT - MariaDB sustained replica lag on s1 on db1186 is CRITICAL: 944 ge 2 Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[08:25:10] RECOVERY - MariaDB sustained replica lag on s1 on db1186 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[14:39:51] Emperor: I guess you saw godo.g's email about SystemdUnitFailed?
[14:40:12] that's probably can make those rclone failures more...visible
[14:40:23] err.. s/can/will/
[14:40:29] or... damn
[14:40:33] urandom: no...?
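The alert text above encodes its thresholds inline: "(C)2 ge (W)1" means the check goes WARNING at a sustained lag of 1 second and CRITICAL at 2. A minimal sketch of that classification logic, assuming nothing about the real check's implementation (the function name and structure here are purely illustrative):

```python
def classify_replica_lag(lag_seconds: float,
                         warn: float = 1.0,
                         crit: float = 2.0) -> str:
    """Map a sustained replication-lag reading to an alert state.

    Thresholds mirror the "(C)2 ge (W)1" text in the alert above;
    this is an illustrative sketch, not the real monitoring check.
    """
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"

# The readings from the log: 344s at 06:16, 944s at acknowledgement, 0s at recovery
print(classify_replica_lag(344))
print(classify_replica_lag(944))
print(classify_replica_lag(0))
```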
[14:40:45] ops list
[14:40:58] they're being made critical (from warning)
[14:41:21] (the alerts, that is)
[14:41:49] Oh, that's a bit of a bore
[14:42:46] I clear two or three of those a month, I've made a habit of checking on Mondays
[14:43:36] We need a less race-condition-ful solution, but there aren't any easy wins there :-/
[14:56:05] Emperor: if you have the time to spare, https://gerrit.wikimedia.org/r/c/operations/puppet/+/997607 & https://gerrit.wikimedia.org/r/c/operations/puppet/+/997546 could mostly just use some sanity checking
[14:56:49] btw, the former of those gives you some idea of what a Cassandra node config looks like without restbase in the picture
[14:56:51] I'm in the middle of futzing around with redfish, will get to those in a bit
[14:56:59] awesome; thanks
[14:57:37] not in a hurry; just wanted to nudge while I could still assume you were around
[16:27:46] folks, just wanted to check in regarding the planned switch migration in codfw rack A2 tomorrow
[16:28:16] it's got ms-fe2009 and ms-fe2013, notes say swift needs to be depooled in codfw before we proceed?
[16:28:36] thanos-fe2001 also lives in that rack, I see
[16:37:19] topranks: yes, we'll depool all of ms swift in codfw; and just thanos-fe2001 for thanos swift
[16:37:51] Emperor: that's great, thanks!
[16:38:04] I mean, I assume you'd rather I took care of it :-)
[16:38:40] that would probably be best for my nerves if possible :)
[16:38:51] NP
[16:43:53] I restarted ms-backups2 processing a few minutes ago
[16:50:37] urandom: I don't think there is a spare::system role, so I don't think you can do that...
[16:51:25] hrmm, I didn't see anyone using it and wondered the same. But it's what wikitech said to use, and I thought the role did exist
[16:51:34] I guess I could just use an insetup role instead?
[16:51:51] wikitech> where, OOI?
[16:52:04] insetup wouldn't be bad if you're not just going to decommission right away?
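The depool plan discussed at 16:37 is asymmetric: an ms-fe frontend in the affected rack means depooling all of ms swift in codfw, while for thanos swift only the frontend physically in the rack is depooled. A sketch of that rule, assuming a hypothetical inventory (the host lists and function are illustrative, not real tooling):

```python
def plan_depool(rack_hosts, all_ms_fe):
    """Return hosts to depool ahead of a switch migration in one rack.

    Illustrative sketch of the plan described in the log, not real
    conftool/depool tooling; `all_ms_fe` is an assumed inventory list.
    """
    targets = set()
    # any ms-fe frontend in the rack -> depool the whole ms swift frontend tier
    if any(h.startswith("ms-fe") for h in rack_hosts):
        targets.update(all_ms_fe)
    # thanos swift: only frontends actually in the rack get depooled
    targets.update(h for h in rack_hosts if h.startswith("thanos-fe"))
    return sorted(targets)

# Rack A2 contents from the log; the full ms-fe list is a made-up example
rack_a2 = ["ms-fe2009", "ms-fe2013", "thanos-fe2001"]
print(plan_depool(rack_a2, all_ms_fe=["ms-fe2009", "ms-fe2010", "ms-fe2013"]))
```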
[16:52:53] we are going to decomm it right away
[16:53:15] I mean, I think it gets removed from site.pp immediately after the cookbook is run... so it's only temporary anyway
[16:53:40] and, my mistake, it's not wikitech, it's in the decommission phab template
[16:53:54] urandom: in which case, don't bother assigning a role at all (like I didn't for the ms-be nodes), it's easier
[16:54:04] https://phabricator.wikimedia.org/maniphest/task/edit/form/52/
[16:54:23] Emperor: you mean, remove it from site.pp altogether?
[16:55:00] urandom: exactly so
[16:55:22] Ok, what happens when puppet runs on those nodes?
[16:55:34] or do we care, and just run the cookbook immediately?
[16:55:44] urandom: broadly, who cares? You're about to decommission them :)
[16:55:52] Ok
[16:58:41] remove it from site.pp *after* the decom cookbook has run
[16:58:43] not before ;)
[16:59:09] volans: what do you use before then?
[16:59:23] whatever is the current role of the host, no?
[16:59:52] (SessionStoreOnNonDedicatedHost) resolved: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[16:59:59] volans: huh, I've always done the remove first and then run the decom cookbook
[17:00:23] (I rather thought the latter would check for nodes still mentioned in puppet and complain)
[17:01:37] that's what https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_ANY_Opsen says :D
[17:02:14] ooph
[17:03:24] the decom checklist has the other ordering
[17:03:38] or at least puts "remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below." before running the cookbook
[17:03:50] and there are contradictions in the decomm phab template too
[17:04:18] oh, right
[17:04:21] Which I have possibly incorrectly always parsed as "remove site.pp, THEN (replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.)" IYSWIM
[17:04:48] i.e. the replacement with role(spare::system) step is optional if you're going to decom immediately
[17:07:31] I have no power over the phab template
[17:07:35] for that, 301 to rob :D
[17:08:14] there's been no spare::system role for a long time
[17:23:21] yeah, using the role it always had results in errors (missing hiera values)
[17:23:40] but... I guess that's ok 🤷‍♂️
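The ordering volans insists on at 16:58 (run the decommission cookbook first, then remove the host from site.pp) can be sketched as a simple ordered checklist. Everything here is illustrative: the step wording and helper are made up for this example, and the only real identifier assumed is the `sre.hosts.decommission` cookbook mentioned on wikitech.

```python
def decommission_steps(host):
    """Ordered decom checklist per the Server_Lifecycle ordering in the log.

    Illustrative sketch only; the step strings are not real commands.
    """
    return [
        f"run the sre.hosts.decommission cookbook for {host}",
        f"remove {host} from site.pp (and any related hiera data)",
    ]

# Hypothetical host name, purely for demonstration
for number, step in enumerate(decommission_steps("examplehost"), start=1):
    print(f"{number}. {step}")
```

Encoding the order in a single list sidesteps the contradiction the log describes between the wikitech page and the phab template: there is exactly one sequence, and the cookbook step always precedes the site.pp cleanup.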