[05:56:17] was db1105 reimaged recently?
[05:56:26] it's been alerting for a few days about the prometheus exporter
[05:56:47] Ah yes, looks like it was done on Thursday?
[05:56:52] Anyways, I fill wix that
[05:56:55] I will fix that
[06:19:54] Ugh I'm sorry. I think I fixed most of them
[06:20:45] nah no worries
[06:27:54] https://wikitech.wikimedia.org/wiki/User:Ladsgroup/sandbox now fully automated, updating every three minutes. Will move it to the main namespace later
[06:28:15] sounds good, post here the final link so I can update the bookmarks
[06:28:25] does it catch mX sections too?
[06:29:03] Not yet. It takes dbctl depool but we don't depool m hosts iirc
[06:29:17] ah it is based on dbctl, got it
[06:29:21] no worries
[07:08:16] hi folks!
[07:08:50] qq - is there a timeline for the Swift MOSS cluster? (https://phabricator.wikimedia.org/T279621)
[07:09:52] I am asking since the ML team is currently using the Thanos swift cluster for storing models, and we'd love to migrate to MOSS at some point
[08:21:21] elukey: short answer is "no"; the slightly longer answer is that I hope to have some time to work on it in Q4, but that is not a timeline of any sort.
[08:23:36] Emperor: ack thanks :)
[09:19:57] I cannot find why db1096:3316 is depooled, anyone working on it at the moment?
[09:20:47] It was depooled a few days ago by myself, but never got repooled. I assume the schema change failed, but if no one is working on it, I will start repooling it
[09:44:45] marostegui: not me, at least
[09:45:02] yep thanks
[09:50:38] Amir1, marostegui: going to run schema change against s7
[09:50:50] sounds good
[09:51:15] Awesome
[10:17:07] Amir1: dumps are running against a host that the script is trying to depool. do i just ignore it and let it run?
[10:19:50] kormat: you can ignore it. If it gets drained before the timeout then the schema change will go thru; if not, the host will be repooled but the schema change won't happen, so you'd need to check once the loop has finished to see if it was applied to all the hosts
[10:20:03] Yup
[10:20:16] Check the logs afterwards for cases that failed
[10:20:35] The timeout is one hour and that's usually enough
[10:22:03] oh.. crap. so the script doesn't error out. i guess that means i need to go look through the logs for all previous runs i've done to check :(
[10:22:17] kormat: yep, you'd need to do so
[10:23:19] is there a useful string to search for in the logs?
[10:24:26] kormat: usually "not applied"
[10:25:15] The logs are stored in the logs directory
[10:25:27] the logs appear to be incomplete
[10:25:30] Worst case it shows up in drift tracker
[10:25:40] e.g. i know that i applied it to a single replica in codfw first, but that's not visible.
[10:25:59] It should be
[10:28:01] Amir1: ~kormat/software/dbtools/auto_schema/logs/T300774.log
[10:28:05] (on cumin1001)
[10:28:51] Dry runs also log. I'm not on my laptop, but grep it
[10:30:41] I actually should make it so that dry runs don't log, to reduce the logspam
[10:33:58] oh. i don't have a way to see what instance i ran it against, because that info is kept in the .py file, which has since been changed multiple times :/
[10:34:25] maybe i happened to mention it on irc? scrolling
[10:34:31] kormat: so maybe on SAL
[10:34:35] kormat: yeah, probably
[10:34:59] Or the ticket
[10:36:46] didn't mention which one on irc, and it's not on the ticket either.
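
The depool-and-drain behaviour Amir1 describes above (10:19) boils down to: depool the host with dbctl, wait up to an hour for client traffic to drain, apply the change if it drains, otherwise repool the host and move on. Below is a minimal Python sketch of that flow, assuming only the one-hour timeout and the "Depool failed" log message from this conversation; the function names, the dbctl invocations and the drain check are illustrative guesses, not the actual auto_schema code.

    # Sketch only: NOT the real auto_schema implementation.
    import subprocess
    import time

    DRAIN_TIMEOUT = 60 * 60  # "The timeout is one hour and that's usually enough"


    def run(cmd):
        # Assumed helper: run a shell command and return its stdout.
        return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout


    def host_is_drained(host):
        # Assumed drain check: no client queries left in the processlist.
        out = run(f"mysql -h {host} -e 'SHOW PROCESSLIST'")
        return "Query" not in out  # crude placeholder, not the real check


    def apply_schema_change(host, sql, log):
        run(f"dbctl instance {host} depool")
        run(f"dbctl config commit -m 'Depooling {host} for schema change'")
        deadline = time.time() + DRAIN_TIMEOUT
        drained = False
        while time.time() < deadline:
            if host_is_drained(host):
                drained = True
                break
            time.sleep(60)
        if drained:
            run(f"mysql -h {host} -e {sql!r}")
        else:
            # Host never drained (e.g. dumps still running): repool and skip.
            # This is what later shows up in the log as "Depool failed for <host>".
            log.write(f"Depool failed for {host}\n")
        run(f"dbctl instance {host} pool")
        run(f"dbctl config commit -m 'Repooling {host} after schema change'")
        return drained

This is why a host that never drains (for example because dumps are running against it) shows up in the log as a "Depool failed" line rather than as a hard error that stops the script.
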
[10:36:54] i'll go look at SAL
[10:37:48] 13:14 kormat@cumin1001: START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance
[10:38:37] Amir1: is there a way to see in the log file if a run was a dry-run?
[10:38:45] looks like i _might_ have done a dry-run first
[10:39:37] Not yet, but it's easy to fix
[10:40:00] The logging logic is encapsulated and tbh I never needed it
[10:40:09] ok, i think that's where the confusion comes from
[10:40:21] the first run (or multiple? very hard to tell) was almost certainly a dry-run
[10:40:35] And we know it's not incomplete, so grepping for "failed" should suffice here
[10:41:09] You can probably see it's a dry run if the timestamps are not moving much
[10:41:27] (Being done quickly)
[10:42:46] grepping for 'failed' gives one hit:
[10:42:51] 2022-02-17 18:06:40.155925 Depool failed for db1110
[10:43:55] There you have it
[10:44:06] Amir1: is that the only possible failure message?
[10:44:37] Yup
[10:44:44] The rest cause a panic
[10:44:54] The script just stops
[10:45:34] "Depool failed" is actually confusing, as it mostly means the host wasn't able to get drained
[10:45:40] at least in my experience
[10:45:41] ok. so what manuel said above about 'not applied' is not correct?
[10:46:05] "Not applied" is when it errors, in which case it would stop
[10:46:27] marostegui: make a patch 😁
[10:46:40] Grep "failed" in the code
[10:46:45] 😈😈
[10:48:15] https://gerrit.wikimedia.org/r/c/operations/software/+/764323/
[10:49:17] filed T302207 too
[10:49:18] T302207: Use logging with levels so that errors are visible - https://phabricator.wikimedia.org/T302207
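
Until something like T302207 gives the script proper log levels, checking a finished run means scanning the per-task log (e.g. ~kormat/software/dbtools/auto_schema/logs/T300774.log on cumin1001) by hand. The sketch below shows what that scan could look like; the log format is assumed from the sample line above ("2022-02-17 18:06:40.155925 Depool failed for db1110"), the failure strings are the ones mentioned in the chat, the dry-run heuristic is Amir1's "timestamps not moving much" observation rather than an existing feature, and the script name is hypothetical.

    # check_auto_schema_log.py -- sketch of an after-the-fact log check.
    import sys
    from datetime import datetime

    # Failure strings mentioned in the conversation above.
    FAILURE_STRINGS = ("Depool failed", "not applied")


    def check_log(path):
        timestamps = []
        failures = []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                try:
                    # Assumed format: "2022-02-17 18:06:40.155925 <message>"
                    timestamps.append(datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f"))
                except ValueError:
                    pass  # line without a leading timestamp
                if any(s in line for s in FAILURE_STRINGS):
                    failures.append(line)

        for line in failures:
            print("FAILURE:", line)

        # Heuristic from the chat: a dry run finishes quickly, so its timestamps
        # barely move; a real run waits for hosts to drain and replicate.
        if len(timestamps) >= 2:
            span = (timestamps[-1] - timestamps[0]).total_seconds()
            verdict = "likely dry-run" if span < 60 else "likely real run"
            print(f"run spanned {span:.0f}s -> {verdict}")


    if __name__ == "__main__":
        check_log(sys.argv[1])

Run as "python3 check_auto_schema_log.py T300774.log"; anything it prints as FAILURE is a host that would otherwise only be caught later by the drift tracker.
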