[06:46:17] I'm looking for a window to reboot cumin1001; when would be a suitable time for DBAs/backups?
[06:54:25] Right now I am automatically repooling a host
[06:54:33] It should be done in a couple of hours
[06:54:46] I am not sure about Amir1's schema change though
[07:00:09] ack, let's wait for Amir, I'm flexible, that can also wait a few more days or until next week
[07:13:06] yeah, that schema change he's doing is quite big, so it might take a few days for him to finish the current iteration of s3 and s5
[10:17:48] moritzm: hi, yeah, these extlinks schema changes are a pain
[10:18:17] possibly in two days? Would Friday sound good? I know it's not the best time
[10:21:54] I'm fine with Friday!
[10:22:11] thanks!
[10:22:18] even if for some reason cumin1001 didn't come back, we'd still have cumin2001 until DC ops look into it
[10:22:38] I'll check with you on Friday whether your job has completed
[10:22:42] yeah, makes sense
[10:22:48] awesome. thanks.
[10:24:52] not for this time, but as a general question: wouldn't it be great if long-running scripts could be stopped at a safe time and resumed (potentially elsewhere)? For example, some long-running cookbooks for search could log explicitly when they are sleeping between hosts and it's safe to interrupt them, and then they could be restarted and resume from where they left off (I think because they check the status of the cluster, I don't recall the details).
[10:25:04] s/question/suggestion
[10:29:20] I think we had discussions on making auto schema try to read a file like "STOP", and if it exists, it would avoid moving on to the next replica. That still means it might take a day for the specific alter to finish on that replica (yup, we had cases of that), but it's still better than waiting for the whole section to finish
[10:29:46] I think I never got around to implementing it. Funnily enough, it's not hard. It's just the sea of stuff I need to do
[10:41:49] if someone creates a ticket, at least I will keep it tracked
[10:43:16] the file should definitely contain the text "hammer time"
[10:43:55] haha
[11:50:50] omg, the wmf_checksums tables are at least on the order of terabytes, not 300 GB. cumin grouped identical results, so many of them need to be multiplied by the number of hosts in the group
[11:51:50] e.g. I counted 10 GB for s6, but that's actually 100 GB
[11:57:46] marostegui: I'm seeing a couple of .bak and .old full directories when grepping for wmf_checksu, e.g. this
[11:57:48] https://www.irccloud.com/pastebin/yjZovxNu/
[11:57:53] can I rm -rf them?
[12:03:37] yeah, they must be old too
[12:03:38] I guess?
[12:04:52] for db2141, it was last edited on July 27
[12:05:04] is that a backup source?
[12:05:07] that seems recent
[12:05:19] but I don't see any reason to keep it, unless it is something jynus might have been working on
[12:05:37] yeah
[12:05:46] we have a couple of .old directories too
[12:05:59] e.g. on db1145
[12:06:32] this one was last edited on Jun 28
[12:06:59] I created those while moving the data dirs
[12:07:24] they can be removed now that backups are proven to work well
[12:09:46] hammer time!
[12:13:00] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1145&var-datasource=thanos&var-cluster=wmcs&from=1690977329388&to=1690978372960&viewPanel=28
[12:13:01] weeee
[12:16:38] there were two sections, s4 and s5, on db1150, last edited on March 31. Deleted them too
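To make the STOP-file idea from [10:29:20] above concrete, here is a minimal sketch of how a replica-by-replica schema-change loop could check for such a file between hosts. The file path, function names, and sleep interval are assumptions for illustration; this is not the actual auto_schema implementation.

```python
import os
import time

# Hypothetical location of the stop marker; the real name and path would be
# whatever auto_schema decides to use.
STOP_FILE = "/tmp/schema_change_STOP"


def stop_requested() -> bool:
    """Return True if an operator has asked the run to pause.

    The alter already running on the current replica is left to finish;
    we only refuse to move on to the next one.
    """
    return os.path.exists(STOP_FILE)


def run_schema_change(replicas, alter_on_replica, pause_seconds=5):
    """Apply the alter replica by replica, checking for the stop file in between."""
    for replica in replicas:
        if stop_requested():
            print(f"{STOP_FILE} exists ('hammer time'); not proceeding to {replica}. "
                  "Remove the file and re-run to resume from this host.")
            return
        alter_on_replica(replica)
        # The sleep between hosts is also a natural safe point to interrupt.
        time.sleep(pause_seconds)
```

Resuming is then just a matter of re-running against the replicas that have not been altered yet, which is the same property the search cookbooks mentioned at [10:24:52] would need.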
[12:23:48] Last snapshot for s3 at codfw (db2139) taken on 2023-08-02 06:21:00 is 1235 GiB, but the previous one was 1302 GiB, a change of -5.2 %
[12:24:50] I did some stuff :D
[12:25:08] I think it might be the externallinks optimization
[12:25:40] (a literal OPTIMIZE TABLE; the drops are not done yet)
[12:26:22] and I dropped ops.__wmf_checksums a couple of days ago on s3
[12:28:35] jynus: would you mind quickly checking which table had the biggest change?
[12:28:41] I know there was a query
[12:28:45] sure
[12:28:53] Thanks <3
[12:29:14] also dump.es5.2023-08-01--00-00-05 failed!
[12:29:22] es2025?
[12:29:30] yes
[12:29:37] https://phabricator.wikimedia.org/T343254
[12:30:14] ok
[12:31:10] wanna restart it later?
[12:32:10] I'll see; as it takes 2 days to run, I may wait until Bacula runs on Thursday
[13:04:00] Amir1: https://phabricator.wikimedia.org/P49966 (sorry, got distracted with something else)
[13:04:17] no rush. Thanks
[13:04:51] it's the optimize; we made proto-relative URLs take one row instead of two, which removed a lot of rows from lots of wikis
[13:05:23] the actual fun ones are yet to come :P
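The snapshot message quoted at [12:23:48] is an automated size comparison between consecutive backups. As a rough illustration of that arithmetic only (the threshold, function names, and rounding are assumptions, not the real backup-check code), the change is computed relative to the previous snapshot:

```python
def size_change_percent(current_gib: float, previous_gib: float) -> float:
    """Relative size change between two consecutive snapshots, in percent."""
    return (current_gib - previous_gib) / previous_gib * 100


def check_snapshot(section: str, host: str, current_gib: float, previous_gib: float,
                   warn_threshold: float = 5.0) -> None:
    """Report the change when it exceeds the (assumed) warning threshold."""
    change = size_change_percent(current_gib, previous_gib)
    if abs(change) >= warn_threshold:
        print(f"Last snapshot for {section} ({host}) is {current_gib:.0f} GiB, "
              f"but the previous one was {previous_gib:.0f} GiB, "
              f"a change of {change:+.1f} %")


# With the rounded GiB figures from the message above (the production alert
# presumably works from byte counts, hence the slightly different percentage):
check_snapshot("s3", "db2139", 1235, 1302)  # prints a change of about -5.1 %
```

The follow-up at [13:04:51] confirms the drop came from the externallinks optimize rather than from data loss.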