[07:29:02] cumin2002 seems currently unused and I'd go ahead with a reboot unless that would cause an issue with DB/backup things?
[07:31:40] +1 from my side
[07:31:43] jynus: what about from your side?
[07:31:59] go ahead
[07:32:19] thanks, going ahead now
[07:37:51] preparing s3 backups on codfw is taking 3h 3m, while on eqiad only 1h 32m
[07:38:01] I have to check why the difference
[07:50:57] Some day I will not be confused by s3 meaning something other than Simple Storage Service.
[08:32:45] Emperor: re: ms-fe1012, it was left depooled and swift-proxy not restarted for inspection, are you interested in poking at it or should I wrap that up (i.e. restart swift-proxy and repool)?
[08:39:43] Amir1: I guess a leftover from yesterday was db1131 (s6 master) still pooled in API
[08:39:45] I am removing it
[08:42:18] marostegui: I'm confused, yesterday we did s1
[08:42:37] then from another day or something
[08:42:40] s2 had the same issue
[08:42:43] I am reviewing all of them
[08:42:51] oh yeah, thanks
[08:43:32] Maybe we should avoid pooling candidates and masters in groups? We probably have enough replicas
[08:43:38] to avoid this in the future
[08:43:50] Sorry about it, I missed it
[08:43:55] There's a check on the switchover checklist to make sure the master isn't pooled in any groups I believe
[08:49:12] Amir1: Any idea where this error is coming from? https://logstash.wikimedia.org/goto/d74312d618067920ce2086bc81524f81 1.3M errors in 24h is a lot!
[08:49:48] sigh, this is not alerting
[08:49:55] I'll get to it
[08:50:20] <3
[08:52:00] So it's RCLinked, it's warning only so logspam mostly
[09:11:08] it's not just RCLinked, it's all RCs
[09:15:08] would tomorrow at 7 UTC / 9 UTC work for a cumin1001 reboot from the perspective of DB/backup things?
[09:16:11] We have a switchover at 06:00 UTC I think?
[09:16:44] I think Thursdays are a busy day for backups
[09:16:48] Or it used to be?
[09:16:50] jynus: ^
[09:17:03] I have an s4 schema change that will take a while :/
[09:17:49] godog: please go ahead, I don't think I have much spare brain this week
[09:18:02] Thursday is busy for backup* hosts, not for other hosts
[09:18:23] let me see when snapshots on eqiad finish
[09:19:17] ack, I could also do it later with more distance to the failover (backups permitting), but if it gets too late in the day too many people start logging in again :-)
[09:19:59] snapshots should finish for cumin1001 by 7:30 UTC
[09:21:33] our stuff should spawn a container in k8s but I should stop daydreaming
[09:22:35] anyway, I won't start the long schema change on s7 then, but the s4 and s1 ones are running and will take maybe a full week. I don't know how much progress they made
[09:25:43] Emperor: ack
[09:30:20] is everyone fine with a cumin1001 reboot at 8:00 UTC tomorrow, then? that's two hours after the switchover (and if there is any issue caused by the switchover, we'd simply postpone anyway)
[09:30:51] ok by me
[09:31:48] moritzm: the schema changes I'm talking about are happening on cumin1001, sorry if I wasn't clear
[09:32:18] I'm trying to see how to pause/stop them, the problem is that I need to wait for a 12-hour alter to finish and stop it right before it starts the next one
[09:33:15] moritzm: it is probably easier to do it on Monday per the above schema change
[09:34:53] sorry, I misunderstood! no need to pause. do we expect it to be finished by Monday morning?
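The pause problem described above (each per-host ALTER runs for ~12 hours, so the only safe stopping point is right before the next host starts) could be handled with a stop-flag check between hosts. This is only a sketch under assumptions: the flag path, `run_alter()` and the loop shape are illustrative placeholders, not auto_schema's actual interface.

```python
# Hypothetical sketch only: auto_schema's real internals differ. STOP_FLAG,
# run_alter() and this loop are illustrative placeholders.
import logging
import os
import sys

STOP_FLAG = "/var/run/auto_schema.stop"  # assumed path; touch it to request a stop


def run_schema_change(hosts, run_alter):
    """Run the ALTER host by host, checking for a stop request between hosts.

    The per-host ALTER may take 12+ hours, so the only safe pause point is
    right before work on the next host begins.
    """
    for host in hosts:
        if os.path.exists(STOP_FLAG):
            logging.info("Stop flag %s found, halting before %s", STOP_FLAG, host)
            sys.exit(0)
        logging.info("Starting %s; safe to request a stop for the next host", host)
        run_alter(host)  # depool, ALTER TABLE, wait for replication, repool
```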
[09:35:16] for s1, I'm fairly certain, the s4 one is a bit tricky
[09:35:26] let me run a check to see how many hosts are left
[09:37:03] 14 hosts still don't have the schema: ["db1150:3314", "db1190", "db1147", "db1141", "db1142", "db1148", "db1149", "db1121", "db1144:3314", "db1145:3314", "db2139:3314", "db2106", "db2179", "db2137:3314"]
[09:37:18] each schema change will take 12-15 hours
[09:38:30] marostegui: I have an idea for the future, auto_schema should do two replicas in parallel, one in codfw and one in eqiad, we don't want to depool too many DBs in one DC but there is nothing wrong with depooling one from each DC
[09:38:47] that basically brings the time to run a schema change back to the original time
[09:39:29] anyway, that's for next week
[09:39:48] another thing that could be done is some kind of locking mechanism to stop a change, both globally and per host, that backups and DBs follow
[09:42:02] e.g. if I create a file called /var/lock/db.lock, backups and alter tables should pause or something
[09:42:34] ok, that doesn't sound like it ends by Monday, I'll recheck on Monday what's left and then we can pick a date
[09:46:52] Amir1: for a long-running cookbook of the search team, what they did when I asked to be able to "pause" it was to have some sleep in between the hosts it loops over and a clear log message that tells when it's safe to pause it. I guess for schema changes that would not work because of the long duration of a single host vs any sleep you could add
[09:46:52] moritzm: thanks. Sorry for this :/
[09:47:15] so in this case yes, it would be better to be able to read something before going to the next one
[09:47:38] and in case that something blocks you, then poll it until it clears to resume
[09:47:48] yeah, let me think of something
[09:48:06] (the cookbook above can then be ctrl+c-ed, and when restarted it resumes from where it left off, just to give full context)
[09:48:54] np at all :-)
[09:49:15] and for reboots you might need something similar, as you'll have to know from where to restart, or have to poll all DBs to check which ones are done and which are not :D
[09:49:50] for reboots it has a minimum kernel version
[09:50:04] which skips hosts that already satisfy it
[09:50:14] but also reboots take five to ten minutes
[09:50:25] the issue is on alters that take 12 hours
[09:50:41] otoh, with improving tables, such large alters become less common
[09:50:55] anyway, I have two things on my todo list now
[10:03:30] would you like some more? :)
[10:06:42] sorry, "two more"
[10:06:55] * Amir1 looks at the list, it's already 52 for today
[10:48:19] marostegui: found the reason behind all the warnings, made the patch. Let me see when I can get it reviewed and deployed
[10:48:45] oh thanks :)
[10:48:46] btw, it's not breaking anything
[13:50:47] moritzm: good news, the s4 one broke. For the s1, I can make sure it stops before tomorrow
[13:51:49] for the future, I will make sure you have a graceful shutdown method
[13:55:13] Amir1: can you review the depooled hosts and check which ones can be repooled again?
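The /var/lock/db.lock idea discussed above could look roughly like the helper below: something that backups, schema changes and reboot cookbooks call before starting work on each host, polling until the lock clears, matching the "poll it until it clears to resume" behaviour described. A minimal sketch, assuming hypothetical paths, polling interval and function names; no such shared interface exists today.

```python
# Illustrative sketch of the lock-file idea; the per-host path and the polling
# interval are assumptions, not an existing interface of auto_schema or the
# backup scripts.
import logging
import os
import time

GLOBAL_LOCK = "/var/lock/db.lock"  # pause all DB maintenance while this exists


def per_host_lock(host):
    # Assumed naming scheme for a per-host lock, e.g. /var/lock/db.db1150.lock
    return f"/var/lock/db.{host}.lock"


def wait_for_locks(host, poll_seconds=60):
    """Block before starting the next ALTER, backup or reboot on a host
    while a global or per-host lock file exists."""
    for lock in (GLOBAL_LOCK, per_host_lock(host)):
        while os.path.exists(lock):
            logging.info("Lock %s present, waiting before touching %s", lock, host)
            time.sleep(poll_seconds)


# A consumer (backup run, schema change, reboot cookbook) would call
# wait_for_locks(host) right before starting work on each host.
```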
[13:55:35] not needed now, but let's try to do it before the weekend
[13:55:40] marostegui: sure thing, the only one I'm handling right now is db2140
[13:56:05] there are some from s1 there
[13:56:18] and I'm not sure which ones are work in progress
[13:56:25] db1127 can be left alone (10.6)
[13:56:33] there is one already happening
[13:59:24] https://vettabase.com/blog/a-summary-of-mariadb-10-9-vault-integration-innodb_log_file_size-and-more/
[14:03:17] Amir1: so we have db1169, db1118 from s1
[14:03:41] db1169 is mine
[14:03:50] PK fixes of templatelinks, ongoing maint
[14:03:52] db1118 is the old s1 master I think?
[14:04:12] ah, yup, I think I forgot to repool it, the schema change took a day
[14:04:19] * Amir1 has the memory of a goldfish
[14:04:20] I can take care of that
[14:04:26] db1173 from s6?
[14:04:29] sorry, thanks <3
[14:04:47] I need to check that
[14:05:02] sure, no need to do it now
[14:05:09] but let's remember to clean that list before the weekend
[14:05:41] I don't have anything in s6 to my knowledge
[14:06:01] so db1173 can be repooled?
[14:06:45] the only thing I'm seeing in SAL is that my script saw it's depooled and just downtimed it
[14:06:52] didn't do anything further
[14:07:01] unless it's the old master, then my bad
[14:07:14] it is probably the old master yep
[14:07:19] because it is the candidate master now
[14:07:28] I will go ahead and repool it
[14:07:34] assuming all the schema changes happened already
[14:07:36] then the same issue, forgot to repool it after the switchover
[14:07:39] yeah
[14:08:02] ok, repooling
[14:08:20] thanks
[14:29:21] so should I send a mail for tomorrow at 8 UTC, then?
[14:44:56] marostegui: yes!
[14:45:24] moritzm: ^ XD
[14:45:37] ugh, sorry my bad
[14:47:39] excellent, will do that in a bit