[05:16:58] why do we have two s5 hosts depooled?
[05:17:05] db1096:3315 and db1161
[08:26:30] db1161 was depooled this morning: https://phabricator.wikimedia.org/P25876
[08:27:18] db1096:3315 was depooled yesterday: https://phabricator.wikimedia.org/P25582
[08:27:38] marostegui: best guess is that the schema change was running against db1096 when i killed it yesterday (during dbctl issues)
[08:37:40] good news: most of the db reboots are done. bad news: all the ones left are the painful ones.
[08:43:00] i'd love to have a way to pause a schema change to do other maintenance. as it is, an entire section is blocked for a day, and i end up twiddling my thumbs.
[09:17:39] let's make sure db1096 is repooled during the day then
[09:17:48] so we have all hosts ready for the long weekend
[10:11:38] Amir1: for when you wake up, I am repooling db1109
[10:57:26] Hm, our swift container listings don't contain Etag, and rclone ignores 'hash'
[10:58:16] ok, I guess I have to repool db1096:3315 then
[10:58:39] I've punted https://forum.rclone.org/t/swift-sync-checksum-calls-head-on-every-object-so-is-very-slow/30322 at upstream to see what they think (since they seem quite responsive)
[12:28:04] godog: do you know if we have any chunked objects in ms? cf the upstream question at https://forum.rclone.org/t/swift-sync-checksum-calls-head-on-every-object-so-is-very-slow/30322/2
[12:38:06] (I think from grobbling around inside the swiftrepl code we currently (assume we) don't, BICBW)
[13:34:53] marostegui: hey, sorry, i was afk for a lot longer than expected
[13:36:56] marostegui: right now the full list of depooled hosts is: db1132 (which you're working on), and db1179 (s3)
[13:37:35] yep, db1132 can be ignored
[13:37:41] db1179 I think is coming from Amir1's schema change
[13:38:49] confirmed, yeah re: db1179
[13:41:23] Good morning. Everything is marostegui's fault
[13:43:52] I actually checked all terminated schema changes yesterday and repooled the ones that were not repooled. I must have missed that one
[13:43:56] Thanks
[13:44:44] i have a tiny shell script on cumin1001 to show depooled hosts: `~kormat/bin/list-depooled all`
[13:45:19] kormat: oh nice, can you push it somewhere 🥺
[13:45:29] i did. to cumin1001. :P
[13:45:52] * Amir1 trouts kormat
[13:46:05] btw, which sections do you need it to be stopped in?
[13:47:23] Amir1: for example, i'd like to reboot db1154. it's in: s1/s3/s5/s8.
[13:47:56] hmm, I see
[13:48:01] which is basically impossible to do without stopping schema changes
[13:52:53] okay, I won't do any on Sunday/Monday, it should be easy for you to pick them up
[13:53:06] it's hard to stop a running one
[14:01:19] Amir1: monday is a holiday here
[14:47:31] * Emperor shaves yaks
[15:09:36] Amir1: refreshLinkRecommendations.php has been running against db1120 for >1h since it was depooled
[15:27:15] kormat: I think I created a ticket for that a long time ago
[15:27:18] let me double check
[15:27:28] https://phabricator.wikimedia.org/T299021
[15:27:47] A comment there would be amazing :P
[15:29:02] on it
[15:29:32] Amir1: also, i think we have to establish some guidelines for maintenance scripts. "Either you check every 30mins for a depool, or you don't mind if we kill the connection."
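As a rough sketch of kormat's proposed guideline ("check every 30mins for a depool"), here is hypothetical Python pseudocode for the pattern, not MediaWiki's actual maintenance framework; `is_pooled`, `connect`, `pick_pooled_replica` and `process_batch` are invented names standing in for whatever the real script uses:

```python
import time

RECHECK_SECONDS = 30 * 60  # the proposed guideline: re-check pool state every 30 minutes


def is_pooled(host: str) -> bool:
    # Hypothetical helper: ask the config store (e.g. dbctl/etcd) whether
    # `host` is still pooled. The lookup mechanism is an assumption here.
    return True


def run_long_maintenance(connect, pick_pooled_replica, process_batch, batches):
    # `connect`, `pick_pooled_replica` and `process_batch` are injected
    # callables (all hypothetical) so the sketch stays self-contained.
    host = pick_pooled_replica()
    conn = connect(host)
    last_check = time.monotonic()
    for batch in batches:
        if time.monotonic() - last_check >= RECHECK_SECONDS:
            last_check = time.monotonic()
            if not is_pooled(host):
                # The host was depooled for maintenance: move off it promptly
                # instead of forcing the DBAs to kill the connection.
                conn.close()
                host = pick_pooled_replica()
                conn = connect(host)
        process_batch(conn, batch)
    conn.close()
```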
[15:31:33] the underlying problem is that mw's connection manager is trying to handle both "one-minute-tops webrequest" queries and "let's rewrite half of our database in one run" queries
[15:31:45] and does a terrible job at both
[15:32:00] but since it prioritizes the former, the latter suffers
[15:32:42] I think I can make it work by splitting this and making the script use a different connection manager, but that will take time
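A rough illustration of the split Amir1 describes, as a hedged Python sketch (the function names and shapes are invented for illustration; MediaWiki's actual load-balancer code is PHP and structured differently): the web path resolves a replica once and holds the connection only for one short query, while the maintenance path re-resolves and reconnects between batches so a depool or killed connection costs at most one batch.

```python
from typing import Callable, Iterable


def run_web_query(resolve_replica: Callable[[], str],
                  connect: Callable[[str], object],
                  query: str):
    # Web-request path: pick a replica once, run one short query, disconnect.
    # Topology changes during the request are not a concern at this timescale.
    conn = connect(resolve_replica())
    try:
        return conn.execute(query)
    finally:
        conn.close()


def run_maintenance_batches(resolve_replica: Callable[[], str],
                            connect: Callable[[str], object],
                            batches: Iterable[str]):
    # Maintenance path: re-resolve the replica and reconnect between batches,
    # so a depool only interrupts one batch instead of a multi-hour run.
    for batch in batches:
        conn = connect(resolve_replica())
        try:
            conn.execute(batch)
        finally:
            conn.close()
```

Reconnecting per batch would also play nicely with the 30-minute depool check sketched earlier, since each batch starts from the current pool state.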