[08:21:28] sal is down? I have a 404 on it, that doesn't look like "down" though ^^
[08:24:06] don't see any possibly related messages on -operations
[08:24:26] up again?
[08:25:41] fabfur: ah I missed it! not up from my pov I'm afraid
[08:26:16] https://wikitech.wikimedia.org/wiki/Server_Admin_Log I see latest updates
[08:27:33] arnaudb: are you talking about the SAL on wikitech, or the elasticsearch powered mirror on toolforge?
[08:28:54] https://sal.toolforge.org/ → that one on toolforge, I forgot about the wikitech one!
[08:30:03] fyi @bd808 looks like the webservice is down
[10:00:13] arnaudb p858snake|cloud fabfur I just restarted the SAL webservice on toolforge
[10:00:22] thanks arturo !
[10:00:28] thanks!
[10:12:24] np
[13:46:43] hashar: that was quick, thx!
[13:52:34] general heads-up if users report any issues related to video, commons (ie almost all of) videoscaling has been moved back to shellbox
[14:10:22] revi: sometimes yes, I merge the integration/config as I witness them :]
[14:10:52] I was pretty lucky then xD
[15:48:15] topranks: are you around?
[15:48:52] I tried https://netbox.wikimedia.org/extras/scripts/4/ (for the first time), and accidentally the whole thing
[15:49:03] urandom: I’m at a conference but during a break right now
[15:49:07] what’s up?
[15:49:28] whoops, don't worry, we can get it sorted
[15:49:39] it picked IPs from the parent subnet it seems like
[15:49:54] and didn't add domain names
[15:50:02] (which seems to be what the cookbook was mad about)
[15:50:20] https://phabricator.wikimedia.org/T378730
[15:50:29] dns cookbook, probably, yeah. I'm surprised we don't have netbox validators for that (or they didn't kick in)
[15:50:42] uh ok sorry bout that hmm
[15:51:04] definitely had tested it but must have made some error
[15:51:25] it took quite a while before we were able to use it in anger :)
[15:52:20] what’s the urgency on it? I can take a proper look tomorrow morning eu time if it can wait
[15:52:59] I don't know, I guess it prevents the dns cookbook from running atm? Beyond that the urgency is low.
[15:53:04] it should start at ‘a’ alright, maybe an error there
[15:53:09] the root of the issue is that the primary IP of aqs1022 is a /12
[15:53:11] https://netbox.wikimedia.org/ipam/ip-addresses/18146/
[15:54:48] XioNoX: if you can delete the dns names or IPs for now to unblock the dns cookbook I’ll do some debugging in the morning and get it fixed up
[15:55:07] yeah don't worry
[15:55:20] thanks <3
[15:57:17] urandom: how many extra IPs do you need?
[15:57:24] XioNoX: two
[15:59:30] urandom: cool, you can try the dns cookbook now
[15:59:42] trying...
[16:04:35] Ok, I think that part succeeded, and that I'm now presented with a diff from someone else's change?
[16:04:44] the removal of a ganeti host?
[16:04:49] what's the convention here?
[16:04:56] go/abort?
[16:05:16] we check SAL for which host it is usually
[16:05:20] can you tell us which one it is?
[16:05:26] ganeti1044.mgmt.eqiad.wmnet
[16:05:27] urandom: send the diff here
[16:05:49] https://www.irccloud.com/pastebin/BN4v8mHZ/
[16:06:18] urandom: probably because of this: https://netbox.wikimedia.org/extras/changelog/195033/
[16:06:58] auh, yeah
[16:07:01] urandom: anyway, it's fine to accept
[16:07:18] TIL
[16:07:21] XioNoX: \o/
[16:07:24] thank you!
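(For context on the breakage above: the script recorded the host's primary IP with a too-broad mask, a /12 where the host's subnet uses a much narrower prefix, which is what the DNS cookbook then choked on. Below is a minimal sketch of the kind of prefix-length sanity check a Netbox validator could perform; it uses only the Python standard library, and the addresses and prefixes in it are invented examples, not the real production values.)

```python
# A minimal sketch (not Wikimedia's actual validator) of a prefix-length
# sanity check: flag an address whose recorded mask doesn't match the
# subnet it was allocated from. All values below are hypothetical.
import ipaddress

def mask_matches_subnet(ip_with_mask: str, subnet: str) -> bool:
    """Return True if the address falls inside `subnet` and carries the
    same prefix length that the subnet itself uses."""
    iface = ipaddress.ip_interface(ip_with_mask)
    net = ipaddress.ip_network(subnet)
    return iface.ip in net and iface.network.prefixlen == net.prefixlen

# A /12 recorded where the containing subnet is a /22 would be rejected.
print(mask_matches_subnet("10.64.128.10/12", "10.64.128.0/22"))  # False
print(mask_matches_subnet("10.64.128.10/22", "10.64.128.0/22"))  # True
```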
[16:07:47] urandom: so there is a deeper problem, which I don't get https://netbox.wikimedia.org/extras/scripts/results/110137/
[16:08:09] it tries to add /12 and /56 IPs
[16:08:23] while it's nowhere on the host nor in puppetdb
[16:09:28] yeah, that's weird. It's supposed to be /22 and /64?
[16:09:32] yeah
[16:09:40] I manually fixed it
[16:09:57] but it will bite us again
[16:10:03] yeah
[16:11:05] anyway, I'll investigate
[17:24:43] I have to step away for today and tomorrow is a holiday in France.
[17:25:10] topranks, I put all my findings on https://phabricator.wikimedia.org/T378751 if you have time to poke at it tomorrow
[17:25:43] urandom: watch out if you need to do anything similar for any of those 3 types of hosts defined in the task as `NO_VIP_RE`
[17:26:39] XioNoX: ok
[17:27:03] XioNoX: thanks. Yeah I definitely did not consider that or use those in my tests
[17:27:30] topranks: your script to add extra IP works fine if not everything around it is broken :)
[17:27:40] I’m sure it won’t be difficult to fix, I’ll have a look tomorrow; going from IP -> vlan -> subnet also strikes me as a way to tackle it
[18:21:10] just to note I have a diff for recent mx changes in external_services for admin_ng in deployment-charts, I'm just gonna roll ahead with them. they seem reasonable
[18:23:13] hnowlan: it would probably be a decent idea for us to auto-deploy external_services tbh
[18:24:03] yeah true
[18:34:58] cdanis: long story short we moved to using shellbox-video for videoscaling earlier. it was going fine but over time long-running jobs have run us out of capacity
[18:35:14] ahh
[18:35:16] given that the pods running jobs aren't ready, we can't just do an apply to scale up
[18:35:53] hm that seems like a problem we have to fix eventually heh
[18:35:56] at this point it's looking like rolling back might just be the best course of action
[18:35:57] can we kubectl scale the deployment for now?
[18:36:08] and then modify the values in helm to match?
[18:36:10] yeah we could just edit the replicaset to add more replicas I *think*
[18:36:20] you need to edit the deployment, the deployment owns the replicaset
[18:36:31] ahh right
[18:36:43] I made that mistake during a different mw-k8s outage ;)
[18:36:45] umm
[18:37:20] at this point I am considering just rolling back the mediawiki-config change
[18:37:32] that's reasonable
[18:37:42] I think it would also be fine to kubectl scale the deployment + merge a similar change to helm values
[18:38:03] we've already bumped replicas by 16, but that apply is going to eventually fail
[18:38:49] I'm leaning towards the latter
[18:38:51] can we just make maxUnavailable 100% ?
[18:39:28] I'm not entirely sure that applies to _exiting_ an update, is the problem
[18:40:01] i.e., I'm not sure if it only applies to _progressing_ an update
[18:41:21] (I realize that has implications for how aggressive the update is when scaling the replicasets)
[18:43:10] but also, +1 to rollback maybe being the right strategy if this aspect needs a bit of a rethink in order to be confident it's stable
[18:43:36] yeah, I think that's safest
[18:43:40] is the apply still running?
[18:43:45] It's getting late here and I'm fighting trick-or-treaters
[18:43:47] cdanis: yeah
[18:44:56] yeah, maybe that's best then
[18:45:06] although, would it be enough to run helm without `atomic`?
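(As an aside, here is a rough sketch of the "kubectl scale the deployment now, merge a matching change to helm values after" option floated at 18:35–18:38, written with the Python kubernetes client rather than kubectl. The deployment name, namespace, and replica count are placeholders, not the real production values; the point is only that the patch targets the Deployment, not the ReplicaSet it owns.)

```python
# Hedged sketch: scale a Deployment directly via the kubernetes API.
# Names, namespace and replica count below are hypothetical.
from kubernetes import client, config

config.load_kube_config()   # or load_incluster_config() when run inside a pod
apps = client.AppsV1Api()

# Patch the Deployment's scale subresource; editing the ReplicaSet by hand
# would just be reverted by the Deployment controller.
apps.patch_namespaced_deployment_scale(
    name="shellbox-video",           # hypothetical deployment name
    namespace="shellbox-video",      # hypothetical namespace
    body={"spec": {"replicas": 32}}, # hypothetical target replica count
)
```

The same replica bump would then be merged into the chart's helm values so the next apply doesn't undo it.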
[18:45:43] I dunno, there's no guarantee that they won't just get overwhelmed after we scale up
[18:45:53] oh after you scale up the concurrency
[18:45:55] sure
[18:46:01] ok I'm +1 for rollback
[18:46:33] hnowlan: if you need to go, we can take care of the rollback / backport
[18:47:39] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1085464
[18:48:43] swfrench-wmf: if you could sync that it would be great, I kinda need to move in the next few minutes but I will be back in ~30 minutes or so
[18:49:08] +1'd and yeah, I can drive that
[18:49:17] <_joe_> I'm -1 on rolling back
[18:49:23] <_joe_> let me explain why
[18:49:43] <_joe_> if the load is unusually high, k8s has some way to cope with it for us, if uneasy
[18:49:47] <_joe_> baremetal does not
[18:50:23] <_joe_> but YMMV
[18:51:05] it *is* just a matter of numbers in this case
[18:51:15] you can run more jobs on the metal videoscalers
[18:51:23] (and we don't page for them, but we do for shellbox-video)
[18:52:10] <_joe_> we can rollback and see if it kills videoscalers :)
[18:52:10] I agree with _joe_
[18:52:36] I doubt it will kill them fwiw
[18:52:41] _joe_: I think we're also holding up a backport window right now
[18:52:49] yeah we are
[18:52:56] <_joe_> or, we can rollback, while we're rolled back we can scale up
[18:53:09] this is kind of the problem, yeah - we've changed a couple of things about how capacity is "represented" (for lack of a better term), and unfortunately that's tangled with a bunch of other things, like how pybal alerts are useless
[18:53:11] <_joe_> and see if videoscalers hold up
[18:53:36] <_joe_> pybal is useless in general when it comes to k8s
[18:53:43] so I am +1 on rollback to unblock the window and then fix things after
[18:53:45] precisely, yeah
[18:53:46] <_joe_> anyways, sure, roll back
[18:54:17] I can take that from here, then
[18:54:35] thank you swfrench-wmf!
[18:56:43] it does make this any less annoying but this is more or less the same situation as thumbor
[18:56:56] we could absolutely beat the crap out of a few metal instances and clog them with processes
[18:57:05] er *it doesn't
[18:57:59] the other option is DIY loadbalancing instead of abusing readiness probes ;)
[18:58:06] I've implemented that before
[18:58:29] that's exactly what we did for thumbor
[18:58:34] and it suuuuucks :D
[18:58:35] oh neat, I didn't know
[18:58:37] ahaha
[18:58:53] we actually want to get rid of it and go to a single-worker-per-pod thing and let k8s sort it out
[18:59:28] the miss penalty is a lot less for subsecond tasks, at least
[18:59:59] okay, I have to scoot briefly - I will be back in ~30 mins but I have my phone and can get back to a computer so please let me know if needed
[19:03:27] thanks for sticking around, h.nowlan! good luck with candy distribution :)
[19:10:50] alright, now that we're no longer blocking the train, I might go ahead and test out the maxUnavailable hack - any objections to that, or better yet, someone who knows authoritatively that will not help?
[19:22:28] aaand never mind. it turns out we've not wired strategy overrides into the shellbox chart.
[19:24:23] I'll try applying once we drift below the default of 25%
[19:39:44] thanks for handling the rollout swfrench-wmf! I'm around now if there's any follow-up
[19:42:46] hnowlan: no problem at all. I don't think there's much immediate follow-up - I'm just waiting for %-unavailable to drift below 25% and then I'll try applying you changes.
[19:43:00] *your
[19:43:36] might be a while :D I suspect a few of those jobs will be a few more hours
[19:44:30] hnowlan: have we thought about sharded transcodes? I'm guessing that involves even a few extra complexities than the ones I was already thinking of, but
[19:45:01] it was something we talked about way back at the start of this project
[19:45:09] "it'll be quicker to lift and shift" I said
[19:45:12] lolol :<
[19:45:15] 😅
[19:45:18] you might still be right!
[19:45:32] it would require a good bit of work on TMH
[19:46:19] hm interesting, I was imagining ways to split up individual transcodes but keep it transparent to 'clients'
[19:46:34] I've stayed far away from TMH though :)
[19:48:02] what I would imagine is a process where if we get a video larger than $x, we use ffmpeg to split the file up and stick them in a queue which get encoded as normal and then something that again uses ffmpeg to stitch them back together
[19:48:53] ^ that (and that could be transparent for tmh)
[19:49:24] there are a lot of assumptions about ffmpeg splitting being format-friendly though
[20:25:04] shellbox-video is now upsized in codfw - the answer to my question from before: yes, maxUnavailable applies not update exit, not just progress
[20:26:58] interesting typo ... that should say "applies to update exit"
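(For illustration, a toy version of the split → encode → stitch pipeline described at 19:48, driven through ffmpeg's segment and concat muxers via subprocess. Everything here is invented for the sketch: chunk length, codecs, filenames, and the single-process loop standing in for the queue of workers; the format-friendliness caveat from 19:49 very much applies.)

```python
# Toy sketch of sharded transcoding: split a large input into chunks,
# encode each chunk independently, then stitch the results back together.
# Keyframe alignment, audio edge cases and container quirks are glossed over.
import glob
import subprocess

def split_encode_stitch(src: str, out: str, chunk_seconds: int = 300) -> None:
    # 1. Split without re-encoding; cuts land on keyframes, so chunk
    #    lengths are only approximate.
    subprocess.run(
        ["ffmpeg", "-i", src, "-c", "copy", "-map", "0",
         "-f", "segment", "-segment_time", str(chunk_seconds),
         "-reset_timestamps", "1", "chunk_%03d.mkv"],
        check=True,
    )
    # 2. Encode each chunk; in the idea above this is the part that would be
    #    farmed out to a queue of workers rather than done in a loop.
    encoded = []
    for chunk in sorted(glob.glob("chunk_*.mkv")):
        dst = chunk.replace("chunk_", "enc_")
        subprocess.run(
            ["ffmpeg", "-i", chunk, "-c:v", "libvpx-vp9", "-b:v", "1M",
             "-c:a", "libopus", dst],
            check=True,
        )
        encoded.append(dst)
    # 3. Stitch the encoded chunks back together with the concat demuxer.
    with open("concat.txt", "w") as f:
        f.writelines(f"file '{name}'\n" for name in encoded)
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", "concat.txt",
         "-c", "copy", out],
        check=True,
    )
```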