[08:37:53] volans: can you review https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/827508 when you have some time? the cookbooks repo tests are failing and that is blocking us from merging things on the wmcs branch too
[08:43:24] dcaro: I'm just back from vacations and submerged by emails, but sure I can have a look
[08:49:03] dcaro: do you mind if I amend your patch? the reimage cookbook should use the session from wmflib that has automatic timeouts, only the one in upgrade-firmware should use requests directly because of a specific issue with that use case
[08:56:02] volans: no problem
[09:13:25] Deploying some unrelated changes, I have puppet corrections on deploy2002 that I'm not sure are correct (they do not appear on deploy1002, and didn't pop up in PCC)
[09:13:29] Notice: /Stage[main]/Deployment::Deployment_server/File[/srv/deployment]/owner: owner changed 'mcrouter' to 'trebuchet' (corrective)
[09:13:31] Notice: /Stage[main]/Imagecatalog/File[/srv/deployment/imagecatalog]/owner: owner changed 'helm' to 'imagecatalog' (corrective)
[09:13:33] Notice: /Stage[main]/Imagecatalog/File[/srv/deployment/imagecatalog]/group: group changed 'helm' to 'imagecatalog' (corrective)
[09:13:35] Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/File[/srv/deployment/mediawiki-staging]/owner: owner changed 'mcrouter' to 'trebuchet' (corrective)
[09:13:53] I'd rather ask than potentially break deployment stuff
[09:38:07] claime: That's interesting. I'd say that the corrections look 'correct' in that the results match what's on deploy1002 and make sense. But I can't explain why they would have needed correcting in the first place.
[09:52:17] btullis: I don't know if it's worth digging or not but it certainly caught my attention
[09:52:36] topranks: I'm planning a second attempt to deploy this BGP change today: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/826579 - Not expecting any drama, but just making sure you're aware.
[09:53:40] <_joe_> claime: I think there's some work to do with setgid bits on directories
[09:56:02] _joe_: If the setgid bit on /srv/deployment was badly set, we would see the same owner for subdirs, would we not? Or do you mean that the setgid bit should be set on /srv/deployment and isn't?
[09:56:22] <_joe_> I am saying the latter yeah, without having looked deeply
[09:56:32] Right
[09:56:36] <_joe_> I suggest you don't either, unless this really bugs you, then be my guest :D
[09:56:51] It doesn't, as long as puppet does its job of setting perms correctly, tbh
[09:57:02] I have some docker cleanup to do :P
[10:11:16] hi, I'll be rebooting graphite1004 shortly -- please let me know if I shouldn't!
[10:16:04] btullis: ok good stuff, change looks ok to me, if you've any trouble drop me a line, happy to look at it
[10:16:15] topranks: Many thanks. Will do.
[10:24:59] (change of plans re: graphite1004, I'll do it this afternoon or tomorrow)
[10:49:01] topranks: I got a different error on commit this time.
[10:49:06] https://www.irccloud.com/pastebin/hzUiMg3a/
[10:52:19] Oh I see, mismatch between policy names. Will patch it now.
[10:56:12] Yep, change "dse-k8s_import" in the 64609.policy file to "kubedse_import"
[11:15:56] topranks: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/827979 - I also added what I think was a missing policy statement.
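An aside on the wmflib session mentioned at 08:49: wmflib ships a preconfigured requests session with timeouts and retries built in, which is why the reimage cookbook should prefer it over bare requests. A minimal sketch, assuming `wmflib.requests.http_session` takes a caller name plus timeout/tries keywords; treat the exact signature as an assumption:

```python
# Sketch only: contrasts a raw requests call with the wmflib session
# mentioned at 08:49. The http_session() keywords (timeout=, tries=)
# are an assumption based on wmflib's documentation.
import requests
from wmflib.requests import http_session

# Plain requests: no timeout unless you pass one explicitly, so a hung
# server can block a cookbook forever.
resp = requests.get("https://example.org/api")  # hypothetical URL

# wmflib session: timeout and retry behaviour are baked in.
session = http_session("cookbooks.sre.hosts.reimage", timeout=10.0, tries=3)
resp = session.get("https://example.org/api")  # hypothetical URL
```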
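On the setgid tangent from 09:53: when a directory carries the setgid bit, files and subdirectories created inside it inherit the directory's group rather than the creator's primary group, which would keep a tree like /srv/deployment from drifting into mixed ownership. A stdlib-only illustration (the path is a hypothetical stand-in, not /srv/deployment itself):

```python
# Demonstrates setgid-on-directory group inheritance using only the
# stdlib. /tmp/deploy-demo is a hypothetical stand-in for /srv/deployment.
import os
import stat

demo = "/tmp/deploy-demo"
os.makedirs(demo, exist_ok=True)

# Set the setgid bit on the directory (g+s), keeping rwxr-xr-x.
os.chmod(demo, 0o755 | stat.S_ISGID)

# Any file created under the directory now inherits the directory's
# group (on Linux), regardless of the creating process's primary group.
path = os.path.join(demo, "newfile")
open(path, "w").close()
print(os.stat(demo).st_gid == os.stat(path).st_gid)  # True on Linux
```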
[11:49:24] btullis: ah yes, well spotted, that nested reference is needed also, although not directly related to the session between K8s host and network (it's used on core between Spine switch and Core router)
[11:49:44] Sorry for the delay, was in a meeting
[11:51:12] Thanks and no worries. I'd rather move cautiously where BGP is concerned anyway, so no hurry at all. :-) I'll proceed to try to commit it again now'ish.
[14:24:20] topranks: herron: let's move the jobrunner discussion here so we can have it without noise?
[14:24:38] ok
[14:25:08] I was noticing that the hosts alerting are pooled for both, thinking we could try dividing the cluster a bit more
[14:25:22] I think that is reasonable
[14:25:40] there's some hosts that have load averages >200 and there's some that have load average of like 1 heh
[14:25:46] yeah no objection
[14:25:58] I did notice there that jobrunner's enabled for a bunch of hosts not running videoscaler too
[14:26:11] I think that is on purpose? but I don't really know
[14:26:21] _joe_: or someone else from serviceops, do you have some background context to give?
[14:26:35] so perhaps de-pooling jobrunner from the hosts running videoscalers, so they aren't competing with ffmpeg, might do it
[14:26:45] <_joe_> context for what?
[14:26:55] <_joe_> sorry I was writing code and not really reading here
[14:26:57] cdanis: you've probably seen this but in case not: https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Jobrunners
[14:27:01] how was videoscaler / jobrunner provisioned in the first place?
[14:27:07] some mw hosts are running host due to videoscaler jobs
[14:27:14] s/host/hot
[14:27:15] <_joe_> topranks: that's the suggested course of action yes
[14:27:25] <_joe_> cdanis: I don't really get the question
[14:27:32] <_joe_> what do you mean by "provisioned"?
[14:27:36] capacity planning
[14:28:15] <_joe_> is it relevant right now? anyways, there is no capacity planning that can work with videoscaling given how irregular the workload is
[14:28:25] <_joe_> the thing we can do is adapt concurrency in changeprop
[14:28:32] <_joe_> on one hand
[14:28:45] <_joe_> and reduce videoscaling to a few servers only to reduce impact
[14:29:11] <_joe_> videoscaling is broken by design, since I've been here. Fixing it was one of the main motivators for introducing k8s at the wmf
[14:29:23] thanks, that last part was what I was wondering about
[14:29:36] <_joe_> now, I guess you're interested in what to do now, do we have an ongoing overload?
[14:29:56] <_joe_> cdanis: at least now it's not just the oldest host in the whole infra as it was when I joined :P
[14:31:03] <_joe_> "fixing" as in "working around the fact no one touched the code since at least 2012 with clever infrastructure"
[14:31:27] yeah mw1437/1437/1439 are all more or less maxed out, I guess de-pooling them for jobrunner would be sensible?
[14:31:42] mw1440 also
[14:31:43] topranks: +1, and perhaps depooling anything else in the videoscaler group
[14:31:56] <_joe_> yeah
[14:32:07] i.e. any over-laps? so leave a certain set of hosts dedicated to videoscaling, but so it won't affect other workloads
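To answer the "any over-laps?" question mechanically, something like the sketch below could list hosts pooled for both jobrunner and videoscaler. It assumes confctl supports a `select ... get` form that prints one JSON object per matching host (FQDN as key, plus a "tags" entry); that output shape is from memory of conftool, so verify it locally before relying on it.

```python
# Hedged sketch: find hosts pooled for both jobrunner and videoscaler.
# Assumes `confctl select 'service=X' get` emits one JSON object per
# host, keyed by FQDN with a separate "tags" key -- verify locally.
import json
import subprocess

def pooled_hosts(service: str) -> set[str]:
    out = subprocess.run(
        ["confctl", "select", f"service={service}", "get"],
        capture_output=True, text=True, check=True,
    ).stdout
    hosts = set()
    for line in out.splitlines():
        obj = json.loads(line)
        for fqdn, state in obj.items():
            if fqdn != "tags" and state.get("pooled") == "yes":
                hosts.add(fqdn)
    return hosts

overlap = pooled_hosts("jobrunner") & pooled_hosts("videoscaler")
print(sorted(overlap))
# To depool one of them (confctl invocations are auto-logged):
#   confctl select 'name=mw1437.eqiad.wmnet,service=jobrunner' set/pooled=no
```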
[14:32:10] <_joe_> keep in mind this has no real functional impact on jobrunning
[14:32:13] makes sense to me
[14:32:14] <_joe_> as we retry anyways
[14:32:17] ok
[14:32:31] we also have a lot of hosts pooled for jobrunner not running videoscaler
[14:32:50] so if it retries after timeout from busy host it'll likely hit another not running ffmpeg
[14:36:37] OK, I will disable jobrunner on mw1437/mw1439/mw1440 as a start, if that sounds reasonable?
[14:37:43] +1
[14:37:59] please !log or use confctl invocations that will auto-log
[14:38:35] hey, have a look too at the log entries in #-operations, I have done a couple of those hosts so far
[14:40:35] oh ok - sry missed that
[14:40:40] I just did it for mw1437/mw1439/mw1440
[14:41:10] herron: ah shit, we are getting our wires crossed
[14:41:32] I was taking the approach of de-pooling the busy hosts for jobrunner (let them keep at the video tasks)
[14:41:46] I see you were disabling them for videoscaler though
[14:42:41] yeah, was depooling from videoscaler to give resources to jobrunner
[14:42:50] So mw1437 and mw1439 right now are not in pool for jobrunner or videoscaler
[14:43:09] I'll re-add them to jobrunner pool, I'm thinking?
[14:44:33] topranks: ok sounds good
[14:44:46] I'll stop depooling as well
[14:45:05] ok, I'll re-pool those two for jobrunner now
[14:45:26] ok
[14:45:43] ok done
[14:48:54] So right now mw1338/mw1445/mw1446 are still pooled for both.
[14:49:07] They're busy but not completely flatlined like some of the others
[14:49:30] I guess let's leave it and see how it goes, may need to de-pool those for jobrunner if it blips again
[14:50:28] sounds good, will keep an eye on the alerts
[14:51:25] in the mean time, coffee is much needed
[15:12:35] I'm in a meeting right now, re: labweb page
[15:16:20] just paged as well, looking
[15:16:51] <_joe_> jhathaway: I'd ping andrewbogott and dhinus
[15:17:08] will do, thanks _joe_
[15:17:14] <_joe_> "labweb" IIRC means wikitech
[15:17:17] looking
[15:17:26] that might be us, yes
[15:17:28] <_joe_> which wfm
[15:19:38] just ack
[15:19:42] *acked
[15:19:50] <_joe_> !incidents
[15:19:50] 2946 (ACKED) [FIRING:1] ProbeDown (10.2.2.40 ip4 labweb-ssl:7443 probes/service http_labweb-ssl_ip4 ops page eqiad prometheus sre)
[15:19:51] 2945 (RESOLVED) [FIRING:1] ProbeDown (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 ops page eqiad prometheus sre)
[15:19:51] 2944 (RESOLVED) [FIRING:1] ProbeDown (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 ops page eqiad prometheus sre)
[15:19:51] 2943 (RESOLVED) [FIRING:1] ProbeDown (10.2.2.5 ip4 videoscaler:443 probes/service http_videoscaler_ip4 ops page eqiad prometheus sre)
[15:19:53] we are working on it, sorry for the noise
[15:20:00] <_joe_> dcaro: np
[15:20:10] <_joe_> damn, I wanted to try to ack it via sirenbot :P
[15:20:23] dcaro: thanks!
[15:20:32] I can unack if you want ^.^
[15:20:51] <_joe_> dcaro: I can resolve it though via the bot
[15:20:58] <_joe_> I'm 90% sure it won't work
[15:21:33] <_joe_> herron: page handled
[15:22:17] thx
[15:59:37] hmm, is there a way to change the team for the alerts that are generated from the lvs settings? (I guess, the alerts are currently triggering, see https://alerts.wikimedia.org/?q=instance%3Dlabweb-ssl%3A7443)
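The alerts.wikimedia.org link above is the Karma frontend; the same query can be made against the Alertmanager v2 API to inspect which team label those labweb-ssl alerts carry. A hedged sketch, with a placeholder Alertmanager host since the real internal endpoint isn't in this log:

```python
# Hedged sketch: fetch firing alerts for labweb-ssl via the
# Alertmanager v2 API and print their team label. The host below is a
# placeholder; alerts.wikimedia.org itself is the Karma UI, not the API.
import requests

AM = "https://alertmanager.example.org"  # placeholder host, an assumption

resp = requests.get(
    f"{AM}/api/v2/alerts",
    params={"filter": ['instance="labweb-ssl:7443"']},
    timeout=10,
)
resp.raise_for_status()
for alert in resp.json():
    labels = alert["labels"]
    print(labels.get("alertname"), labels.get("team"))
```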
[16:02:59] I see it's defined in the alerts repo, under team-sre/probes.yaml
[16:37:46] _joe_: no, john is out this week
[16:46:33] herron: created T316682 to follow up 👍
[16:46:34] T316682: [cloudweb] Improve the alerts coming from the LVS setup - https://phabricator.wikimedia.org/T316682
[16:46:46] dcaro: great, thanks!
[17:19:06] _joe_: saw a ping, but I'm on holiday and won't be near a laptop for another 2 hours, anything I can help with?
[17:19:17] <_joe_> balder: go away
[17:19:20] <_joe_> it's not urgent
[17:19:32] <_joe_> :*
[17:19:38] Ok thanks, will catch up later :)
[17:20:04] <_joe_> and turn off irc on your phone
[17:20:21] Lol
[17:20:25] * balder gone