[08:13:27] elukey: we got alerts on the puppetmaster regarding Stale template error files present for '/srv/config-master/pybal/codfw/ml-staging-ctrl'
[08:13:52] elukey: do you know anything about this? I couldn't find anything on SAL
[08:16:19] <_joe_> vgutierrez: if i had to bet
[08:16:30] <_joe_> they added the pool with no servers pooled
[08:16:50] <_joe_> klausman probably can confirm, i've seen him working on the staging cluster
[08:18:21] what would I do with old LVS VIPs in netbox? Do we just delete (and potentially reuse) them or do we block them somehow?
[08:19:27] jayme: what do you mean by old?
[08:19:45] by old I mean "removed from service::catalog"
[08:19:52] and DNS
[08:20:26] jayme: so not in use anywhere?
[08:20:36] exactly
[08:20:44] in that case I'd say delete it yep, so it's back in the pool
[08:21:41] ok, thanks
[08:22:40] <_joe_> +1
[08:47:21] cookbook: error: argument --datacenter/-D: invalid choice: 'esams' (choose from 'eqiad', 'codfw')
[08:47:36] hmmm interesting restriction on sre.hosts.reboot-cluster
[08:54:36] <_joe_> vgutierrez: hah that's been thought for backend services, indeed
[08:54:39] <_joe_> :)
[08:54:52] <_joe_> at the time
[08:55:13] <_joe_> the cache hosts didn't use profile::lvs::realserver so they didn't have any restart-$service script
[08:55:28] I was trying to use it on ncredir hosts
[08:55:31] <_joe_> and they had a dedicated command to restart varnish
[08:55:39] <_joe_> vgutierrez: yeah makes sense
[08:55:47] <_joe_> we can change that I guess :)
[08:55:49] cache hosts need some tuning regarding profile::lvs::realserver
[08:56:08] I'll fix that soon, but not today (Friday)
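The "invalid choice: 'esams'" error above is an argparse-style restriction: the cookbook declares --datacenter/-D with an explicit choices list containing only the core datacenters, so edge sites are rejected before the cookbook even runs. A minimal Python sketch of that pattern, not the actual sre.hosts.reboot-cluster code; the choices tuple is taken from the error message, everything else is illustrative:

    import argparse

    # Only the core datacenters are accepted, matching the error quoted above.
    # Relaxing the restriction (the "we can change that I guess" above) would
    # mean extending this tuple or deriving it from the full list of sites.
    CORE_DATACENTERS = ("eqiad", "codfw")

    parser = argparse.ArgumentParser(description="reboot a cluster host by host")
    parser.add_argument(
        "--datacenter", "-D",
        choices=CORE_DATACENTERS,  # argparse rejects anything not listed here
        required=True,
        help="datacenter the cluster lives in",
    )

    # Reproduces the failure: argparse prints
    # "argument --datacenter/-D: invalid choice: 'esams'" and exits.
    args = parser.parse_args(["-D", "esams"])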
[09:38:56] _joe_/vgutierrez: yes, we're building the staging cluster, but it's not complete. Are the stale config alerts a problem?
[09:39:17] <_joe_> klausman: no, maybe you can acknowledge them
[09:39:26] <_joe_> so that no one else asks :)
[09:39:27] will do
[09:39:45] that wouldn't potentially hide other confd templates issues?
[09:43:18] I might as well just revert that bit of conftool stuff. It's not like it would be hard to recreate
[09:47:45] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/776168 for your reviewing pleasure (Luca is out today)
[09:53:38] ah I didn't notice until now but puppet is failing on prometheus/codfw re: k8s ml staging cc klausman
[09:53:45] Error: Could not retrieve catalog from remote server: Error 500 on SERVEtion detected in k8s_clusters declaration
[09:54:18] Hurm.
[09:54:31] Oh well, more config to revert, then
[09:55:11] I'm off this afternoon but roll forward is also fine with me FWIW
[09:56:02] As I said, Luca is out today, I'd rather do this together with him on Monday
[09:57:04] fair! SGTM
[09:57:07] ack, so let's get back to a stable state for the weekend
[09:57:31] Folded the prom revert into the existing change
[10:03:18] I don't think that the confd revert is required
[10:03:24] let me clean the stale errors...
[10:03:41] I need to go but LGTM for prometheus
[10:04:30] thx, Filippo
[10:06:55] So, undo the conftool bit and we good?
[10:08:01] klausman: I think if you merge that bit that removes the conftool entries, it will break things more
[10:08:15] I suspected so
[10:08:27] pybal already has that service configured, I had to pool those two hosts yesterday to resolve a pybal CRIT (because both were depooled)
[10:08:39] Oops. my bad.
[10:10:52] Will submit the prom changes now
[12:28:35] can I tell PCC that a build depends on a particular labs/private change? the gerrit standard Depends-On syntax does not seem to work
[12:29:09] taavi: i'm not sure there is a way.
[13:04:48] taavi: the current workflow seems to be merge-and-retry :-( not very optimal
[13:07:03] filed T305245
[13:07:03] T305245: pcc should support Depends-On for a labs/private patch - https://phabricator.wikimedia.org/T305245
[16:17:20] https://phabricator.wikimedia.org/T304089#7824967 <- TL;DR, while thinking on things during the drmrs experiment, I realized we should probably re-pool esams an hour earlier than planned
[16:17:50] so, expect esams repool (and drmrs back to just serving CY, PT, ES, and FR) at ~22:00 UTC, not ~23:00 UTC as originally outlined
[16:17:53] XioNoX: ^
[16:18:06] ok!
[16:18:42] I'm out for the weekend in ~15min btw
[16:19:03] np, one way or another we can handle whatever happens!
[16:19:06] enjoy the weekend
[16:48:42] also notable for anyone following along on the drmrs stuff:
[16:49:01] we've already survived most of the daily high plateau, but the final part of it usually contains a spike at the end.
[16:49:25] that final spike to the true peak usually ramps up around ~17:00 -> ~20:00, so that phase is about to begin soon.
[16:51:24] https://w.wiki/4$Tr <- shows the pattern
[16:52:19] eh that was meant to be a 7d view, but once you unlock absolute on the link copy, it also removes the timeframe setting apparently :P
[16:53:07] it's about a +10% rampup over that final little up-spike of the day
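On the "Stale template error files" alert that opened the morning conversation: the working theory in the log is that the ml-staging-ctrl pool was added with no servers pooled, leaving an error file under /srv/config-master that keeps the alert firing until it is cleaned up ("let me clean the stale errors" above). A rough Python sketch of that kind of staleness check, purely illustrative; the directory layout, the ".err" suffix, and the age threshold are assumptions, not the real alert definition:

    import time
    from pathlib import Path

    # Assumed layout: per-service error files somewhere under /srv/config-master.
    CONFIG_MASTER = Path("/srv/config-master")
    MAX_AGE_SECONDS = 30 * 60  # assumed threshold; the real check may differ

    now = time.time()
    stale = [
        path
        for path in CONFIG_MASTER.rglob("*.err")  # assumed ".err" naming
        if now - path.stat().st_mtime > MAX_AGE_SECONDS
    ]

    for path in stale:
        print(f"stale template error file: {path}")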
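And on the earlier Netbox question (what to do with LVS VIPs removed from service::catalog and DNS), the answer in the log is to delete the address so it goes back into the pool. That can be done in the Netbox UI or programmatically; a minimal sketch with the pynetbox client follows, where the URL, token, and example address are placeholders and the snippet only illustrates the API calls involved, not a documented procedure:

    import pynetbox

    # Placeholder URL and token; point these at the real Netbox instance.
    nb = pynetbox.api("https://netbox.example.org", token="REDACTED")

    # Example address only: the retired LVS VIP as recorded in IPAM.
    vip = nb.ipam.ip_addresses.get(address="10.2.1.40/32")

    if vip is None:
        print("VIP not found; nothing to delete")
    else:
        vip.delete()  # removes the IPAM record, freeing the address for reuse
        print("VIP record deleted; the address is back in the pool")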