[07:44:00] brett: Predictable Network Interface names aren't as predictable as you would imagine
[07:44:14] so in the update from stretch to buster those names changed
[07:44:22] and probably the DNS records weren't upgraded
[07:45:00] the source of truth regarding IP<->interface mapping for LVS can be found in the puppet repo, hieradata/common/lvs/interfaces.yaml
[07:58:01] hello folks
[07:58:11] there are several ms-be hosts with their puppet cert expired
[08:03:26] tested the renew on ms-be1028, all good even in the puppet logs
[08:26:25] Emperor: o/
[08:27:00] I have renewed the ms-be's eqiad certs via the sre.puppet.renew-cert, everything looks fine afaict
[08:27:12] if you have time I'll leave the codfw ones to you, otherwise I can keep going
[08:28:49] elukey: that's not meant to happen is it?
[08:30:24] (and I assume the affected hosts are the ones alerting on old puppet runs)?
[08:30:29] <_joe_> Emperor: if a server outlives the puppet cert, which lasts 5 years, it will happen eventually
[08:30:39] if they don't get reimaged within 5y yes
[08:30:45] Emperor: we have an alarm when client puppet certs expire ("Puppet CA expired certs" in icinga for example), but it is an aggregate alarm and it is easy to miss
[08:30:54] Ah, OK. Nice we have a cookbook to fix it, I'll have a look
[08:31:08] super thanks :)
[08:38:48] strictly speaking we already have the reimage cookbook to fix this :-)
[08:39:15] these hosts are scheduled for retirement pretty soon (I just need some CFT)
[08:39:20] [Copious Free Time]
[08:39:31] I know, just trolling :-)
[08:40:25] ;p
[08:43:21] moritzm: can I have a review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/811270 ?
[08:43:53] (it's the revert we talked about yesterday)
[08:44:06] done :-)
[08:44:12] thanks!
👍
[09:03:20] slyngs, moritzm: another quick look (you already approved the patch that depends on this): https://gerrit.wikimedia.org/r/c/operations/puppet/+/802143
[09:05:18] looking
[09:06:16] done :-)
[09:07:18] :stroopwafel: for you, thanks!
[09:08:17] yummy! since this patch removes a component, please remember to run "reprepro clearvanished" on apt1001, see https://wikitech.wikimedia.org/wiki/Reprepro#Removing_a_component
[09:08:31] Thanks! I was looking for that page :)
[09:11:32] hmm... could we add a note somewhere to remind us of that? (as in, add a comment on the gerrit patch, when doing the puppet-merge, or even just a comment on the distributions-wikimedia file)
[09:11:44] do you think that would make sense? (I can give it a try)
[09:15:27] maybe a comment to distributions-wikimedia at the top? even if someone forgets to run clearvanished it's not a big issue, only the next time someone imports a package they will need to run the command instead
[09:15:49] sounds good to me (and will help me in the future xd)
[09:18:12] this should do the trick: https://gerrit.wikimedia.org/r/c/operations/puppet/+/811671
[09:26:50] OK, all those old ms-be* nodes are content again
[12:28:26] btullis: there are pending dns changes in netbox not propagated to the dns, seems related to the makevm of dse-k8s-etcd1003. Did anything go wrong? I see the SAL for the cookbook START but not its end
[12:28:40] (see also the related icinga alert)
[12:35:57] hi, hopefully quick code review question if someone has a few minutes https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/808208/9#message-c19c8b3949b96b1a0e795a43744bf39c06c7f7e1
[12:37:24] hmm, i might be able to answer my own question, I'll push some patches.
[12:39:43] meh, no. I'm lost.
[12:40:20] kostajh: yes, that should use the proxy, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/766777 did set one up
[12:43:30] thanks
[12:56:26] <_joe_> there's more issues with that patch
[12:56:42] <_joe_> let me comment
[13:04:23] ty
[13:05:44] <_joe_> kostajh: should I create the proxy?
[13:06:07] _joe_: is that something different from what I made in https://gerrit.wikimedia.org/r/c/operations/puppet/+/811701 ?
[13:06:27] <_joe_> ah sorry I didn't see it:)
[13:06:33] if so, yes please do. If it's the same as 811701, then yes I'd love to deploy that. I'm trying to do the switchover to the new API in the backport window happening now, but ofc it's fine to leave it for later, rather than rush it.
[13:06:54] volans: Thanks. Looking now.
[13:07:48] There's a createvm cookbook still running for dse-k8s-etcd1003.eqiad.wmnet but I've had no errors from any of the five cookbook runs yet. Will update you.
[13:08:11] could it be that you just let it sit for a while when it was asking for confirmation?
[13:08:16] of the dns changes
[13:08:36] Ah yes, that could well be it. Too many terminals open.
[13:08:38] <_joe_> kostajh: I merged the change, now we need to wait at least 30 minutes for it to propagate though
[13:09:10] <_joe_> so not sure if that fits your release window
[13:09:38] _joe_: thanks! how do I know when it's completed propagating?
[13:09:52] <_joe_> kostajh: you ask me :P
[13:10:11] heh, ok.
[13:10:22] <_joe_> also let me verify it works
[13:10:59] btullis: if possible try to avoid it, as it would have a few side effects: 1) the icinga alert after a while 2) if anyone runs the dns cookbook they will get your diff too and will either get stuck or start looking for confirmations 3) if anyone merges the changes your run will fail (because master would have changed at that point)
[13:11:27] <_joe_> kostajh: are we sure the service works? :P
[13:12:47] _joe_: yes, unless you're seeing otherwise? try e.g.
`curl https://image-suggestion.discovery.wmnet:30443/public/image_suggestions/suggestions/enwiki/2383439` from a mwmaint host
[13:13:02] <_joe_> yeah the problem is envoy doesn't seem to be able to connect to it
[13:16:12] so is that something we'd fix in operations/deployment-charts repo? if so hnowlan would be the person to ask about it, I think.
[13:16:29] can that be because of the 30443 port? not sure how envoy determines the port to use.
[13:17:03] <_joe_> kostajh: no the problem is ours
[13:17:26] <_joe_> I think the way we've integrated the new PKI for ingress is the culprit
[13:18:33] <_joe_> kostajh: sorry, it won't be ready by the end of this window
[13:18:41] <_joe_> I can merge your change off window later
[13:19:24] _joe_: no problem. A Growth team engineer needs to be around for the backport, so I'll just look for a later window. Thanks for your help!
[13:19:48] <_joe_> kostajh: yeah but before doing anything wait for a green light from us
[13:19:54] ack
[13:21:57] _joe_: do you want me to make a task to track this? Or could you make one with details/logs etc? Then we could track progress there
[13:30:27] <_joe_> kostajh: jayme and I are actively looking
[13:36:26] I'd need to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/810007, is the above issue a blocker or ok to proceed?
[13:36:39] (also I haven't done it in a while, some help would be good :)
[13:36:51] (I know that Joe and Janis are already busy)
[13:36:55] <_joe_> elukey: not a blocker but I can't help
[13:36:59] yep yep
[13:37:18] if there is a kind soul that is more expert than me please ping, otherwise I'll do it another time
[13:37:22] <_joe_> but it's unrelated
[13:37:33] <_joe_> elukey: just add it to a deployment window?
[13:37:39] <_joe_> and ask a deployer to help
[13:40:56] _joe_ yeah it is already in progress, not sure if people are checking it anymore, this is why I was asking in here
[13:41:12] anyway I'll try to sneak it in
[13:41:29] <_joe_> yep that was my suggestion
[13:45:02] _joe_: filed as T312225
[13:45:02] T312225: Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225
[13:59:35] volans: Noted, many thanks. All cookbook runs completed successfully now. The alert has gone away too, right?
[14:00:25] correct
[20:30:14] maryum: do you have a minute
[20:38:13] OT: The Phabricator 'meme' button is hilarious!
[20:51:54] <_joe_> denisse|m: there's some gold there indeed
[20:52:10] <_joe_> I don't know why we don't use them more
[21:51:26] anyone know if it's possible to have arbitrary unit display in grafana? In particular I'm trying to graph the rate of something that runs a couple times per hour and would like the units to be "ops/hr" or similar, but the only available option seems to be "ops/s". For now there is simply a note that says the annotation is wrong and it's really ops/hr
[22:03:09] ebernhardson: yeah, it depends on whether you're using a Graph (the old thing) or a Time Series (the new thing) but either way you can do it
[22:03:16] can you link me to your graph?
[22:04:29] rzl: https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&viewPanel=37
[22:05:26] plausible the query could also be increase(...)
instead of rate(...)*3600
[22:06:31] i suppose it's been this way a long time, i was just using it and thought there must be a way to not have the wrong units on the left there :)
[22:07:10] haha I don't know enough promql to nitpick your query :) but the unit is just a display setting, I'll show you where it lives
[22:07:23] it looks like Grafana only has ops/sec and ops/min natively, but there's also a free-text option which should just be all you need
[22:08:05] ahh, free-text would work just fine
[22:08:46] sign in first to switch to grafana-rw, then Edit: https://usercontent.irccloud-cdn.com/file/SAebg0JZ/image.png
[22:09:02] ok, done
[22:09:33] then Time Series on the right, and scroll down to Unit under Standard Options: https://usercontent.irccloud-cdn.com/file/r4QtShMc/image.png
[22:09:50] whoops screenshotted from the wrong tab
[22:10:01] yours will say "Graph (old)" on the right
[22:10:21] and then the unit under Axes: https://usercontent.irccloud-cdn.com/file/fj73Aw1H/image.png
[22:11:29] ok
[22:11:31] "ops/sec" and "ops/min" are under Throughput, but you can type whatever you want and then choose Custom Unit when it doesn't autocomplete to anything
[22:11:59] then Save at the top right and you should be all set
[22:12:00] oh wow, i never thought to just type there. In retrospect it seems obvious :) thanks!
[22:12:28] sure thing!
[23:47:57] rhinosF1: I do have a minute
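
Note on the 22:05 exchange about rate(...)*3600 versus increase(...): the unit arithmetic behind an "ops/hr" axis can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a counter that never resets, sampled once at the start and once at the end of the window. The function names and the sample values are hypothetical, and the sketch does not reproduce the extrapolation or reset handling that Prometheus applies, so real query results may differ slightly.

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Average per-second growth of a counter over a window.

    Assumption: the counter did not reset inside the window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (last_value - first_value) / window_seconds


def ops_per_hour(rate_per_second):
    """Convert a per-second rate into operations per hour."""
    return rate_per_second * 3600


def increase(first_value, last_value):
    """Total growth of the counter over the window (no reset handling)."""
    return last_value - first_value


# Hypothetical samples: a job counter that went from 100 to 103 in one hour.
first, last, window = 100, 103, 3600

hourly_from_rate = ops_per_hour(per_second_rate(first, last, window))
hourly_from_increase = increase(first, last)

# Over a one-hour window both routes describe the same quantity; any
# difference here is floating-point rounding only.
print(hourly_from_rate, hourly_from_increase)
```

Either route yields a per-hour figure, which is why the panel only needs the free-text custom unit described above rather than a different built-in unit.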