[00:00:29] RhinosF1: I get, "This workboard has been disabled, but can be restored to its former glory."
[00:14:45] Hello team, the Observability team is deploying a new netmon instance using Debian Bullseye; progress is being tracked in Phabricator task T309074.
[00:14:45] As part of this change a new netmon instance called netmon1003 was deployed in eqiad, and a failover from netmon1002 to netmon1003 is planned for Tuesday 9 August 2022 at 13:00 UTC.
[00:14:46] We expect an outage of approximately 30 minutes during the failover.
[00:14:46] For more information please write to the Observability team in the #wikimedia-observability channel on Libera.Chat.
[00:14:46] T309074: Put netmon1003 in service - https://phabricator.wikimedia.org/T309074
[07:13:41] sukhe: no, on the 2nd link I gave there should be pins next to workboard and project details
[07:13:45] Workboard is green
[07:13:50] Make project details green
[07:13:53] Instead
[08:54:43] dcaro: Should I merge your puppet change with mine? "novafullstack: remove leaked VMs test, moved to alertmanager"
[09:08:17] btullis: oh yes please, I thought I did
[09:08:34] dcaro: ack, many thanks.
[09:09:37] Done.
[09:09:43] 👍 thanks!
[09:20:17] _joe_: re vopsbot, have you considered using the IRC services account or cloak for authentication rather than the nickname? If the account doesn't enforce the nick instantly, there is a short window in which it might be possible to impersonate an SRE.
[09:21:06] <_joe_> RhinosF1: when we register the user, we'll set nick enforcement on
[09:21:35] <_joe_> Also, please, if you have questions of this nature, it would be easier to handle them asynchronously via Phabricator :)
[09:22:51] _joe_: I can leave a comment on Phabricator if preferred, but for each SRE who's listed you'd have to check they all have enforcement on (and some networks disable it if you don't log in for a while); it would also mean not needing to add away nicks or alts.
[09:25:55] <_joe_> RhinosF1: we all have nick enforcement on, and sorry, I didn't understand your question
[09:26:11] <_joe_> but please add it to Phabricator :)
[09:26:23] <_joe_> I don't have time to have this discussion synchronously right now
[09:30:00] I added https://phabricator.wikimedia.org/T314842#8139835
[09:30:28] Mentioning both reasons: the risks of nicknames and of away nicks
[10:54:14] _joe_: wee, got my first php74 req via WikimediaDebug.
[10:54:20] according to Special:Version
[10:54:21] 7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf5 (fpm-fcgi)
[10:54:32] 7.4.30 (fpm-fcgi)
[10:54:52] is it really that "clean" or is this hiding something?
[11:04:37] I do note that the switch does not appear to work for Beta. I guess something is either making the ATS code not run, or perhaps php74 isn't exposed/provisioned on those appservers yet? I recall something about the manifest doing only 7.2 by default and it currently being opt-in through prod-specific Hiera that just happens to cover basically all prod servers.
[13:21:01] Hello jynus and XioNoX, godog and I are going to start the netmon1003 failover.
[13:21:38] \o/
[13:43:17] denisse|m: I'm around, how is it going?
[13:46:28] Last step of the failover: add the new host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
[13:46:56] XioNoX: Hi Arzhel, so far so good. I just had an issue with a change I made in the DNS repository, but I fixed it in time.
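
The 10:54–11:04 php74 exchange above checks the serving PHP version through Special:Version. As a hedged alternative, a minimal command-line sketch follows; the debug backend hostname and the exact X-Wikimedia-Debug header attributes are assumptions, and MediaWiki's siteinfo API is used only because it reports the phpversion of the backend that served the request:

    # Hedged sketch (backend hostname is an assumption): route a request through a
    # WikimediaDebug backend and read the PHP version from MediaWiki's siteinfo API.
    curl -s -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' \
      'https://test.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json' \
      | jq -r '.query.general.phpversion'
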
[13:55:00] Hello XioNoX, do you know if there are any precautions we should take before/after merging homer changes?
[13:55:12] More specifically, this is the change I'd like to merge: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
[13:55:54] denisse|m: make sure the diff returned by Homer is what you would expect
[13:56:26] as this is applied to all the devices, it will take some time and a lot of answering "yes" to the prompt
[13:58:27] Thanks, checking that...
[14:12:04] can anyone help me with a confctl depool command for wdqs? The select command is `confctl select dc=codfw,service=wdqs get`
[14:13:27] just wanna make sure I don't depool the entire service =)
[14:25:06] <_joe_> inflatador: so it's dns?
[14:25:29] <_joe_> ah wait
[14:25:45] _joe_ just trying to depool codfw from wdqs. Not sure if confctl can do that though
[14:25:50] <_joe_> that command you wrote would depool all wdqs from pybal in codfw
[14:26:05] <_joe_> inflatador: yes it can, but you have to act on another object type, not the default
[14:26:22] <_joe_> inflatador: confctl --object-type discovery select 'dnsdisc=wdqs' get
[14:27:02] Hello XioNoX and godog, I ran 'homer "*" diff' on 'cumin1001.eqiad.wmnet' before merging my changes and homer gave 1 error and changes for 2 devices: https://phabricator.wikimedia.org/P32327
[14:27:12] Is it okay to proceed with merging my changes?
[14:27:25] ACK, I got that far, do I just use 'depool' instead of 'get' maybe?
[14:27:42] I guess not, need to target only DFW
[14:27:47] <_joe_> inflatador: confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false
[14:28:13] denisse|m: run puppet on the cumin host to pick up your change
[14:28:40] <_joe_> inflatador: always look at https://wikitech.wikimedia.org/wiki/Conftool#The_tools
[14:28:50] Thanks Arzhel, I'm on it...
[14:29:28] <_joe_> inflatador: there's also a cookbook but it seems it broke, I have to go check what's wrong there again
[14:29:43] <_joe_> (to depool a service from a dc, I mean)
[14:30:05] _joe_ got it! and I did check there, will add the "depool a service from a DC" example to the page
[14:30:10] denisse|m: the change for asw-a-codfw is because of me, I'll push it. The one on cr1/2-codfw seems safe too (cc topranks)
[14:30:48] denisse|m: and instead of running it with "*" you can do "status:active", which will ignore the device erroring out
[14:31:11] <_joe_> inflatador: thanks <3
[14:31:13] Apologies - change to cloud-in filter? Should have realised that was also on codfw CRs
[14:31:50] XioNoX: Running it as 'homer "status:active" diff' now. Thank you.
[14:33:46] denisse|m: you can run it with commit directly; it will prompt you for the changes
[14:33:57] and you will save time :)
[14:34:58] XioNoX: ACK, let me try that.
[14:37:07] denisse|m: also don't let that change block the migration, that's low priority
[14:38:45] XioNoX: Okay, while that change is running I'm doing the post-failover validations you suggested. :)
[14:40:06] XioNoX: QQ, one of the points you suggested is 'Ensure no device took too long to poll an alert'. Do you know if there's a particular way to check for that?
[14:40:29] To clarify, I'm mostly wondering if there's something I could trigger to check that or if it consists of looking at the graphs.
[14:47:07] denisse|m: the alert would show up in https://librenms.wikimedia.org/alerts (and on IRC)
[14:47:10] so that's good
[14:47:49] XioNoX: Awesome, thank you! I don't see any alerts on Icinga so I think the failover is going well.
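
Pulling the conftool commands from the 14:12–14:30 exchange above into one place, a minimal sketch of depooling and later repooling a service's discovery record for a single DC; the wdqs/codfw names mirror the example in the log, and the repool line is an assumed mirror image of the depool:

    # Inspect the current discovery (dnsdisc) state for the service:
    confctl --object-type discovery select 'dnsdisc=wdqs' get
    # Depool only the codfw discovery record for wdqs:
    confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false
    # Later, repool it (assumed symmetric to the depool above):
    confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=true

As _joe_ notes above, the default conftool object type acts on per-host pool state behind pybal, which is why the plain `confctl select dc=codfw,service=wdqs ...` form would have depooled every wdqs backend in codfw rather than the datacenter's discovery entry.
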
[14:56:23] I'm going to have breakfast now. I'll be on the lookout for IRC alerts or anything that requires my attention regarding the netmon1003 failover.
[14:57:39] thank you denisse|m
[15:00:16] awesome! thanks!
[15:36:47] I will keep db1117:m1 with its SQL thread stopped until tomorrow. Please update the ticket or send me an email if you see something weird, so I don't restart it during my UTC morning.
[15:37:03] Re: librenms
[15:38:07] jynus: thank you, AFAICT things are looking good and you can restart replication; if you'd rather do that tomorrow, that's fine too I think
[15:38:39] yeah, no worries
[15:39:01] we can wait, catching up tomorrow will only take a few minutes
[15:39:27] ack
[15:39:36] I was giving a heads-up because I will be going offline
[15:40:02] so you have a way to communicate with me before I restart it tomorrow
[15:40:25] although to be fair, if something goes very wrong, you can call me; don't wait until tomorrow
[15:40:57] sorry if I sound pessimistic (things possibly going wrong) but it kind of goes with my job as the recovery person
[15:41:04] 0:-)
[15:41:28] I have to be ready for that 0.01% of the time
[15:41:38] haha! thank you for that jynus
[15:41:52] but I have full trust in your work!
[16:05:58] elukey how's your k8s? We are trying to stop/destroy all running flink-session pods to fix the codfw thanos-swift storage craziness (ref https://phabricator.wikimedia.org/T304914 )... not sure
[16:07:50] inflatador: I might be able to help, if elukey isn't around.
[16:08:47] or unless someone else from serviceops wants to step in.
[16:11:05] btullis: can helmfile destroy be used to undeploy a service, e.g. "helmfile -e codfw destroy"?
[16:13:12] Yes, I believe that method is fine.
[16:13:48] You can also set up your `kubectl` ready for use like this.
[16:13:52] https://www.irccloud.com/pastebin/sxKMybOD/
[16:16:56] btullis: thanks! it worked
[16:17:19] 👍 Great.
[16:18:36] btullis we also need to delete all associated configmaps; `kubectl delete configmap -l app=rdf-streaming-updater-codfw-flink-cluster` gives a permission error... is this the correct cmd?
[16:23:13] Hmm. Less confident on this one. It might be that we need to get access to the admin namespace: `sudo -i kube_env admin codfw`
[16:23:46] Then delete them with `kubectl delete configmap -n rdf-streaming-updater -l app=rdf-streaming-updater-codfw-flink-cluster`
[16:33:09] btullis excellent, it worked
[16:33:24] I think we are done for the time being, thanks again for helping on short notice
[16:33:47] A pleasure.
[17:04:25] <_joe_> it seems strange that helmfile destroy would leave configmaps dangling, uhm
[17:05:20] <_joe_> btullis: thanks for being the k8s helldesk in our absence <3
[17:09:53] _joe_: No worries. Just glad I didn't accidentally bork something. I was also wondering about the dangling configmaps. I've seen `job` objects left behind before, but not configmaps.
[17:10:38] I see that inflatador has updated this page with the steps carried out: https://wikitech.wikimedia.org/w/index.php?title=Wikidata_Query_Service/Flink_On_Kubernetes&diff=2002778&oldid=1974622&diffmode=visual
[17:33:02] jynus: Thank you! :D
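
A consolidated, hedged sketch of the flink-session teardown discussed between 16:05 and 16:33 above; the helmfile.d path is an assumption about the deployment-host layout, while the helmfile and kubectl invocations are the ones quoted in the log:

    # From the service's helmfile directory on the deployment host (path is an assumption):
    cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
    helmfile -e codfw destroy
    # Switch kubectl to cluster-admin credentials for codfw (the log ran this as a single
    # `sudo -i kube_env admin codfw`; a root shell is assumed here), then remove the
    # configmaps that the destroy left behind:
    kube_env admin codfw
    kubectl delete configmap -n rdf-streaming-updater -l app=rdf-streaming-updater-codfw-flink-cluster
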