[00:00:29] RhinosF1: I get, "This workboard has been disabled, but can be restored to its former glory."
[00:14:45] Hello team, the Observability team is deploying a new netmon instance using Debian Bullseye; progress is being tracked in Phabricator task T309074.
[00:14:45] As part of this change a new netmon instance called netmon1003 was deployed in eqiad, and a failover from netmon1002 to netmon1003 is planned for Tuesday 9 August 2022 at 13:00 UTC.
[00:14:46] We expect an outage of approximately 30 minutes during the failover.
[00:14:46] For more information please write to the Observability team in the #wikimedia-observability channel on Libera.Chat.
[00:14:46] T309074: Put netmon1003 in service - https://phabricator.wikimedia.org/T309074
[07:13:41] sukhe: no, on the 2nd link I gave there should be pins next to workboard and project details
[07:13:45] Workboard is green
[07:13:50] Make project details green
[07:13:53] Instead
[08:54:43] dcaro: Should I merge your puppet change with mine? "novafullstack: remove leaked VMs test, moved to alertmanager"
[09:08:17] btullis: oh yes please, I thought I did
[09:08:34] dcaro: ack, many thanks.
[09:09:37] Done.
[09:09:43] 👍 thanks!
[09:20:17] _joe_: re vopsbot, have you considered using the IRC services account or cloak for authentication rather than the nickname? If the account doesn't enforce the nick instantly, there is a short window in which it might be possible to impersonate an SRE.
[09:21:06] <_joe_> RhinosF1: when we register the user, we'll set nick enforcement on
[09:21:35] <_joe_> Also, please, if you have questions of this nature, it would be easier to handle them asynchronously via Phabricator :)
[09:22:51] _joe_: I can leave a comment on Phabricator if preferred, but for each SRE who's listed you'd have to check they all have enforcement on (and some networks disable it if you don't log in for a while); it would also mean not needing to add away nicks or alts.
[09:25:55] <_joe_> RhinosF1: we all have nick enforcement on, and sorry, I didn't understand your question
[09:26:11] <_joe_> but please add it to Phabricator :)
[09:26:23] <_joe_> I don't have time to have this discussion synchronously right now
[09:30:00] I added https://phabricator.wikimedia.org/T314842#8139835
[09:30:28] Mentioning both reasons: the risks of nicknames and of away nicks
[10:54:14] _joe_: wee, got my first php74 req via WikimediaDebug.
[10:54:20] according to Special:Version
[10:54:21] 7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf5 (fpm-fcgi)
[10:54:32] 7.4.30 (fpm-fcgi)
[10:54:52] is it really that "clean" or is this hiding something?
[11:04:37] I do note that the switch does not appear to work for Beta. I guess something is either making the ATS code not run, or perhaps php74 isn't exposed/provisioned on those appservers yet? I recall something about the manifest doing only 7.2 by default and it currently being opt-in through prod-specific Hiera that just happens to cover basically all prod servers.
[13:21:01] Hello jynus and XioNoX, godog and I are going to start the netmon1003 failover.
[13:21:38] \o/
[13:43:17] denisse|m: I'm around, how is it going?
[13:46:28] Last step of the failover: add the new host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
[13:46:56] XioNoX: Hi Arzhel, so far so good. I just had an issue with a change I made in the DNS repository, but I fixed it in time.
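
The 10:54–11:04 php74 exchange above checks the serving PHP version through Special:Version. As a hedged alternative, a minimal command-line sketch follows; the debug backend hostname and the exact X-Wikimedia-Debug header attributes are assumptions, and MediaWiki's siteinfo API is used only because it reports the phpversion of the backend that served the request:

    # Hedged sketch (backend hostname is an assumption): route a request through a
    # WikimediaDebug backend and read the PHP version from MediaWiki's siteinfo API.
    curl -s -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' \
      'https://test.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json' \
      | jq -r '.query.general.phpversion'
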
[13:55:00] Hello XioNoX, do you know if there are any precautions we should take before/after merging homer changes?
[13:55:12] More specifically, this is the change I'd like to merge: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
[13:55:54] denisse|m: make sure the diff returned by Homer is what you would expect
[13:56:26] as this is applied to all the devices, it will take some time and a lot of answering "yes" to the prompt
[13:58:27] Thanks, checking that...
[14:12:04] can anyone help me with a confctl depool command for wdqs? The select command is `confctl select dc=codfw,service=wdqs get`
[14:13:27] just wanna make sure I don't depool the entire service =)
[14:25:06] <_joe_> inflatador: so it's dns?
[14:25:29] <_joe_> ah wait
[14:25:45] _joe_ just trying to depool codfw from wdqs. Not sure if confctl can do that though
[14:25:50] <_joe_> that command you wrote would depool all wdqs from pybal in codfw
[14:26:05] <_joe_> inflatador: yes it can, but you have to act on another object type, not the default
[14:26:22] <_joe_> inflatador: confctl --object-type discovery select 'dnsdisc=wdqs' get
[14:27:02] Hello XioNoX and godog, I ran 'homer "*" diff' on 'cumin1001.eqiad.wmnet' before merging my changes and homer gave 1 error and changes for 2 devices: https://phabricator.wikimedia.org/P32327
[14:27:12] Is it okay to proceed with merging my changes?
[14:27:25] ACK, I got that far, do I just use 'depool' instead of 'get' maybe?
[14:27:42] I guess not, need to target only DFW
[14:27:47] <_joe_> inflatador: confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false
[14:28:13] denisse|m: run puppet on the cumin host to pick up your change
[14:28:40] <_joe_> inflatador: always look at https://wikitech.wikimedia.org/wiki/Conftool#The_tools
[14:28:50] Thanks Arzhel, I'm on it...
[14:29:28] <_joe_> inflatador: there's also a cookbook but it seems it broke, I have to go check what's wrong there again
[14:29:43] <_joe_> (to depool a service from a dc, I mean)
[14:30:05] _joe_ got it! and I did check there, will add the "depool a service from a DC" example to the page
[14:30:10] denisse|m: the change for asw-a-codfw is because of me, I'll push it. The one on cr1/2-codfw seems safe too (cc topranks)
[14:30:48] denisse|m: and instead of running it with "*" you can do "status:active", which will ignore the device erroring out
[14:31:11] <_joe_> inflatador: thanks <3
[14:31:13] Apologies - change to cloud-in filter? Should have realised that was also on codfw CRs
[14:31:50] XioNoX: Running it as 'homer "status:active" diff' now. Thank you.
[14:33:46] denisse|m: you can run it with commit directly; it will prompt you for the changes
[14:33:57] and you will save time :)
[14:34:58] XioNoX: ACK, let me try that.
[14:37:07] denisse|m: also don't let that change block the migration, that's low priority
[14:38:45] XioNoX: Okay, while that change is running I'm doing the post-failover validations you suggested. :)
[14:40:06] XioNoX: QQ, one of the points you suggested is 'Ensure no device took too long to poll an alert'. Do you know if there's a particular way to check for that?
[14:40:29] To clarify, I'm mostly wondering if there's something I could trigger to check that or if it consists of looking at the graphs.
[14:47:07] denisse|m: the alert would show up in https://librenms.wikimedia.org/alerts (and on IRC)
[14:47:10] so that's good
[14:47:49] XioNoX: Awesome, thank you! I don't see any alerts on Icinga so I think the failover is going well.
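
Pulling the conftool commands from the 14:12–14:30 exchange above into one place, a minimal sketch of depooling and later repooling a service's discovery record for a single DC; the wdqs/codfw names mirror the example in the log, and the repool line is an assumed mirror image of the depool:

    # Inspect the current discovery (dnsdisc) state for the service:
    confctl --object-type discovery select 'dnsdisc=wdqs' get
    # Depool only the codfw discovery record for wdqs:
    confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false
    # Later, repool it (assumed symmetric to the depool above):
    confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=true

As _joe_ notes above, the default conftool object type acts on per-host pool state behind pybal, which is why the plain `confctl select dc=codfw,service=wdqs ...` form would have depooled every wdqs backend in codfw rather than the datacenter's discovery entry.
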
[14:56:23] I'm going to have breakfast now. I'll be on the lookout for IRC alerts or anything that requires my attention regarding the netmon1003 failover.
[14:57:39] thank you denisse|m
[15:00:16] awesome! thanks!
[15:36:47] I will keep db1117:m1 with its SQL thread stopped until tomorrow. Please update the ticket or send me an email if you see something weird, so I don't restart it during my UTC morning.
[15:37:03] Re: librenms
[15:38:07] jynus: thank you, AFAICT things are looking good and you can restart replication; if you'd rather do that tomorrow, that's fine too I think
[15:38:39] yeah, no worries
[15:39:01] we can wait, catching up tomorrow will only take a few minutes
[15:39:27] ack
[15:39:36] I was giving a heads-up because I will be going offline
[15:40:02] so you have a way to communicate with me before I restart it tomorrow
[15:40:25] although to be fair, if something goes very wrong, you can call me; don't wait until tomorrow
[15:40:57] sorry if I sound pessimistic (things possibly going wrong) but it kind of goes with my job as the recovery person
[15:41:04] 0:-)
[15:41:28] I have to be ready for that 0.01% of the time
[15:41:38] haha! thank you for that jynus
[15:41:52] but I have full trust in your work!
[16:05:58] elukey how's your k8s? We are trying to stop/destroy all running flink-session pods to fix the codfw thanos-swift storage craziness (ref https://phabricator.wikimedia.org/T304914 )... not sure
[16:07:50] inflatador: I might be able to help, if elukey isn't around.
[16:08:47] or unless someone else from serviceops wants to step in.
[16:11:05] btullis: can helmfile destroy be used to undeploy a service, e.g. "helmfile -e codfw destroy"?
[16:13:12] Yes, I believe that method is fine.
[16:13:48] You can also set up your `kubectl` ready for use like this.
[16:13:52] https://www.irccloud.com/pastebin/sxKMybOD/
[16:16:56] btullis: thanks! it worked
[16:17:19] 👍 Great.
[16:18:36] btullis we also need to delete all associated configmaps; `kubectl delete configmap -l app=rdf-streaming-updater-codfw-flink-cluster` gives a permission error... is this the correct cmd?
[16:23:13] Hmm. Less confident on this one. It might be that we need to get access to the admin namespace: `sudo -i kube_env admin codfw`
[16:23:46] Then delete them with `kubectl delete configmap -n rdf-streaming-updater -l app=rdf-streaming-updater-codfw-flink-cluster`
[16:33:09] btullis excellent, it worked
[16:33:24] I think we are done for the time being, thanks again for helping on short notice
[16:33:47] A pleasure.
[17:04:25] <_joe_> it seems strange that helmfile destroy would leave configmaps dangling, uhm
[17:05:20] <_joe_> btullis: thanks for being the k8s helldesk in our absence <3
[17:09:53] _joe_: No worries. Just glad I didn't accidentally bork something. I was also wondering about the dangling configmaps. I've seen `job` objects left behind before, but not configmaps.
[17:10:38] I see that inflatador has updated this page with the steps carried out: https://wikitech.wikimedia.org/w/index.php?title=Wikidata_Query_Service/Flink_On_Kubernetes&diff=2002778&oldid=1974622&diffmode=visual
[17:33:02] jynus: Thank you! :D
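
A consolidated, hedged sketch of the flink-session teardown discussed between 16:05 and 16:33 above; the helmfile.d path is an assumption about the deployment-host layout, while the helmfile and kubectl invocations are the ones quoted in the log:

    # From the service's helmfile directory on the deployment host (path is an assumption):
    cd /srv/deployment-charts/helmfile.d/services/rdf-streaming-updater
    helmfile -e codfw destroy
    # Switch kubectl to cluster-admin credentials for codfw (the log ran this as a single
    # `sudo -i kube_env admin codfw`; a root shell is assumed here), then remove the
    # configmaps that the destroy left behind:
    kube_env admin codfw
    kubectl delete configmap -n rdf-streaming-updater -l app=rdf-streaming-updater-codfw-flink-cluster
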