[01:10:56] * bd808 off
[08:38:02] I'm rebooting all codfw1dev cloudvirts (again) to test my new cookbook to restart them all
[08:57:47] morning
[08:57:52] ack
[09:00:06] ok!
[09:00:09] good morning
[09:11:04] I'm seeing some weird errors on tools-k8s-worker-nfs-25 (unreachable through ssh, using the console)
[09:11:07] Feb 14 09:10:16 tools-k8s-worker-nfs-25 containerd[3178]: time="2024-02-14T09:10:16.834661515Z" level=error msg="StopPodSandbox for \"c3750d94c13e9be5a6ffab687fc0ef380261df8f97a22031360c38ce51bccf43\" failed" error="failed to destroy network for sandbox \"c3750d94c13e9be5a6ffab687fc0ef380261df8f97a22031360c38ce51bccf43\": plugin type=\"loopback\" failed (delete): failed to find plugin \"loopback\" in path [/usr/lib/cni]"
[09:12:20] it's also failing puppet runs (complaining about a self-signed cert)
[09:12:24] odd, I'll have a look
[09:12:40] (I provisioned that node this morning)
[09:12:41] is it one you just created?
[09:12:45] ah, okok
[09:13:03] I'll let you play with it then :)
[09:15:25] I ran the refresh puppet certs cookbook on it and now it's fine
[09:15:27] how did you notice that?
[09:24:03] there was an alert about puppet failing to run, then I tried to ssh and that did not work, so I used the console and checked journalctl (and browsed a bit, also finding the puppet cert error)
[09:25:08] taavi: I suspect that removing toolforge workers leaks the DNS records, is that a known issue?
[09:28:18] https://www.irccloud.com/pastebin/osL7dVVE/
[09:29:14] dcaro: yes. the tl;dr is that these instances are too old and so the automated cleanup doesn't work properly. so Andrew and I are cleaning things up manually
[09:29:38] okok, should I leave it to you or can I just run the cleanup script?
[09:30:10] go ahead if you already have it handy, I don't think I'll be deleting more today
[09:30:29] awesome, will do
[09:31:05] thanks!
[09:44:02] aand the cloudvirt reboot in codfw1dev finished. can I get a review on https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1003047 before I use it in eqiad1?
[09:55:03] taavi: if I understand correctly, the parallelization there is done by cumin, when using the `str(hosts)` as `D{{{fqdn}}` for the cumin query, right?
[09:58:44] hmm, how does it work with the drain cookbook? (I'll point it out in the patch)
[10:01:27] in run_batch_operation() it splits the host set to run on one host at a time
[10:02:23] so the moment we enable multiple hosts per batch, that cookbook will fail, no?
[10:04:25] right now the implementation of SafeRebootRunner would mean that changing the batching logic in run_batch_operation() would be an issue, yes
[10:06:10] hmm, maybe we can verify that there's only one at the beginning of the `run_on_hosts` implementation to make it explicit
[10:07:11] sure, one moment
[10:07:59] otherwise it looks ok, I still don't easily see how to parallelize the batch (if that's the goal), we could try to do so in `run_batch_operation` on the metaclass, otherwise the cookbook will have to handle it itself (that might be the right thing though, as it knows how to parallelize stuff)
[10:08:29] as in, it knows the details of how to parallelize the actions of the specific cookbook (e.g. first drain all in parallel, then reboot all in parallel)
[10:09:39] taavi: LGTM
[10:11:04] thanks! I think something like what you said (drain x nodes, reboot them, undrain, then repeat for the next x nodes) is worth at least exploring
[10:11:18] yep, would save a lot of time
[10:50:20] anyone doing anything on cloudvirt1031?
[10:51:41] I am running the 'reboot all cloudvirts' cookbook, which seems to have done 1031 and then crashed when draining 1032
[10:51:44] ah yes, taavi, let me know if you want any help, it seems it failed to reboot or similar and the alerts (for neutron) are showing up
[10:52:00] * dcaro should look first on cloud-feed
[10:53:52] I think it's just that silencing things from cloudcumin is broken and an alert will fire faster than a reboot takes
[10:54:09] okok, I'll keep that in mind then
[10:59:58] sorry about that, we really need to get this fixed :/
[11:18:59] I rebased https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/991016 to silence the alerts in alertmanager
[11:24:27] hmm, this time the drain script crashed on an instance on 1032 that's in error state
[11:27:45] icinga is not the source of pages for anything anymore, right? and things are moving out of it already, I think that silencing alertmanager only is better than not silencing anything, and it's even future-proof xd
[11:27:51] I'll comment there too
[11:28:21] I don't think we have anything paging for cloudvirts in icinga anymore
[11:29:16] topranks: any specific agenda for the network meeting later today? anything to prepare on our side?
[11:29:18] I think so too
[11:35:08] and other patches in the same stack could use some reviews too :-)
[11:43:28] I just reviewed them
[11:44:44] thanks
[12:06:49] arturo: nothing specific I want to raise in the meeting later
[12:07:21] Probably good for us to get T316544 (cloudsw upgrades) over the line - I think we said last time we could prep for it; I didn't follow up myself
[12:07:22] T316544: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544
[12:57:58] ack
[13:36:57] I have disabled the first batch of 40 tools. 12 of them have already been archived
[13:40:31] I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/998401 to make the floating IP reverse DNS updater use an RFC2317-style zone name
[13:51:19] that seems to have made the designate api unwell. looking
[13:56:38] seems like the script was just mass-creating records a bit too fast and designate was not keeping up. all fine now
[14:13:38] According to https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster#Renewing_puppetmaster_CA_certificate certs should expire after five years, but the paws system is not yet 2 years old; is expiration in this situation expected?
[14:17:38] Rook: likely the puppetmaster was replaced at some point for OS upgrades but the CA certificate was simply transferred from the previous one instead of being entirely replaced
[14:17:50] yep, that's the case
[14:18:18] Fair enough
[14:18:48] is there anything in paws that requires a local puppetmaster anymore? you can likely just get rid of that instead of renewing the cert now that the prometheus VMs are gone
[14:19:06] I think the last VMs were the prometheus ones maybe?
[14:19:18] The only thing that I'm aware of is the magic that manages the nfs exports file
[14:19:38] oh, true, the nfs server
[14:19:49] I don't think the NFS server needs a local puppetmaster?
[14:20:02] probably not
[14:20:47] there's a bastion also
[14:20:56] (just checking the VMs in horizon)
[14:22:20] it's possible that it's not needed at all and we can use the common puppetmaster, yes, that'd be nice
[14:23:01] The bastion doesn't use puppet in any paws-specific way
[14:23:21] So the nfs exports magic still happens if I drop the local puppet master?
[14:23:59] you'll have to move the nfs VM to use the central puppetmaster, but I think it should
[14:24:22] do you have any secrets on the private puppetmaster repo?
[14:24:31] None that I know of
[14:24:31] (I'm guessing not)
[14:24:58] then it should be possible :)
[14:30:06] That should only be taking `puppetmaster: paws-puppetmaster-2.paws.eqiad1.wikimedia.cloud` out of the project puppet, yes?
[14:30:48] not really, you have to rebuild the host certificates and such
[14:31:13] do that and run the refresh certs cookbook I think?
[14:31:21] we have a cookbook to move ... yep
[14:31:48] How is that run?
[14:32:10] not sure if the cookbook has any expectation of what the puppetmaster should be though (as in, I have only used it to move hosts to a project puppetmaster from the common one, not the other way around)
[14:32:27] Also, any ideas why paws-nfs-1.paws.eqiad1.wikimedia.cloud is not letting me in as me?
[14:32:56] a quick look looks ok
[14:33:02] (the cookbook I mean)
[14:33:53] I can ssh as root, but not as dcaro
[14:34:40] oh, there might be project-specific secrets for the replica_cnf_api
[14:37:30] yep, there are some secrets, so still needed
[14:38:01] What's the replica_cnf_api?
[15:00:20] sorry, got distracted, it's the API that allows us to generate database credentials for the users
[15:00:39] and puts them in the nfs directories (that's why it's running on the NFS server)
[15:16:34] Oh that. Very good. Well I believe the puppet master is updated, so all can remain as is for now
[15:17:27] Rook: did you follow that process? did it work ok? (I have to do it on a couple of other puppetmasters)
[15:18:14] * dhinus paged: cloudvirt1041/ensure kvm processes are running
[15:18:20] Yes, though I need to make one edit to it.
[15:18:38] dhinus: me too, looking since I just rebooted it
[15:18:48] awesome :), please do, I appreciate it
[15:19:17] taavi: dhinus: I got paged by that, but it went away right after, I thought taavi was rebooting it
[15:19:37] well I did, but it's not supposed to page
[15:19:56] dcaro: done, it now reflects what I did
[15:19:59] I thought it was just a slow reboot maybe (like the other alerts)
[15:20:05] Rook: thanks a lot!
[15:20:28] do we not have a canary on cloudvirt1041?
[15:21:58] now we do
[15:22:35] did you have to create it?
[15:23:00] I ran the cookbook to create that
[15:23:05] btw I removed Nicholas from VictorOps, and I'm moving the shift change from Wed to Thu
[15:24:09] and I'm adding back arturo :P
[15:33:44] 👍
[15:46:14] Uh oh — The VictorOps user you have linked to your external SSO ID is not part of wikimedia. Please contact your administrator.
[15:46:29] ^^^ I get that when trying to log into victorops
[15:46:42] hmm I sent you a new invite, but maybe there's some trace of your old user in the system
[15:46:50] did you get an email from victorops?
[15:46:59] yes I did
[15:47:10] it feels like my user is disabled or something
[15:47:27] *it's "pending", but I guess you're unable to link it for some reason
[15:48:11] you might have to email techsupport@wikimedia.org
[15:50:51] also, the invitation takes you to a form to create a new account, which I think is wrong. I need to use the SSO
[15:51:11] I think o11y manages victorops accounts instead of ITS?
[15:52:35] all selected tools have been stopped and moved to the Backlog (Disabled) column on the phab board.
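A quick way to check whether a puppetmaster CA certificate (or a host certificate) is close to expiring, as came up with the paws puppetmaster above: a minimal sketch, assuming the Debian puppet packages' default ssldir of /var/lib/puppet/ssl (actual paths may differ per host and puppet version):

  # print the expiry date of the CA certificate as the agent sees it
  openssl x509 -noout -enddate -in /var/lib/puppet/ssl/certs/ca.pem
  # and of this host's own certificate
  openssl x509 -noout -enddate -in "/var/lib/puppet/ssl/certs/$(hostname -f).pem"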
[15:53:01] \o/
[15:53:06] Now we just have to wait for the complaints :)
[15:54:08] hm, https://grafana.wmcloud.org/d/zyM2etJ4k/toolforge-grid-deprecation?orgId=1 is showing many more tools still running than I'd expect
[15:56:44] ah, zooming in helps a bit. is "backlog" supposed to be the column for tools that have not been disabled?
[15:57:08] komla: ^ ?
[15:58:38] I stopped a little over 100 tools
[15:58:46] what about the other columns like "help wanted" and "pywikibot"? are those supposed to be disabled or not?
[15:59:22] they are part of the tools that are still running.
[15:59:49] does the grafana dashboard update instantly?
[16:00:43] taavi: yes, backlog has tools that have not been disabled
[16:01:53] ok, I see
[16:02:02] I also see some tools (like wikishizhao) running that have no task at all
[16:03:45] maybe those were born after the board was compiled?
[16:04:16] yeah, probably
[16:04:51] let me check. there were incidents where a tool was created after the fact. let me check and then create the ticket
[16:04:55] yeah
[16:07:26] taavi: yeah, I think you're right about victorops accounts -- arturo: try pinging in #-observability maybe?
[16:07:33] there's also some info here: https://wikitech.wikimedia.org/wiki/Splunk_On-Call
[16:07:45] that seems to suggest you have to create a non-SSO user first, then link it to SSO
[16:08:02] I'll do that tomorrow
[16:08:05] thanks!
[16:08:08] no rush :)
[16:12:25] I see wikishizhao running on the grid-deprecation portal but there's no entry on toolsadmin when I search. It also has no members/maintainers
[16:13:11] hmmm, I wonder if that's someone running something on the grid as their user
[16:15:49] himowd is a tool that is running things on the grid without a ticket
[16:19:48] and gergesbot
[16:21:18] okay. checking
[16:22:06] it is completely possible to run workloads on grid engine as a user rather than a tool
[16:22:48] I will pull from the grid-deprecation portal again and go through the list
[16:23:31] the ones I listed were running something at that exact moment, https://grid-deprecation.toolforge.org/ seems to have a bunch more that only have cron jobs that were not running at that moment
[16:24:50] yeah. there are currently 116 on the grid portal. many of them showing as disabled
[17:02:23] andrewbogott: I was thinking of rebooting the cloudvirtlocals, are you around in case anything goes wrong?
[17:03:24] (context is T356975)
[17:10:32] dhinus: I'm here, go ahead.
[17:10:54] thanks
[17:12:01] rebooting cloudvirtlocal1001
[17:16:46] hmm, the cookbook tried and failed to migrate instances
[17:17:07] "No valid host was found. There are not enough hosts available."
[17:17:22] what is the right way to do this?
[17:18:15] I'm trying to create an MX record set in horizon for wikimedia.beta.wmflabs.org. pointing to 185.15.56.115 but it refuses to create it without any error
[17:18:21] Error: Unable to create the record set.
[17:18:40] I can't find any existing MX record for this
[17:18:49] I could make an A record for the domain no problem
[17:21:02] Amir1: MX records point to FQDNs, not IPs
[17:21:23] I tried FQDNs as well
[17:21:29] but maybe not the correct one
[17:21:34] gonna try that too
[17:21:39] did you try them with a trailing dot?
[17:21:58] both
[17:22:03] I'm desperate
[17:22:49] and you have the priority included in the record too?
[17:23:38] andrewbogott: from my IRC logs, the last time we touched the cloudvirtlocals we used wmcs-cold-migrate, does that sound right to you?
[17:23:51] that might be why, let me take a look
[17:24:13] dhinus: if you're just rebooting them then not migrating anything might be easier
[17:24:51] dhinus: yes, just reboot. The cookbook probably doesn't know what to do with a local
[17:24:58] and we don't want to migrate, just let the VMs reboot
[17:25:56] sigh, I thought that was a separate field
[17:25:58] jeez why
[17:26:22] amir@amir:~$ dig wikimedia.beta.wmflabs.org mx +short
[17:26:22] 10 instance-deployment-mx03.deployment-prep.wmflabs.org.
[17:26:24] taavi, what am I doing wrong with 'cluster health'?
[17:26:26] thanks
[17:26:29] https://www.irccloud.com/pastebin/oL5ZLmjk/
[17:27:11] andrewbogott: etcd will not talk to you without a client cert
[17:27:15] (I can also keep googling if you don't immediately see my error)
[17:27:22] ok, I'll add that
[17:27:34] * andrewbogott doesn't understand why it doesn't just take the default cert from /etc/etcd
[17:28:26] andrewbogott, taavi: are you suggesting using sre.hosts.reboot-single instead of wmcs.openstack.cloudvirt.safe_reboot?
[17:29:24] dhinus: yes
[17:29:35] I have identified a couple more tools without tickets. I'm creating the tickets for them
[17:29:47] andrewbogott: ok, trying that!
[17:29:58] the MX record didn't fix the mail reject but meh, tomorrow-me problem.
[17:38:18] cloudvirtlocal1001 rebooted, and I see all VMs running
[17:39:32] is there a quick command to check if etcd is happy?
[17:41:51] yes, I'll do it since I only just figured it out
[17:42:05] * dhinus popping out for 5 mins, brb
[17:42:24] looks good
[17:42:25] dhinus:
[17:42:27] https://www.irccloud.com/pastebin/dH3sIeyt/
[17:53:20] cool! I will proceed with cloudvirtlocal1002
[18:08:05] cloudvirtlocal1002 rebooted and etcd is happy, proceeding with cloudvirtlocal1003
[18:16:18] hmm, alertmanager is still unhappy about neutron on cloudvirtlocal1002, a few mins after the reboot
[18:18:12] but I think it's just that the alert is based on min_over_time over 20m
[18:22:43] yep, they have a bit of delay
[18:22:45] * dcaro off
[18:31:19] all 3 cloudvirtlocals have been rebooted
[18:31:32] I added some notes here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#cloudvirtlocal_reboot
[18:38:35] * dhinus off
[18:50:24] thanks dhinus
[19:13:41] * bd808 lunch
[19:23:13] taavi, do you still have cloudvirt reboots running in the background or should I take over for a round?
[19:25:32] andrewbogott: feel free to. I'm tracking progress in https://phabricator.wikimedia.org/T356975, and the command I've been using is `test-cookbook -c 991016 wmcs.openstack.cloudvirt.safe_reboot --fqdn cloudvirtXXXX.eqiad.wmnet`
[19:25:46] great, I'll do a few
[19:26:20] turns out draining a node is a bit too unstable to just start the script in the background and come back hours later. I had like 3 failures in a row draining 1032, which made me switch to per-node cookbook executions, and after that it's worked flawlessly
[19:27:04] is your test-cookbook -c 991016 different from the git HEAD in the wmcs-cookbooks repo?
[19:27:40] it's HEAD with https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/991016 applied
[19:27:54] cool
[21:33:06] hmm. seems like I screwed up when removing some toolsbeta nodes: I thought the cookbook would take care of the hiera updates, which it clearly does not
[21:33:51] fixed. and apologies to anyone who got paged for that, toolsbeta being able to page in the first place seems like a bug, I'll look into that tomorrow
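On the 'cluster health' exchange above: etcd needs to be given explicit client certificates, so a minimal sketch of the kind of invocation that works looks like the following; the certificate paths are placeholders for whatever puppet actually deploys under /etc/etcd on the node in question:

  # etcd v3 API health check with explicit client certs (cert paths are placeholders)
  ETCDCTL_API=3 etcdctl \
    --endpoints="https://$(hostname -f):2379" \
    --cacert=/etc/etcd/ssl/ca.pem \
    --cert=/etc/etcd/ssl/client.pem \
    --key=/etc/etcd/ssl/client.key \
    endpoint health
  # older etcdctl builds that still default to the v2 API spell it differently:
  #   etcdctl --ca-file ... --cert-file ... --key-file ... cluster-health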