[07:05:44] <dcaro>	 morning
[07:06:07] <taavi>	 morning
[07:06:20] <dcaro>	 oh, things seem unstable, alertmanager (metricsinfra) seems down
[07:06:51] <taavi>	 yeah, and -cloud is reporting lots of things down
[07:07:25] <dcaro>	 there was nothing going on this morning right?
[07:08:25] <dcaro>	 ceph looks ok
[07:08:30] <taavi>	 I'm not aware of anything. I also don't see any alerts
[07:09:26] <dcaro>	 yep, metricsinfra alertmanager is down though, so no alerts from vps/toolforge are expected (except the ones on prod directly, like toolschecker)
[07:09:46] <dcaro>	 I say it's down because I see this message in the prod one
[07:09:50] <dcaro>	 https://www.irccloud.com/pastebin/aA1NNmt1/
[07:09:53] <dcaro>	 hmm, no route
[07:09:59] <dcaro>	 bgp stuff?
[07:10:05] <dcaro>	 I can ssh to the machine
[07:10:27] <dcaro>	 no services are down there, looking into project-proxy
[07:11:25] <dcaro>	 I can't curl that ip yes, that's instance-proxy-03.project-proxy.wmflabs.org.
[07:11:39] <dcaro>	 hmm, ssh to the machine is slow
[07:12:49] <dcaro>	 trying console... will reboot if it does not work
[07:12:56] <taavi>	 I'm in the console
[07:13:03] <taavi>	 seems like it does not have an IP address
[07:13:04] <taavi>	 is DHCP ok?
[07:13:15] <dcaro>	 that would be cloudnets right?
[07:13:40] <taavi>	 I think yes
[07:13:54] <dcaro>	 looking
[07:14:03] <dcaro>	 I think we have alerts for some of that (agents down)
[07:14:55] <dcaro>	 looks ok
[07:15:09] <dcaro>	 https://www.irccloud.com/pastebin/YyOqzGBF/
[07:15:33] <dcaro>	 let's reboot the machine, see if it fails to get the ip
[07:15:46] <dcaro>	 that's a floating ip right?
[07:15:55] <dcaro>	 (as in, a public ip for the VM)
[07:16:15] <dcaro>	 taavi: you reboot it?
[07:16:18] <taavi>	 yeah, but I also can't reach it via the private IP.
[07:16:23] <taavi>	 `ifup ens3` says "No DHCP client software found!"
[07:16:27] <dcaro>	 oh
[07:16:56] <dcaro>	 is dhclient available/installed?
[07:17:11] <taavi>	 no
[07:17:13] <taavi>	 wtf
[07:17:17] <dcaro>	 hmm..... 
[07:17:24] <dcaro>	 is puppet running ok?
[07:17:45] <dcaro>	 we can debug later though
[07:18:06] <taavi>	 moritzm: good morning! it seems like https://gerrit.wikimedia.org/r/c/operations/puppet/+/961005 has started uninstalling isc-dhcp-client from many Cloud VPS instances
[07:19:03] <dcaro>	 good catch, let's stop puppet, install manually, restore service and then revert
[07:19:39] <taavi>	 i think "install manually" is the hard part here, as the instance doesn't have network connectivity at the moment
[07:19:55] <taavi>	 found that from /var/log/apt/history.log, btw
[07:20:04] <dcaro>	 you can manually assign the ip no?
[07:20:17] <taavi>	 let me try
[07:22:18] <taavi>	 yeah, that seems to work
[07:22:36] <dcaro>	 nice
[07:22:46] <taavi>	 let's revert the patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/961915/
[07:22:53] <dcaro>	 I can see the alert now, yes
[07:23:59] <dcaro>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/961916
[07:24:22] <taavi>	 +1
[07:24:51] <taavi>	 so if I had to guess, all bullseye instances are affected
[07:24:53] <dcaro>	 do you know which of the packages is the one that triggers the uninstall of the dhcp client
[07:24:54] <dcaro>	 `/
[07:25:12] <dcaro>	 yep, I think a lot did not yet get affected because the puppet enc failed first xd
[07:25:16] <taavi>	 libisc-export1105 and libdns-export1110
[07:25:18] <dcaro>	 (so puppet failed)
[07:25:24] <dcaro>	 I'll add that to the task
[07:25:27] <taavi>	 good point :P
[07:25:47] <taavi>	 can you send a cloud-announce email?
[07:25:53] <dcaro>	 hmm, I think ci might not be working
[07:25:57] <taavi>	 I will script something to automatically recover most stuff
[07:26:04] <taavi>	 that's a possibility
[07:26:05] <dcaro>	 yes, I'll do
[07:26:06] <taavi>	 let's force merge
[07:26:48] <dcaro>	 we need to sync the puppetmasters too
[07:29:01] <dcaro>	 merged on prod
[07:29:36] <arturo>	 wow!
[07:29:56] <arturo>	 thanks for dealing with this
[07:29:58] <arturo>	 good morning 
[07:30:42] <moritzm>	 oh? having at look at what could have gone wrong there
[07:32:00] <dcaro>	 message sent, taavi there was a command to update the status on irc right?
[07:32:03] <taavi>	 I constructed a one-liner that will manually add the the address if it's not there: `( ip route get 208.80.154.224 | grep 172.16.0.1 ) || ( ip link set ens3 up; ip addr add <INSTANCE IP>/21 dev ens3; ip route add default via 172.16.0.1 )`
[07:32:31] <taavi>	 now I'm looking at writing a script to automatically run that on all bullseye instances
[07:32:52] <moritzm>	 sorry for that, the libdns-export1110 libisc-export1105 were the incorrect sonames, I'll push a fix to drop these
[07:33:37] <dcaro>	 ack, give us a bit of time though, so we can test it on a working infra xd
[07:40:08] <dcaro>	 manually fixed enc-1.cloudinfra so puppet-runs can work
[07:40:22] <moritzm>	 no rush, this can wait until next week, https://gerrit.wikimedia.org/r/961983 is the fix
[07:40:26] <moritzm>	 and sorry for the mess :-/
[07:40:35] <dcaro>	 moritzm: np :), stuff happens
[07:41:09] <dcaro>	 we can check if we add a check to avoid it next time somehow though
[07:41:28] <dcaro>	 xd, check-check
[07:48:04] <dcaro>	 taavi: another option is running dhclient directly right?
[07:48:21] <taavi>	 well you would need to install that package first, which you can't really do
[07:48:46] <dcaro>	 true xd, hmm, scratch-1.cloudinfra-nfs was able to reach the internet
[07:50:19] <taavi>	 uhh
[07:50:47] <taavi>	 trying to run something via cloud-cumin and it fails with 'Permissions 0440 for '/etc/keyholder.d/cloud_cumin_master' are too open.'
[07:51:23] <taavi>	 is cloud-cumin not able to SSH to cloudvirts?
[07:51:39] <dcaro>	 I think it should
[07:51:57] <dcaro>	 hmm
[07:52:17] <dcaro>	 a lot of the openstack commands are through ssh to a control node (for now)
[07:52:31] <taavi>	 hm, it works via cumin
[07:52:50] <taavi>	 ah, I think I'm missing an env variable
[07:53:32] <taavi>	 that also fixes the permission issue
[07:53:33] <taavi>	 nvm
[07:53:43] <taavi>	 error: Cannot run interactive console without a controlling TTY
[07:55:15] <taavi>	 if someone has an idea how to run something on the virtual console via a script, let me know
[07:55:25] <dcaro>	 I think there's no easy way xd
[07:55:32] <dcaro>	 I was thinking that too
[07:55:34] <taavi>	 this is what I have so far: https://phabricator.wikimedia.org/P52761
[07:56:02] <dcaro>	 I wrote something to run scripts on consoles in $previous, but will need some adapting for sure
[07:56:56] <dcaro>	 hmm, does it work?
[07:57:06] <dcaro>	 (the console part)
[07:57:25] <taavi>	 no
[07:57:32] <dcaro>	 xd
[07:57:39] <dcaro>	 okok, we can do it manually for now
[07:57:53] <dcaro>	 as in print the command you would do, then connect to the console, so we can copy-paste?
[07:59:12] <dcaro>	 oh, cloud-cumin can't ssh to cloudvirts, cloudcumin can
[07:59:36] <dcaro>	 (vm -> prod is not allowed, prod -> prod/vm is)
[08:01:55] <taavi>	 hm
[08:02:12] <dcaro>	 hmm...
[08:02:17] <taavi>	 sigh
[08:02:20] <dcaro>	 I'm on proxy3
[08:02:35] <dcaro>	 it seems to not have 185.15.56.49 as ip
[08:02:37] <taavi>	 I think our config for cloud-cumin forbirds TTY allocation for that key
[08:02:50] <taavi>	 that's intentional, it has the private IP and floating IP mapping happens at the neutron layer
[08:03:17] <dcaro>	 but I can't ssh to it from my laptop directly
[08:03:31] <taavi>	 ssh to what?
[08:03:49] <dcaro>	 instance-proxy-03.project-proxy.wmflabs.org oohhh, that's the floating ip
[08:03:50] <dcaro>	 xd
[08:03:53] <taavi>	 yeah
[08:04:05] <dcaro>	 it's proxy-03.project-proxy....
[08:04:07] <dcaro>	 xd, sorry
[08:06:53] <taavi>	 so if I run that script from my laptop, the issue is that I can't use novaobserver to log in to openstack
[08:07:34] <dcaro>	 you can use a socks proxy (like the cookbooks)
[08:07:39] <dcaro>	 you can paste that code in a cookbooks maybe
[08:07:48] <dcaro>	 (so the proxy is started for you)
[08:07:57] <taavi>	 good idea, let me try that
[08:10:25] <arturo>	 anything I can do to help?
[08:11:05] <dcaro>	 hmm, there's no alerts now :/
[08:13:02] <dcaro>	 I'm going fixing manually to some of the projects that are shared (nfs/idp/metricsinfra/...), so those get restored first, we can pair on that if you want (unless taavi needs help on anything, he's script will fix everything at once xd)
[08:14:32] <dcaro>	 hmm, I was using the alerts to guide me a bit :/
[08:14:54] <taavi>	 dcaro: sure, good idea
[08:14:58] <arturo>	 ok lets pair
[08:15:58] <dcaro>	 the alerts are gone, I'm looking into that, it seems I can't ssh to metricsinfra-prometheus-2.metricsinfra.eqiad1.wikimedia.cloud 
[08:16:43] <arturo>	 me neither
[08:16:56] <arturo>	 lets try console
[08:17:05] <dcaro>	 on it, can you do prometheus-3?
[08:17:13] <arturo>	 yes
[08:17:18] <arturo>	 I'll do -3
[08:17:33] <dcaro>	 ack
[08:21:23] <dcaro>	 we got alerts \o/
[08:21:27] <arturo>	 what is the name of the deb package that's missing?
[08:21:40] <arturo>	 it may still be at /var/cache/apt/archives/
[08:21:56] <dcaro>	 isc-dhcp-client
[08:22:06] <dcaro>	 I'm manually adding the ip + route
[08:22:28] <dcaro>	 this chunk of taavi's message before `( ip link set ens3 up; ip addr add <INSTANCE IP>/21 dev ens3; ip route add default via 172.16.0.1 )`
[08:22:40] <arturo>	 ok
[08:22:41] <dcaro>	 but if it's there might help yes
[08:22:55] <taavi>	 can someone check that puppet will install it back once it starts running?
[08:22:59] <dcaro>	 (it's not reinstalled by puppet btw. so we have to apt install)
[08:23:25] <dcaro>	 it's not as is, or it did not in a test I did, I can try again with the next
[08:23:26] <arturo>	 the deb is not in the local cache :-(
[08:23:27] <arturo>	 it is not easy to inject the deb package without network, though
[08:23:37] <dcaro>	 yep xd
[08:24:25] <dcaro>	 I'll go next for cloudinfra stuff https://prometheus-alerts.wmcloud.org/?q=project%3Dcloudinfra
[08:24:40] <dcaro>	 ntp-03
[08:25:42] <taavi>	 I'm having trouble getting the script to talk to the virsh console
[08:25:56] <dcaro>	 yep, consoles are tricky
[08:25:58] <taavi>	 I get the "Escape character is ^] (Ctrl + ])" text, but it doesn't seem to react to anything
[08:26:00] <dcaro>	 and non-responsive
[08:28:11] <dcaro>	 this is the non-generic script I had for console interaction, https://github.com/david-caro/serial-uploader/blob/master/serial_uploader/__init__.py
[08:28:33] <dcaro>	 has to be adapted though
[08:30:33] <dcaro>	 starting with ntp-04
[08:30:52] <dcaro>	 hmm, I think we had a more generic version of that internally, never made it to become open source though :/
[08:31:18] <taavi>	 I think it works now
[08:32:09] <taavi>	 yep I have a working script
[08:32:10] <dcaro>	 it's very flaky, sending enters/spaces and sleeping a bit usually helps
[08:32:20] <dcaro>	 nice :)
[08:32:23] <arturo>	 cool
[08:33:06] <arturo>	 (prometheus-3 was fixed BTW, forgot to mention)
[08:33:22] * dcaro doing cloud-cumin-03
[08:33:57] <arturo>	 I'll do tools-db-2
[08:34:58] <taavi>	 script is running now
[08:35:14] <dcaro>	 nice, let us know how it's progressing
[08:35:18] <taavi>	 yeah
[08:35:32] <dcaro>	 I'll hold doing anything else then
[08:35:39] <taavi>	 can someone write a puppet patch to re-install the dhcp client?
[08:35:46] <arturo>	 I can do that
[08:35:46] <dcaro>	 I can do that :)
[08:35:51] <dcaro>	 arturo: hahaha, yours
[08:35:54] <arturo>	 ok
[08:37:08] <taavi>	 on cloudinfra
[08:38:10] <taavi>	 the script is surprisingly reliable so far
[08:38:18] <taavi>	 updated https://phabricator.wikimedia.org/P52761 with the current version
[08:38:22] <dcaro>	 nice! we can reuse that for other stuff then :)
[08:39:03] <dcaro>	 this is ctrl+d? '\x01D'
[08:39:28] <taavi>	 that's the "detach from console" shortcut that I copy-pasted from some stackoverflow answer. so probably yes
[08:39:39] <dcaro>	 ctrl+] thingie then
[08:39:42] <dcaro>	 nice
[08:40:42] <taavi>	 we might want to puppetize a "run command on specific console" script to all cloudvirts, that can then be used as a building block for future use cases
[08:41:00] <arturo>	 +1
[08:41:01] <arturo>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/961986
[08:41:04] <taavi>	 now doing deployment-prep
[08:41:12] <arturo>	 taavi: can you do tools first
[08:41:14] <arturo>	 ?
[08:41:35] <taavi>	 no, but I can run a separate instance of the script on tools only
[08:41:48] <arturo>	 ok, please do
[08:41:58] <dcaro>	 arturo: the name of the package is the same on buster right?
[08:42:07] <taavi>	 doing
[08:42:33] <arturo>	 dcaro: yes, I think this package has been in the same name since its inception
[08:42:38] <dcaro>	 👍
[08:43:36] <dcaro>	 arturo: you might want to force-merge, as we don't have jenkins tests yet
[08:44:32] <arturo>	 yes
[08:44:39] <arturo>	 I'm running the CI in my laptop just in case
[08:44:41] <taavi>	 hm, actually, the "detach from console" thing doesn't work, but the script ignores that, so it has been leaving tons of open SSH processes in the background :D
[08:44:49] <taavi>	 tools is fixed now, btw
[08:45:00] <arturo>	 cool
[08:45:10] <arturo>	 we have now stuff like ` ToolsDB replication on tools-db-2 is lagging behind the primary, the current lag is 13181`
[08:45:16] <dcaro>	 oops xd, that might backfire later, as console connections might be limited, but we can handle it later once evenything is up
[08:45:31] <taavi>	 it froze on deployment-db09
[08:45:42] <arturo>	 virsh console can be --forced AFAIK
[08:46:05] <dcaro>	 arturo: we can pair trying to sort that out, anything specific you want me to look into?
[08:46:28] <dcaro>	 it seems tools-db is catching up though
[08:46:44] <dcaro>	 11224 now
[08:46:52] <arturo>	 dcaro: I just saw the alert. But I'm not there yet
[08:46:56] <arturo>	 feel free to start with that
[08:47:14] <arturo>	 there was a delay between me bringing up db-2 and db-3
[08:47:21] <arturo>	 so that's possibly the problem
[08:47:28] <arturo>	 and maybe the would catch up soon
[08:47:36] <arturo>	 (on their own)
[08:47:42] <dcaro>	 I think so yes
[08:47:56] <dcaro>	 will just check the logs make sure it's not complaining about something else
[08:48:30] <arturo>	 this problem was just for bookworm VMs? 
[08:48:39] <arturo>	 the main puppetmaster should be unaffected, no?
[08:48:41] <dcaro>	 yes
[08:48:47] <dcaro>	 enc was though
[08:48:55] <taavi>	 just bullseye I think
[08:48:58] <dcaro>	 so tools-enc might be
[08:49:06] <arturo>	 oh, bullseye
[08:49:18] <dcaro>	 yes, bullseye, sorry
[08:49:37] <taavi>	 tools is all fixed
[08:49:41] <arturo>	 great
[08:49:52] <dcaro>	 \o/
[08:50:00] <arturo>	 I'll stand by and wait for the script to finish doing its magic
[08:50:13] <dcaro>	 integration might be a good one to fix too, to allow running tests
[08:50:13] <taavi>	 lmk if there are other projects further in the alphabet you want fixed now
[08:50:47] <taavi>	 doing
[08:51:11] <arturo>	 maybe paws
[08:51:20] <taavi>	 I'll do that after integration
[08:51:34] <arturo>	 ack
[08:51:43] <dcaro>	 paws is working for me
[08:52:14] <dcaro>	 the puppetmaster is down xd
[08:52:34] <taavi>	 fixed now
[08:52:45] <taavi>	 paws had NFS affected too, that might cause some issues
[08:53:13] <dcaro>	 ack, there were no alerts on it though :/
[08:53:17] <taavi>	 found the issue why the script was freezing, turns out I didn't account for SHUTOFF instances
[08:53:35] <dcaro>	 I saw that in the loop, sorry I did not mention
[08:53:54] <dcaro>	 you are waiting for a line without timeout/retries too
[08:54:03] <dcaro>	 so if the console hangs, the script hangs too
[08:54:08] <taavi>	 yeah it's not great :D
[08:54:14] <taavi>	 but it's the best I can do for this situation
[08:54:27] <dcaro>	 that's ok yes, that's why I did not mention xd
[08:56:16] <dcaro>	 tools-db is just picking up, looks ok
[08:58:00] <taavi>	 toolforge nfs was affected, right? do we need to start looking at toolforge-wide restarts?
[08:58:38] <taavi>	 script at gitlab-runners
[08:58:39] <dcaro>	 looking
[08:58:57] <dcaro>	 grafana is down on metricsinfra
[08:59:07] <dcaro>	 ^ mabye next candidate?
[08:59:15] <taavi>	 I restored metricsinfra already
[08:59:22] <taavi>	 it works for me
[08:59:36] <dcaro>	 nm, dashboard is in prod anyhow: https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&var-cluster_datasource=prometheus-tools
[09:00:02] <dcaro>	 there's some yes, will start rebooting things
[09:00:27] <arturo>	 there was a cookbook for that?
[09:01:22] <dcaro>	 yep
[09:01:23] <taavi>	 script is at 'language' project
[09:01:35] <dcaro>	 well, to reboot a worker, not to target the ones with D processes iirc
[09:01:48] <blancadesal>	 toolsbeta-harbor.wmcloud.org still down? 
[09:02:16] <arturo>	 Project toolsbeta instance toolsbeta-harbor-1 is down since 6 hours ago
[09:02:17] <dcaro>	 probably
[09:02:45] <dcaro>	 that's when the issue started more or less
[09:03:13] <dcaro>	 wait no, but the alerts say so :/
[09:03:20] <taavi>	 I can restore toolsbeta for you
[09:03:38] <taavi>	 it's back
[09:03:44] <dcaro>	 https://usercontent.irccloud-cdn.com/file/IwXxuAuS/image.png
[09:03:57] <taavi>	 the fact that we had zero pages go off for that isn't great :/
[09:04:04] <dcaro>	 agree
[09:04:51] <arturo>	 yes
[09:05:33] <taavi>	 script at 'onfire'
[09:05:48] <dcaro>	 🔥
[09:06:02] <taavi>	 so roughly halfway at the project list
[09:06:21] <dcaro>	 good, alerts down to 177 from ~400
[09:06:28] <blancadesal>	 taavi: thanks
[09:09:07] <taavi>	 so it seems like the package was uninstalled like yesterday afternoon, and then things broke as leases expired this morning?
[09:09:10] <dcaro>	 hmm, the patch was submitted yesterday at ~11UTC, git-sync-upstream runs every 10min, and puppet agent every 30, so tops 40 min after every VM should have been affected :/
[09:09:23] <arturo>	 taavi: seems like a valid theory
[09:09:24] <taavi>	 script at 'signwriting'
[09:10:06] <dcaro>	 what's the lease time, 12h?
[09:11:44] <dcaro>	 I think it's `option dhcp-lease-time 86400;`
[09:11:50] <dcaro>	 from the leases file on a client
[09:12:18] <dcaro>	 that's a day in seconds
[09:12:19] <taavi>	 script at 'traffic'
[09:12:21] <arturo>	 I wonder if neutron allows configuring that
[09:12:36] <dcaro>	 probably yes, I think it can be overriden on the client side too
[09:14:32] <taavi>	 script at 'wikidumpparse'
[09:15:07] <dcaro>	 that project name always makes me jiggle xd
[09:16:07] <taavi>	 'wmcz-stats'. just a few more left
[09:16:58] <taavi>	 script is complete. I'm going to eat something now
[09:17:12] <dcaro>	 taavi: it's going in alphabetical order right?
[09:17:19] <taavi>	 yes, it was
[09:17:57] <dcaro>	 there's still a few alerts
[09:18:11] <arturo>	 I'll manually run puppet on some VMs to clean some of the "no puppet resources found" alerts
[09:18:12] <dcaro>	 might take some minutes to update, but will check some
[09:18:47] <dcaro>	 looking into  dumps-nfs-1 
[09:19:01] <arturo>	 some VMs were left with puppet disabled, like proxy-03
[09:19:04] <arturo>	 doing that now
[09:19:53] <dcaro>	 hmm, Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node dumps-nfs-1.dumps.eqiad1.wikimedia.cloud: Failed to find dumps-nfs-1.dumps.eqiad1.wikimedia.cloud via exec: Execution of '/usr/local/bin/puppet-enc dumps-nfs-1.dumps.eqiad1.wikimedia.cloud' returned 1:
[09:19:57] <dcaro>	 looking
[09:20:01] <arturo>	 mmmm yeah
[09:20:04] <arturo>	 the enc seems down again
[09:25:19] <dcaro>	 2023-09-29 09:25:13.714 1367852 ERROR uwsgi_file__usr_local_lib_python3_9_dist-packages_puppet-enc pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'cloudinfra-db03.cloudinfra.eqiad1.wikimedia.cloud' ([Errno 113] No route to host)")
[09:25:20] <dcaro>	 oops
[09:25:29] <arturo>	 ouch
[09:26:46] * taavi back
[09:27:00] <dcaro>	 hmm, that one has no ip
[09:27:18] <taavi>	 anything in puppet logs? does it have the dhcp client package installed?
[09:27:38] <arturo>	 dcaro: you fixing it by hand?
[09:27:43] <dcaro>	 yes
[09:27:48] <arturo>	 ack
[09:28:42] <dcaro>	 getting now 2023-09-29 09:28:15.247 1367996 CRITICAL flask_keystone [-] Unhandled error: OSError: write error
[09:28:47] <dcaro>	 let me restart just in case
[09:29:05] <dcaro>	 2023-09-29 09:28:57.983 1368018 INFO flask_keystone [-] Couldn't authenticate user 'None' with X-Identity-Status 'Invalid'
[09:29:07] <dcaro>	 hmm
[09:29:11] <taavi>	 is the dhcp package install maybe removing the manually added IP?
[09:29:19] <dcaro>	 oh, but now works
[09:29:20] <taavi>	 dcaro: that's mostly expected logspam
[09:29:25] <dcaro>	 okok
[09:29:37] <dcaro>	 taavi: when I ran it manually it did not
[09:29:44] <dcaro>	 I did ran dhclient ens3 at the end though
[09:29:50] <dcaro>	 and it kept the ip
[09:30:27] <taavi>	 there are some InstanceDown alerts that are a few minutes old which is worrying me
[09:30:48] <taavi>	 tools-nfs-2 for example
[09:30:49] <arturo>	 yes, for example tools-nfs-2 is down
[09:30:53] <arturo>	 let me check that one
[09:30:58] <dcaro>	 yes, that one seems to be having io issues 
[09:31:01] <dcaro>	 64 D processes
[09:31:09] <dcaro>	 (it has been increasing)
[09:31:15] <taavi>	 it has an IP
[09:31:18] <taavi>	 should I try rebooting it?
[09:31:34] <arturo>	 I can't ssh to tools-nfs-2
[09:31:36] <dcaro>	 that will knock out all the k8s/grid nodes
[09:31:42] <dcaro>	 is it responsive in any way
[09:31:45] <dcaro>	 ?
[09:31:47] <taavi>	 I got in via the console
[09:31:50] <dcaro>	 (if not, reboot)
[09:31:58] <dcaro>	 anything weird going on?
[09:32:07] <dcaro>	 htop/etc.
[09:32:08] <arturo>	 yeah, I'd say reboot, and be prepared to reboot the whole of toolforge
[09:32:38] <dcaro>	 if nothing pops up yes, let's reboot
[09:32:38] <taavi>	 no, it seems oddly normal
[09:32:46] <dcaro>	 weird :/
[09:32:51] <taavi>	 rebooting
[09:32:52] <arturo>	 we added the IP by hand in most servers
[09:32:55] <dcaro>	 why is it considered down?
[09:33:09] <arturo>	 I wonder if neutron is somehow forgetting about the VMs since they are not excercising dhcp
[09:33:23] <taavi>	 or if the NFS service IP is confusing something
[09:33:25] <dcaro>	 I mean, what does it mean to have an instance down alert? (is it prometheus not being able to scrap eit?)
[09:33:38] <taavi>	 prometheus is not able to reach node-exporter
[09:33:39] <arturo>	 taavi: didn't the NFS servers had double ports for redundancy?
[09:33:42] <dcaro>	 if we set them by hand, neutron might re-lease the ips
[09:33:51] <jelto>	 thanks for fixing the dhcp issues in wmcs. At least all gitlab instances seem to work fine again
[09:33:54] <dcaro>	 unless we run dhclient again, that should refresh the lease
[09:34:10] <dcaro>	 taavi: thanks
[09:34:31] <taavi>	 I think we need to start rebooting toolforge, does anyone have a script for that?
[09:35:10] <arturo>	 I can create one, if there is no cookbook
[09:35:17] <dcaro>	 I got it, there's a cookbooks for k8s
[09:35:46] <taavi>	 I wrote one for the grid a while ago as well
[09:35:59] <arturo>	 dcaro: `wmcs.toolforge.k8s.reboot` no?
[09:36:22] <dcaro>	 yep
[09:36:45] <dcaro>	 wmcs.toolforge.grid.reboot_workers specifically
[09:36:59] <arturo>	 taavi: so tools-ns-2 has 2 IP addresses
[09:37:03] <dcaro>	 that one's for grind
[09:37:09] <taavi>	 some of the instances firing InstanceDowns seem to need a manual `ifdown ens3; ifup ens3` to get a lease. I'll run that on all bullseye nodes without a default route
[09:37:37] <dcaro>	 running teh k8s one
[09:38:08] <taavi>	 I'll start the k8s ones  then
[09:38:10] * dhinus paged: checker.tools.wmflabs.org/toolschecker: NFS read/writeable on labs instances
[09:38:12] <taavi>	 grid ones, sorry
[09:38:16] <taavi>	 hi dhinus
[09:38:18] <dcaro>	 dhinus: sorry
[09:38:27] <dhinus>	 no problem :)
[09:38:36] <taavi>	 you've missed out all of the fun :-/
[09:38:49] <dhinus>	 LOL, catching up now
[09:39:08] <arturo>	 taavi: indeed there are 2 neutron ports for tools-nfs servers:
[09:39:10] <arturo>	 https://www.irccloud.com/pastebin/qMeVqVwN/
[09:39:22] <arturo>	 for some kind of HA (I guess keepalived or similar)
[09:39:30] <taavi>	 tl;dr is that isc-dhcp-client accidentally got uninstalled from all bullseye based cloud vps instances
[09:39:58] <dcaro>	 breaking all bullseye vms networks little by little when the leases expired xd
[09:41:01] <dcaro>	 I think that we noticed mostly when the instance proxy went down, taking all the vps projects down (no pages though!)
[09:41:11] <arturo>	 all main nfs servers seem to have this double port approach, but it seems DOWN in all of them
[09:41:12] <arturo>	 https://www.irccloud.com/pastebin/g66g6HM4/
[09:41:17] <arturo>	 this is confusing
[09:41:29] <dcaro>	 could it be a leftover of the drbd setups?
[09:41:44] <dcaro>	 propagated by the nfs server creation script
[09:41:44] <arturo>	 did we even had drbd in the VMs?
[09:42:02] <taavi>	 I think not, but there's some other mechanism
[09:42:11] <arturo>	 I guess this may be the service address
[09:42:19] <taavi>	 grid reboot cookbooks are failing
[09:42:19] <arturo>	 the port is created just to reserve the address maybe
[09:42:32] <taavi>	 probably
[09:42:42] <dcaro>	 kubernetes reboot cookbooks seem to be going ok
[09:42:46] <dcaro>	 maybe
[09:43:01] <dcaro>	 taavi: what errors are you getting?
[09:43:10] <taavi>	 arturo: can I ask you to look into non-toolforge nfs servers, if they are having any similar issues?
[09:43:17] <taavi>	 dcaro: fails to run `reboot-host` on the VMs
[09:43:21] <arturo>	 yes
[09:43:57] <dcaro>	 taavi: getting stuck/timeout or erroring out?
[09:44:34] <taavi>	 I don't know, it just says "Cumin execution failed"
[09:44:51] <dcaro>	 you can check the cumin logs for the full command
[09:45:34] <dcaro>	 I can try too, all of them are failing?
[09:45:59] <taavi>	 timeout it seems
[09:46:14] <taavi>	 yeah, all three queue types
[09:46:30] <taavi>	 note you can't run them on cloudcumins due to some sort of dependency issue
[09:46:32] <dcaro>	 hmm, does it try to force-reboot? (using openstack)
[09:46:41] <dcaro>	 that's not nice :/
[09:47:07] <taavi>	 it does not, maybe I need to hack it to do that
[09:47:17] <dcaro>	 yeas, the  k8s has that
[09:47:26] <taavi>	 I'll do that
[09:47:31] <dcaro>	 👍
[09:49:28] <dcaro>	 arturo: how is the nfs check going? seeing similar issues?
[09:49:34] <arturo>	 none so far
[09:49:43] <arturo>	 but I'm getting a weird thing with math-nfs-1
[09:49:53] <arturo>	 can you ssh to `math-nfs-1.math.eqiad1.wikimedia.cloud`?
[09:50:08] <dcaro>	 on it, paws nfs says it's down https://prometheus-alerts.wmcloud.org/?q=project%3Dpaws
[09:50:31] <dcaro>	 from my laptop rigt?
[09:50:36] <arturo>	 yes
[09:50:50] <arturo>	 I can't
[09:50:51] <arturo>	 https://www.irccloud.com/pastebin/u50UAfgv/
[09:50:52] <dcaro>	 it seems to time out
[09:51:00] <arturo>	 but the servers looks good from inside the console
[09:51:20] <dcaro>	 yep same error, no route
[09:51:20] <arturo>	 maybe the bastion is having problems?
[09:52:01] <arturo>	 wait
[09:52:10] <dcaro>	 https://www.irccloud.com/pastebin/cqLcqYH7/
[09:52:11] <arturo>	 the IP address in the server is just the /32
[09:52:18] <dcaro>	 oh, it should be /21
[09:52:23] <dcaro>	 maybe a manual typo?
[09:52:25] <arturo>	 https://www.irccloud.com/pastebin/Irn2mfLi/
[09:52:29] <arturo>	 is just the special VIP
[09:52:34] <arturo>	 let me run dhclient
[09:52:39] <arturo>	 or better, ifdown, ifup
[09:52:46] <taavi>	 `ifdown ens3; ifup ens3` seems to be the best way to cycle the ip
[09:52:57] <dcaro>	 ack
[09:53:01] <Rook>	 Hmm...
[09:53:15] <dcaro>	 it's replying to ping now
[09:53:22] <dcaro>	 \o/
[09:53:24] <arturo>	 yes
[09:53:28] <arturo>	 but it lost the /32
[09:53:35] <arturo>	 https://www.irccloud.com/pastebin/jaBiugt4/
[09:53:43] <arturo>	 so not sure if the service is up :-P
[09:53:58] <arturo>	 I suspect this is neutron's doing
[09:54:08] <arturo>	 at least for the NFS servers, that they have this double port thingy
[09:54:16] <dcaro>	 hmm, reboot?
[09:54:21] <taavi>	 is the ip in /e/n/i?
[09:54:54] <arturo>	 taavi: ens3 -> dhcp in /e/
[09:55:02] <arturo>	 let me reboot
[09:55:03] <taavi>	 yes, what about the service ip?
[09:55:13] <arturo>	 taavi: assigned by neutron via dhcp too
[09:55:16] <arturo>	 that's why the port?
[09:55:33] <taavi>	 no, the port iirc is to reserve the IP from being allocated to anyone else
[09:55:44] <taavi>	 profile::wmcs::nfs::standalone has puppet code to assign it to an interface
[09:56:05] <arturo>	 ok, rebooted, came up without the /32
[09:56:09] <arturo>	 let me run puppet
[09:56:33] <arturo>	 indeed puppet mangles the addr
[09:56:35] <arturo>	 https://www.irccloud.com/pastebin/xE7MidoZ/
[09:56:59] <arturo>	 this is a very flaky implementation :-(
[09:57:17] <arturo>	 but whatever, after the puppet run both IPs are in the interface
[09:57:52] <arturo>	 root@math-nfs-1:~# ip -br a show dev ens3
[09:57:52] <arturo>	 ens3             UP             172.16.3.52/21 172.16.2.227/32 fe80::f816:3eff:fe45:fa0b/64 
[09:58:10] <taavi>	 sigh
[09:58:16] <dcaro>	 :S
[09:58:33] <dcaro>	 same with paws-nfs then?
[09:58:44] <arturo>	 I don't think we have many NFS servers at the moment, but they will all require a full reboot + puppet run most likely
[09:59:32] <dcaro>	 are you on that? want to get the list and pair on it?
[10:00:13] <arturo>	 we have many nfs servers :-(
[10:00:14] <arturo>	 https://www.irccloud.com/pastebin/D0m1pTiY/
[10:00:38] <dcaro>	 only bullseye ones would be affected no?
[10:00:42] <dcaro>	 still a bunch
[10:00:47] <taavi>	 yep
[10:00:52] <taavi>	 and toolforge is ok, remember not to touch it
[10:01:17] <dcaro>	 let's get the filtered list in an etherpad then?
[10:01:25] <arturo>	 yeah, working on the filtered list
[10:01:33] <dcaro>	 thanks 👍
[10:02:43] <arturo>	 https://etherpad.wikimedia.org/p/we-really-love-nfs
[10:02:46] <dcaro>	 I just saw an error message from irccloud saying something about a database error (lasted less than a second), might be unreliable on irc :/
[10:03:10] <taavi>	 LOL at the etherpad name
[10:03:21] <arturo>	 :-)
[10:03:43] <arturo>	 dcaro: I'll start from the bottom
[10:03:58] <dcaro>	 +1 for the name too xd
[10:04:10] <dcaro>	 I added a section there with 'working on' so we can copy-paste there
[10:04:14] <arturo>	 ok
[10:06:44] <taavi>	 where are we with the k8s reboots?
[10:06:59] <dcaro>	 doing tools-k8s-worker-68.tools.eqiad1.wikimedia.cloud
[10:07:02] <taavi>	 ok
[10:07:09] <taavi>	 it's going to take a while with that pace though :/
[10:07:13] <dcaro>	 it's a bit slow (goes one by one, and waits for reboot)
[10:07:53] <dcaro>	 new processes should work ok though, only old stuck ones could have issues
[10:07:54] <dcaro>	 https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&var-cluster_datasource=prometheus-tools
[10:08:30] <arturo>	 so rebooting a nfs server alone doesn't fix the problem because until the next puppet run (maybe 30 minutes) it wont get the address
[10:08:38] <arturo>	 (just to clarify)
[10:08:42] <taavi>	 i rebooted the bastions and the cron node
[10:09:02] <taavi>	 and after that the clients might be unhappy still
[10:09:28] <dcaro>	 arturo: ack, I'm sshing and manually running puppet
[10:09:33] <arturo>	 dcaro: ack
[10:09:37] <dcaro>	 taavi: hmm, that means that nfs is not healthy no?
[10:09:51] <taavi>	 sorry, I mean after you run puppet on a nfs server
[10:09:59] <taavi>	 so not related to toolforge
[10:10:14] <dcaro>	 taavi: aaahhh, yes, okok, yes
[10:10:58] <dcaro>	 we don't crontrol the clients on most of those projects though, but we can send an email to let them know they might need to reboot the client VMs
[10:11:29] <dcaro>	 any other ideas?
[10:11:57] <dcaro>	 (we can try rebooting the VMs ourselves, but we are a bit blind on what is mounting it/if it can be rebooted right away)
[10:12:13] <taavi>	 there are not that many projects, I think we should try rebooting them by ourselves
[10:13:55] <dcaro>	 I'm not worried about the amount of projects, mostly about rebooting something without knowing what is running in it
[10:14:59] <dcaro>	 hmm.. there's a bunch of new alerts
[10:15:36] <arturo>	 dcaro: I think we are both working on cvn-nfs-1
[10:15:38] <arturo>	 :-(
[10:15:45] <dcaro>	 oh, sorry
[10:15:48] <arturo>	 I think I got confussed in the etherpad
[10:15:52] <dcaro>	 split brain on etherpad?
[10:16:05] <arturo>	 human error on mys ide most likely
[10:16:09] <dcaro>	 okok, you take it, I'll go for the next
[10:16:11] <dcaro>	 np
[10:16:26] <arturo>	 dcaro: to clarify: cnv-nfs-1 got rebooted and puppet run, so it should be OK now
[10:16:40] <dcaro>	 ack, you can move it to done then
[10:17:06] <arturo>	 ack
[10:17:47] <dcaro>	 dhinus: if you are around feel free to join xd
[10:18:04] <arturo>	 FYI I opened T347681 for later
[10:18:05] <stashbot>	 T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681
[10:18:15] <dcaro>	 thanks!
[10:21:52] <arturo>	 Rook: you may want to check PAWS for workers with D processes and reboot them (or rebuild, whatever is the routine maintenance)
[10:23:19] <dhinus>	 I'm here, just finished reading the backscroll, I'm not sure I fully understand what you're doing with the nfs servers though :)
[10:23:52] <taavi>	 can we make the toolforge k8s reboots faster? that seems to be the most visible broken thing at the moment
[10:24:26] <arturo>	 taavi: yes, I can loop in wmcs-openstack server reboot --force if that's what you want
[10:24:28] <dhinus>	 there are lots of grid failure emails as well
[10:24:58] <taavi>	 arturo: I'm starting to think that would be useful. dcaro thoughts?
[10:25:36] <arturo>	 dhinus: mostly fighting with T347681
[10:25:36] <stashbot>	 T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681
[10:25:52] <dcaro>	 that will not evict the pods, but the deployment should restart it somewhere else, it will only break custom stuff if any
[10:26:03] <dcaro>	 I'm ok with that
[10:26:20] <arturo>	 dcaro: what if we just delete the VMs testabscookbook-nfs-[1,2] ?
[10:26:42] <dcaro>	 I'm not sure what they are, but they seem just tests
[10:27:05] <dcaro>	 testabs?
[10:27:08] <taavi>	 fwiw, webservicemonitor seems to be working just fine and is restarting all of the grid webservices that sge lost track of
[10:27:09] <dcaro>	 looks like a typo?
[10:27:16] <dcaro>	 is that a project?
[10:27:46] <arturo>	 dcaro: yes typo
[10:27:48] <arturo>	 https://www.irccloud.com/pastebin/qdz9N6hR/
[10:27:51] <arturo>	 I'll just delete them
[10:27:57] <dcaro>	 what project is that in?
[10:28:02] <arturo>	 testlabs
[10:28:09] <dcaro>	 I don't see them on horizon
[10:28:15] <arturo>	 well, I'll shutdowm them first
[10:28:24] <arturo>	 pagination?
[10:28:33] <dcaro>	 dammit, wrong page yes
[10:28:46] <dcaro>	 +1 from me
[10:28:57] <arturo>	 done
[10:29:04] <arturo>	 now, about the k8s workers reboot
[10:29:07] <arturo>	 let me prepare the loop
[10:29:22] <dcaro>	 it's doing 46 now
[10:29:56] <Rook>	 paws seems to be well. I'll wait to see if anyone else reports that it is acting strangely. Thanks!
[10:30:22] <dhinus>	 toolschecker is still complaining about NFS btw, is that expected?
[10:30:22] <arturo>	 Rook: ack
[10:30:37] <dcaro>	 dhinus: it runs on the grid I think
[10:30:39] <arturo>	 dcaro: what number shall I pick as the first then?
[10:30:46] <taavi>	 dcaro: was't that much higher previously?
[10:30:50] <taavi>	 dhinus: which alert?
[10:31:04] <dhinus>	 taavi: OK' not found on 'http://checker.tools.wmflabs.org:80/nfs/home' - 
[10:31:09] <dcaro>	 taavi: yep, probably got through the ones that were stuck on nfs, and now rebooting is faster
[10:31:20] <arturo>	 https://usercontent.irccloud-cdn.com/file/8iIYz0b0/image.png
[10:31:37] <dhinus>	 ah yes that's on Labs
[10:31:50] <taavi>	 the toolschecker instance probably needs a reboot
[10:31:55] <taavi>	 dcaro: is it doing them in order or just randomly?
[10:31:56] <dhinus>	 sorry I meant grid, but not it's not grid
[10:32:04] <dhinus>	 sorry I cannot type :D
[10:32:09] <taavi>	 (toolschecker reboot done)
[10:32:50] <dcaro>	 it's working now :)
[10:33:11] <dhinus>	 NFS alert gone
[10:33:28] <arturo>	 shall I stop trying to create a fast force-reboot loop?
[10:33:50] <dcaro>	 on 43 now, 13 left for the cookbook, all with <10 D processes
[10:33:59] <dcaro>	 I think we might want to focus on grid first instead
[10:34:01] <taavi>	 so probably not needed then?
[10:34:04] <arturo>	 ok
[10:34:13] <taavi>	 all of the grid workers have been rebooted
[10:34:24] <dcaro>	 hmm, there's a couple with >10 D processes https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=tools&var-cluster_datasource=prometheus-tools
[10:34:24] <taavi>	 webservicemonitor is veeery slowly starting all of the web services back up
[10:34:35] <dcaro>	 (might be ok, just sudden load)
[10:34:49] <dcaro>	 yes, they went away
[10:34:57] <dcaro>	 okok, then maybe back to instances down?
[10:35:36] <dcaro>	 https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:36:19] <dcaro>	 some of those have been down for a while, so that's ok, only a few are 'new'
[10:36:23] <taavi>	 sure. most of them seem to be "2 months old" so probably unrelated, but there seem to be a few that are actually new
[10:37:51] <dcaro>	 we wanted to reboot the clients of the nfs servers though right?
[10:38:07] <taavi>	 if they're having issues, yeah
[10:38:53] <dcaro>	 ok, I'll git a stab at that then, I'll use the same etherpad
[10:39:34] <dhinus>	 more grid failures coming in
[10:39:59] <dcaro>	 yes there will be a lot of emails sent today :/
[10:40:22] <dcaro>	 rebooting k8s-worker-37
[10:40:56] <taavi>	 dhinus: they'll keep coming for a while, the toolforge outbound mail server has a per-host rate limit so all of the grid workers have a bunch of them that are queued but not getting delivered
[10:41:03] <dhinus>	 ah-ha, TIL
[10:41:21] <taavi>	 in gmail if you click "show original" from the dropdown menu you can see how long it's been in the queue
[10:41:32] <arturo>	 dcaro: I'm not sure we should reboot nfs clients
[10:41:42] <arturo>	 we are rebooting grid/k8s because it affects scheduling
[10:41:50] <arturo>	 but in other projects I don't think they will have such problems
[10:42:01] <arturo>	 we may cause more harm than good
[10:42:02] <arturo>	 ?
[10:42:31] <dcaro>	 I'm checking on the VMs of the projects, for D state processes, if there's many of those might be worth rebooting
[10:42:51] <arturo>	 ok
[10:43:11] <dcaro>	 math for example seems clean, no stuck processes
[10:43:50] <dcaro>	 I'm a bit weary too though
[10:44:30] <arturo>	 dcaro: better get some rest outside the keyboard
[10:44:33] <arturo>	 I think the outage is gone
[10:44:45] <dcaro>	 yep, I'll have some lunch
[10:44:47] <dcaro>	 in a minute
[10:45:20] <dcaro>	 let me send an update email
[10:45:34] <taavi>	 toolforge grid webservices will still take some time to get back up
[10:45:57] <dcaro>	 ack
[10:46:11] <taavi>	 based on very quick math, I'd say hopefully less than one hour from this point
[10:46:33] <dcaro>	 okok, I'll add it to the email
[10:49:18] <dcaro>	 quarry is down though
[10:50:00] <taavi>	 I'm a bit worried about the new InstanceDown alerts, some instances have dhclient running but still don't have an IP
[10:50:07] <Rook>	 Quarry seems to be running...?
[10:50:39] <dcaro>	 oh, I got 500, refreshing worked
[10:51:05] <dcaro>	 I get 500 from time to time, one of the web nodes?
[10:51:09] <arturo>	 taavi: where do you see that?
[10:51:24] <dcaro>	 or something, not sure what's the current setup xd
[10:51:26] <arturo>	 taavi: nevermind, got it
[10:51:26] <taavi>	 arturo: cyberbot-exec-iabot-01.cyberbot is one example at the moment
[10:51:46] <taavi>	 `ssh -t cloudvirt1034.eqiad.wmnet "sudo virsh console i-0005ec1b"` if you want to log in to the console and have a look
[10:51:48] <dcaro>	 arturo: https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[10:52:01] <arturo>	 I'm now using https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DInstanceDown
[10:52:13] <dcaro>	 that works also :)
[10:52:45] <arturo>	 taavi: this server has DNS problems
[10:53:11] <arturo>	 taavi: and puppet agent doesn't run either
[10:53:27] <taavi>	 and that's unrelated to the current issues?
[10:53:45] <taavi>	 if it does not have an IP, DNS issues are kind of expected
[10:53:45] <arturo>	 well, it makes sense, since it doens't have IP
[10:53:59] <arturo>	 ok I did ifdown ens3 ; ifup ens3
[10:54:04] <arturo>	 and everything is green now
[10:54:05] <dcaro>	 Rook: in case it help
[10:54:07] <dcaro>	 https://usercontent.irccloud-cdn.com/file/gOEYrQEt/image.png
[10:56:17] <arturo>	 taavi: can we run the script and `if no ip, then ifdown; ifup` for all servers?
[10:56:29] <taavi>	 sure, one moment
[10:57:05] <taavi>	 running this everywhere: "( ip route get 208.80.154.224 | grep 172.16.0.1 ) || ( ifdown ens3; ifup ens3 )"
[10:57:15] <taavi>	 (208.80.154.224 is what en.wikipedia.org resolves, in case anyone is curious)
[10:57:18] <arturo>	 ack
[10:58:15] * arturo brb
[10:58:20] <dcaro>	 this looks ok as email? https://etherpad.wikimedia.org/p/QgiZvx5mWMQzlM03zCC1
[11:03:03] <taavi>	 I did some adjustments
[11:03:21] <dcaro>	 quarry is still not working though
[11:03:38] <dcaro>	 Internal Server Error
[11:03:57] <taavi>	 hmm. it works for me
[11:04:04] <dcaro>	 full refresh
[11:04:13] <taavi>	 that might be unrelated, though? T345685
[11:04:13] <stashbot>	 T345685: On first visit to Quarry in that browser session, error 500 (intermittent) - https://phabricator.wikimedia.org/T345685
[11:04:49] <dcaro>	 probably not, but it's not fully operational either
[11:05:13] <taavi>	 let's just not mention if, if the issues are unrelated to this?
[11:05:28] <dcaro>	 maybe restored to previous state
[11:06:11] <arturo>	 taavi: how is the script performing?
[11:06:35] <taavi>	 at 'extdist' at the moment
[11:07:12] <dcaro>	 okok, if nobody has more edits, I'll send the update
[11:07:55] <taavi>	 dcaro: I still think we should not mention quarry, it's too confusing in that form
[11:08:16] <dcaro>	 back to normal?
[11:11:18] <dcaro>	 sure, redacted, sent
[11:11:21] * dcaro lunch
[11:11:46] <taavi>	 thanks
[11:14:07] <arturo>	 taavi: you should take some break too
[11:14:18] <arturo>	 :-P
[11:14:23] <arturo>	 it was a very intense morning
[11:15:26] <arturo>	 (leave the script running in the bg)
[11:17:55] <taavi>	 good idea
[11:21:03] <andrewbogott>	 I just woke up -- is there anything left that I can do?
[11:21:33] <arturo>	 andrewbogott: I think we are mostly ok, just waiting to reset the network in a few VMs. There is a script being run by t.aavi
[11:22:31] <andrewbogott>	 I read the backscroll -- really great work by everyone!
[11:23:29] <arturo>	 we need to capture t.aavi's script into a cookbook, let me create a ticket for that
[11:25:28] <arturo>	 T347683
[11:25:29] <stashbot>	 T347683: openstack: create a cookbook to inject commands to VMs via console at scale - https://phabricator.wikimedia.org/T347683
[11:28:39] * arturo errand
[11:29:28] <andrewbogott>	 dhinus: this is a bit anti-climactic now, but last night I re-imaged cloudservices2005-dev to bookworm and ldap seems to be syncing back and forth (thanks to arturo's cert fix). So all that's left is that wmfbackups package.
[11:30:59] <dhinus>	 andrewbogott: I saw the update in the phab task, well done! I'll try to figure out the wmfbackups issue
[11:32:45] * taavi back
[11:32:59] <taavi>	 grid webservice restarts are done
[11:33:34] <taavi>	 the script to cycle dhclient on the broken instances is at 'wikitextexp'
[12:28:10] <arturo>	 the alerts have basically cleared up. There are some InstanceDown but they are 2 month old
[12:29:05] <dcaro>	 hmm, there's a bunch of puppet errors around though, not sure if those were there earlier
[12:32:05] <dcaro>	 checking a random few, seem unrelated
[12:32:07] <dcaro>	 Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Class[Profile::Kafka::Broker]: parameter 'statsd' expects a String value, got Undef (file: /etc/puppet/modules/role/manifests/kafka/jumbo/broker.pp, line: 10, column: 5) on node 
[12:32:07] <dcaro>	 deployment-kafka-jumbo-8.deployment-prep.eqiad1.wikimedia.cloud
[12:32:30] <taavi>	 I checked a couple too, most seem unrelated
[12:32:34] <andrewbogott>	 def seems unrelated
[12:33:04] <dcaro>	 probably the alerts got refreshed during the outage
[12:33:25] <arturo>	 so, random reflection
[12:33:35] <arturo>	 a month ago we had the DNS resolve change
[12:33:45] <arturo>	 today, the DHCP problem
[12:33:49] <arturo>	 both of them were related to puppet
[12:34:03] <arturo>	 however, from time to time we discuss if we should reduce/stop using puppet within VMs
[12:34:31] <arturo>	 I'm pretty sure the DNS resolver address can be distributed via neutron / DHCP, no puppet involved
[12:34:44] <arturo>	 and similar to the DHCP address problem with NFS servers
[12:35:04] <arturo>	 could a setup be introduced that _doesn't_ rely on puppet but in openstack-native things?
[12:35:22] <dcaro>	 the dhclient is not delivered with neutron though
[12:35:31] <arturo>	 I'm not sure what such mechanism should be, maybe via cloud-init, some kind of metadata service or whatever
[12:35:44] <dcaro>	 that would require cloud-init/prebuilt image + something to keep them running (that now is puppet)
[12:35:59] <taavi>	 what do you mean by dhclient not being delivered?
[12:36:03] <dcaro>	 stopping puppet means stopping continuous config drift support no?
[12:36:16] <arturo>	 let me reword taht
[12:36:18] <dcaro>	 as in neutron does not install dhclient for you
[12:36:37] <dcaro>	 or maintain it setup/installed in the machine, it just servers dhcp
[12:36:37] <arturo>	  well, basically T347681
[12:36:38] <stashbot>	 T347681: Cloud VPS: NFS servers: the current setup requires a puppet run after a reboot to get address right - https://phabricator.wikimedia.org/T347681
[12:36:53] <arturo>	 I think we should research how to get that fixed without using puppet
[12:36:55] <dcaro>	 yes, that can be improved for sure
[12:37:09] <taavi>	 arturo: speaking of neutron distributing the nameserver addresses https://phabricator.wikimedia.org/P52768
[12:37:34] <dcaro>	 how are the dns addresses setup now/
[12:37:34] <arturo>	 taavi: :-( we really need to introduce infra-as-code for all that
[12:37:34] <dcaro>	 ?
[12:37:45] <arturo>	 via the command line / API, but manually
[12:37:56] <arturo>	 we need to instrument all that data via ansible/terraform/whatever
[12:38:02] <dcaro>	 on neutron? and then neutron serves through dhcp?
[12:38:18] <arturo>	 neutron distributes them via dhcp and I guess puppet overwrites it
[12:38:37] <dcaro>	 so they are in puppet too... yep, that's not nice
[12:38:52] <taavi>	 getting DNS resolver addresses via DHCP is a standard DHCP feature, but yes, we have puppet overriding those
[12:39:12] <dcaro>	 I would expect puppet managing the neutron config (though I know that's tricky if there's no support on puppet side)
[12:39:58] <dcaro>	 (given that it manages most of the neutron setup already)
[12:40:01] <taavi>	 yes, but neutron doesn't run anything on the client instances
[12:40:27] <dcaro>	 yep
[12:40:40] <kindrobot>	 Rook: thanks for the ping
[12:40:54] <arturo>	 introducing some flavor of gitops for openstack admin data, such as projects, neutron networks, glance images, quotas, etc would be a huge improvement
[12:41:02] <dcaro>	 I don't mean to completely remove puppet from the VMs, just start handling that config on neutron side instead
[12:41:31] <kindrobot>	 With regard to the environment-per-patch, this was a thing we tried out in DUCT and got very positive feedback. The designer, PM, QA could all test the patch without having to set up a local environment
[12:41:54] <dcaro>	 arturo: isn't that what puppet does?
[12:42:13] <dcaro>	 (maybe not in a very easy way)
[12:42:15] <arturo>	 dcaro: it doesnt. We don't have any of that information in puppet
[12:42:28] <dcaro>	 that does not mean that we could not
[12:42:45] <arturo>	 and anyway, for a openstack-helm (or kolla-ansible) future, that's not very elegant
[12:42:46] <dcaro>	 I mean puppet as in the tool, not as in our current configuration of it
[12:42:59] <kindrobot>	 Of course, resources became a concern, so we ended up using KEDA HTTP autoscaler, to scale the patches down to zero when they weren't in use (think something similar to what Heroku did with inactive dynamos). This made it so that the resource consumption scaled with the number of people testing things and not the number of open patches
[12:43:28] <dcaro>	 we don't have that yet unfortunately
[12:43:50] <dcaro>	 (replying to arturo )
[12:44:12] <kindrobot>	 The upshot was if a PM wanted to test something out, they might have to wait ~5 minutes for the pod to become active again instead of ~15 minutes for an environment to spin up, which made it much more likely that they would actually use/test it
[12:44:18] <arturo>	 dcaro: I know. But why bothering with puppet if you will introduce something else later anyway
[12:44:29] <kindrobot>	 dcaro, arturo: sorry for interrupting x_x;
[12:44:44] <dcaro>	 arturo: to avoid implementing a half solution that will have to be thrown away
[12:45:18] <dcaro>	 (because trying to fit a new config management in the current setup for sure will not be able to be reused in the new one as is, as it will require lots of hacks)
[12:46:23] <dcaro>	 kindrobot: np ;), good to know that the expected 'start' times are ~5m
[12:47:30] <arturo>	 imagine that you create a terraform config to create the admin network, etc. Why would you need to throw that away later when introducing openstack-helm ?
[12:47:33] <dcaro>	 I'll send an update announcing the resolution
[12:47:51] <arturo>	 (or ansible)
[12:48:02] <arturo>	 the entry point is just the API endpoint
[12:48:44] <Rook>	 kindrobot: I agree that there is a desire to have things as close to immediately available as possible, but doing so has an engineering cost. Which is a resource that we have in scarcity here. As such I tend to disregard such desires as untenable given the resources on hand. Being able to deploy something which is not going to increase engineering effort, but will take longer to deploy at any given time is a good trade in a lot of 
[12:48:44] <Rook>	 cases. I would say including this one
[12:49:33] <dcaro>	 the setup is different, the config being applied is different, and the wanted outcome is different, I don't see much more reusage than having had installed the terraform cli
[12:49:47] <dcaro>	 (or whichever)
[12:50:40] <dcaro>	 for the core openstack at least, for projects might be a different case (ex. metricsinfra)
[12:50:52] <Rook>	 That said, the only reason I'm voicing these opinions is that I'm being included in decision making (or planning at any rate). Most of my other commentary has been directed at reducing the number of stakeholders. wmcs really isn't one of them. This is another bad wiki habit the foundation has, everyone has an opinion and is encouraged to give it. Which is mostly always counterproductive. If you would prefer a single large cluster 
[12:50:52] <Rook>	 that didn't use as many resources that is allowed and welcome. From our perspective it is "just a cloud vps project" be it one with lots of clusters or one or anything else.
[12:51:46] <Rook>	 By denying us insight into the project you can proceed as your, rather than our, or my, bias dictates. Thus allowing the freedom of thought necessary to have something built.
[13:03:58] <kindrobot>	 Thanks Rook. That makes sense. FWIW: we _do_ want Cloud Service's input and don't want to make decisions excluding them, because a reasonable outcome of this research is asking Cloud Services to maintain a k8s cluster(s) for testing environments... but maybe the solutions is one cluster per project, or one cluster per patch. We're not sure yet, and I think we'll need your help to
[13:04:00] <kindrobot>	 find out.
[13:05:19] <kindrobot>	 I think we're going to have Slavina (sorry don't know her IRC) on the prototype team at least in some capacity, so we'll have a touch point with your team :)
[13:05:54] <dcaro>	 blancadesal: ^
[13:07:43] <Rook>	 That is an element of the ask that I don't understand. In my view we are "cloud services" meaning we provide general purpose services (VM, DB, k8s, and platform services) that are turned into specific services by the user. As such what we offer needs to have a wide community usage to be reasonable for the last detail. I find it improbable that we can support such a service as a platform service reasonably well, I also find that we 
[13:07:43] <Rook>	 don't support toolforge and paws that well. Regardless such decisions of what fits that category are above my pay grade (And this is the route that you will want to take to manipulate the forces that be to get support out of WMCS)
[13:10:54] <RhinosF1>	 dcaro: I know the issue is resolved but can we please track an incident report for this
[13:11:32] * RhinosF1 very much thinks an outage this large needs one, especially when it was caused by a bad patch and didn't page
[13:11:45] <dcaro>	 RhinosF1: sure, there's no standard for WMCS, but I'll come up with something
[13:13:13] <arturo>	 I would kindly invite moritzm to bootstrap one :-)
[13:13:42] <RhinosF1>	 dcaro: the standard templates are fairly good
[13:13:53] <RhinosF1>	 But they are very likely some actionables and lessons
[13:14:03] * dcaro looking at https://wikitech.wikimedia.org/wiki/Incident_status
[13:15:54] <taavi>	 our incident procedures are probably something worth talking more in general. I'll add a note about that for the next team meeting
[13:17:07] <dcaro>	 we started trying to get something running at some point, not sure where we got, dhinus I think you had some template proposal?
[13:17:44] <taavi>	 since filling an IR is good, but actually ensuring we follow up on the actionables is even better
[13:20:47] <dcaro>	 definitely
[13:21:36] <dcaro>	 not saying you don't add it to the team meeting, just trying to help you know what's there already for that meeting (so we don't start from scratch again, though we might want to xd)
[13:24:05] <dhinus>	 I don't remember if I wrote it down somewhere, let me check
[13:27:57] * arturo food
[13:36:45] <dhinus>	 I thought I wrote something on wiki but I can't find it so maybe I didn't :) I think the template at https://wikitech.wikimedia.org/wiki/Incident_response/Full_report_template is pretty good
[13:37:14] <dhinus>	 another template I used in the past is https://response.pagerduty.com/after/post_mortem_template/
[13:47:27] <dcaro>	 I'll use the prod one then for now at least, might be good if we can reuse it
[13:49:21] <dhinus>	 ah right, the non-restricted bastions haven't changed
[13:51:21] <dhinus>	 wrong channel :/
[13:51:39] <dcaro>	 oh, just saw this xd
[14:30:57] <kindrobot>	 Rook: is magnum ready to use for folks like me to use? Last time I think I was blocked on not being able to create some network primitives.
[14:42:44] <dcaro>	 RhinosF1: dhinus I started writing things down here https://wikitech.wikimedia.org/wiki/Incidents/2023-09-29_CloudVPS_vms_losing_network_connectivity, feel free to add more things that I might missed
[14:43:03] <dcaro>	 taavi: arturo andrewbogott Rook blancadesal ^
[14:43:29] <Rook>	 kindrobot: yes that should be available with the closing of https://phabricator.wikimedia.org/T333874 
[14:43:29] <Rook>	 there are some notes https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Magnum
[14:43:29] <Rook>	 And it is easy to deploy with (slightly old open licensed) terraform (Though I think we're moving to something more open) https://github.com/toolforge/paws/blob/main/terraform/123_7.tf for an example
[14:44:09] <dcaro>	 opentofu!
[14:45:09] <Rook>	 Perhaps "Additionally, please be mindful that building this repository in its current state and running it might put you in violation of the Terraform Registry ToS," is kind of creepy
[14:46:20] <dcaro>	 they still have some things to figure out xd
[14:46:55] <dcaro>	 do we get our providers and modules from terraform registries?
[14:46:58] <dcaro>	 "if that's where you fetch your providers or modules from."
[14:48:02] <Rook>	 Yes
[14:48:15] <dcaro>	 then yep, we might not want to use it as it is yet
[14:48:17] <Rook>	 Mostly just this one https://www.irccloud.com/pastebin/jTL9wZXT/
[14:53:37] <dcaro>	 hmm, is there a way to 'embed' the provider? like a vendored dependency?
[14:53:47] <dcaro>	 anyhow, we are in no hurry yet
[14:53:55] <dcaro>	 (I think)
[14:57:17] <taavi>	 we could host a copy on our own registry in theory. although I'd prefer to wait until opentofu has a release first and see how the ecosystem evolves
[14:57:40] <dhinus>	 they're debating how to solve the registry issue and I expect an alternative registry will emerge
[14:57:46] * arturo can't avoid thinking about ansible
[15:00:20] <Rook>	 ansible can do everything that terraform can do, but for the subset that terraform does, it does well. Mostly the state file makes things nice. If opentofu doesn't work out pulumi is probably a good option. Though all ansible means n-1 tools which is also nice. In the case of terraform I felt it was worth the additional complexity (That's right Rook thought additional complexity was worth the effort). As post-terraform tools evolve 
[15:00:20] <Rook>	 my view might be that ansible only is really the way to go (PAWS is currently terraform with a handoff to ansible)
[15:00:23] <dcaro>	 afaik ansible openstack modules are unable to keep track of the current state, and just apply options, building our own to be reentrant is a lot of work (that's probably why it does not do it)
[15:02:34] <dcaro>	 that means that ansible would be a one-off run. That could be ok if we never have to maintain something continuously
[15:03:02] <dhinus>	 agreed, it feels that terraform is more capable than ansible for this specific use case. but we can reconsider ansible if opentofu doesn't work out.
[15:03:03] <dcaro>	 but I think that for openstack as a whole, we want it
[15:03:31] <dhinus>	 LOL @ Rook voting for additional complexity :)
[15:03:54] <Rook>	 :)
[15:04:09] <dhinus>	 there is a discussion here about solving the registry problem: https://github.com/orgs/opentofu/discussions/431
[15:04:25] <dhinus>	 funnily enough, I think the person who replied with a comment was a former member of the WMCS team :)
[15:04:38] <arturo>	 yeah!
[15:08:07] <dcaro>	 for ansible-cloudvps this was an old integration test (it worked!) https://gerrit.wikimedia.org/r/c/cloud/wmcs-ansible/+/647735
[15:09:21] <blancadesal>	 do we have a gitlab repo to put sample apps in? 
[15:09:38] <dcaro>	 for terraform?
[15:09:49] <dcaro>	 *toolforge xd
[15:09:54] <blancadesal>	 for buildpacks, yes
[15:10:07] <dcaro>	 I created a tool for each of the ones I wrote
[15:10:29] <dcaro>	 like https://gitlab.wikimedia.org/toolforge-repos/sample-php-buildpack-app
[15:10:54] <arturo>	 dcaro: I'm curious, you mention that ansible for openstack is not reentrant. What does that means?
[15:11:06] <dcaro>	 I'm not sure nowadays
[15:11:33] <dcaro>	 but it means that it does not try to check what's the current state, and then make a plan to move it to the wanted state
[15:11:38] <dcaro>	 instead it just does api calls
[15:11:53] <dcaro>	 (so you have to put around things like 'check if it exists' 'if not then create it')
[15:12:14] <arturo>	 it supports things like `state: present` and `state: absent` for most modules (didn't check all)
[15:12:19] <arturo>	 example: https://docs.ansible.com/ansible/latest/collections/openstack/cloud/subnet_module.html
[15:12:29] <arturo>	 https://docs.ansible.com/ansible/latest/collections/openstack/cloud/subnet_module.html#parameter-state
[15:13:04] <dcaro>	 they might have changed it
[15:13:12] <dcaro>	 "# Create a new (or update an existing) "
[15:13:22] <dcaro>	 points that way yes
[15:13:26] <Rook>	 When I used ansible to manage infrastructure the main problem was that it was not aware of the state as described by dcaro. Meaning that if you removed things from your code, they would remain in your infrastructure. Making the code less of the source of truth. In terraform when you remove something from code, it will also be removed from the infrastructure as well. meaning if you run the code it should make the world match it
[15:13:31] <blancadesal>	 dcaro: ok, I will move mine to toolforge-repos on gitlab as well, but it might be nice to have them all as subfolders in one repo
[15:13:39] <Rook>	 That can be worked around in ansible, but it was a fair amount of extra work
[15:13:52] <dcaro>	 blancadesal: maybe we can 'label' them or something
[15:14:17] <dcaro>	 I think it might be easier for new users to 'just copy paste' as their repos would have similar paths
[15:14:31] <dcaro>	 no strong opinion though
[15:14:57] <arturo>	 Rook: that sounds like what happens with puppet, that you need to `ensure => absent`, then run, then drop the `ensure`.
[15:15:09] <Rook>	 Yeah, same idea
[15:15:50] <arturo>	 ok, I understand the difference
[15:16:19] <taavi>	 blancadesal: please don't manually create repos within /toolforge-repos/, have Striker create them for you
[15:16:44] <dcaro>	 in one you define the state you want to go to, and it does whatever it need, in the other you tell actions to take, that might get you to a different state depending on where you started
[15:16:57] <dcaro>	 blancadesal: oh yes, create a whole tool for it
[15:17:15] <blancadesal>	 dcaro: I wasn't thinking about the users here really but for us to have sample apps for testing with tekton xd
[15:17:46] <blancadesal>	 taavi: noted
[15:18:29] <dcaro>	 oh, well, if you don't aim to share it with uses as a getting started example, then might be less relevant
[15:19:16] <Rook>	 The cost benefit is basically:
[15:19:16] <Rook>	 Ansible only: single tool, no state file. Manually verify that you removed things you don't want.
[15:19:16] <Rook>	 Ansible+terraform: Less thinking. Have to manage a state file, more tools to know how they work.
[15:19:16] <Rook>	 (More thinking is bad)
[15:19:25] <blancadesal>	 dcaro: I mean, that would be a win-win xd
[15:20:13] <dcaro>	 blancadesal: agree :), I was thinking on periodically building those projects and checking that they still work, and using that as monitoring too
[15:23:32] <dcaro>	 that way we get awesome e2e monitoring, and keep the sample apps working and up to date
[15:24:31] <blancadesal>	 that's the dream
[15:28:35] <blancadesal>	 about that, the folks in the cncf slack channel kindly pointed me to the paketo sample apps https://github.com/paketo-buildpacks/samples
[15:28:35] <blancadesal>	 might not work out-of-the box for our setup, but could save us some time
[15:29:40] <dcaro>	 the paketo buildpacks use a different builder, but might be very similar yes
[15:32:37] <dcaro>	 these should work more or less out of the box (same builder) https://devcenter.heroku.com/articles/buildpacks#officially-supported-buildpacks
[15:34:48] <blancadesal>	 do they have sample apps for them?
[15:37:13] <dcaro>	 on each of the instructions they end up telling you to clone it yes https://devcenter.heroku.com/articles/getting-started-with-ruby?singlepage=true#clone-the-sample-app
[15:38:03] <taavi>	 I've tested all of them :-) https://wikitech.wikimedia.org/wiki/Help_talk:Toolforge/Build_Service#Notes_from_testing_all_Heroku_getting_started_templates
[15:38:08] <dcaro>	 he did :)
[15:38:31] <taavi>	 and we even fixed all of the issues I found iirc
[15:38:46] <dcaro>	 I think so yes
[15:40:33] <dcaro>	 I think this is all the repos: https://github.com/heroku/?q=getting-started&type=all
[15:42:07] <blancadesal>	 this is awesome!
[15:45:13] <dcaro>	 the extra thing that we should try to add to our samples is connecting to replicas and such, that is more toolforlge-specific (and will probably not change no matter which buildpack set you are using)
[16:13:38] * arturo offline
[16:14:15] * dcaro off
[16:40:16] <dcaro>	 Btw. Good job with the outage today 🎉 
[16:41:09] <RhinosF1>	 dcaro: thank you for the IR. I had a quick read but I'm on a train so dodgy internet. Sorry for taking so long to reply.
[16:44:58] <taavi>	 thanks all. see you on monday (and hopefully not before)
[16:47:00] <dhinus>	 thanks everyone and thanks dcar.o for the incident report!