[06:44:49] <dcaro>	 morning
[07:08:14] <dcaro>	 I'm guessing that the opentofu issue is related to taav.i's work on flavors it's complaining about unaplied changes being the changes the g4 flavors
[07:08:39] <dcaro>	 (the runbook there is not very helful)
[07:10:45] <dcaro>	 novafullstack is failing due to timeout waiting for the dns entry
[07:10:47] <dcaro>	 Timed out waiting for A record for fullstackd-20240612070616.admin-monitoring.eqiad1.wikimedia.cloud"
[07:37:03] <dcaro>	 fix for the designate-sink issue when cleaning certs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042128
[07:37:46] <dcaro>	 that might be the cause for the leaked dns records also
[07:39:14] <arturo>	 morning
[07:41:48] <blancadesal>	 morning
[08:05:10] <taavi>	 morning
[08:05:32] <taavi>	 tofu alert is indeed related to me, i will apply those changes later today or tomorrow, sorry for the noise
[08:05:40] <taavi>	 i'll reset-failed it for now
[08:08:27] <dcaro>	 np
[08:28:03] <taavi>	 quick review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042148/
[08:31:09] <arturo>	 taavi: LGTM. But there is something I don't fully understand: are we creating <projectid>.wmcloud.org domains?
[08:31:18] <arturo>	 if so, that doesn't make any sense to me
[08:32:13] <taavi>	 apparently yes: https://openstack-browser.toolforge.org/project/933ad3ff1e264aada56e6bc3ed9e08f3
[08:32:58] <arturo>	 why would we do that, if we are still enforcing project names to be dns-compatible
[08:33:27] <arturo>	 I'm fine with having project ids being uuid, but I don't fully understand this DNS change
[08:35:40] <dcaro>	 that might be why fullstack started failing, it looks for admin-monitoring domain
[08:36:35] <taavi>	 no existing projects have had their IDs changed
[08:37:13] <dcaro>	 true xd, it does not show any dns zones either :/
[08:41:07] <dcaro>	 anyone knows in detail how mdns<->pdns interact?
[08:42:38] <arturo>	 no 
[08:42:54] <arturo>	 not me :-)
[08:43:35] <dcaro>	 I think that the pdns servers pull zone transfers from mdns servers periodically
[08:45:22] <dcaro>	 but it's kind of a guess, I also think mdns pulls the data from the DB when doing so, and pdns stores it in another db, but I want to test/monitor the dumping of the zone from mdns
[08:45:31] <dcaro>	 see if it's dumping the right data
[08:45:55] <arturo>	 yes, the double DB thing I think is correct
[08:46:13] <arturo>	 the first DB being the openstack DB, the second DB being the internal pdns mysql
[08:47:09] <dcaro>	 yep, it seems also that each project zone is pulled by pdns from a different cloudcontrol
[08:47:36] <dcaro>	 this is interesting
[08:47:57] <dcaro>	 https://www.irccloud.com/pastebin/7JNNWa6r/
[08:48:09] <dcaro>	 that's cloudcontrol1006 private ip
[08:48:36] <arturo>	 oh yeah, I think that was one of the major rabbit holes when we introduced cloud-private and moved the DNS servers to it
[08:49:46] <arturo>	 I believe the configuration semantics on designate may not be very clear, on what to configure where. At least it wasn't trivial for me to understand and operate
[08:57:42] <dcaro>	 od,
[08:57:50] <dcaro>	 oh, that log entry was me trying stuff xd
[08:58:45] <dcaro>	 🤦‍♂️ turns out ns1 is the current authoritative server, so any request for axfr to ns0 will fail...
[08:59:01] <dcaro>	 `root@cloudcontrol1006:~# host -l wmcloud.org ns1.openstack.eqiad1.wikimediacloud.org`
[08:59:04] <dcaro>	 ^that works
[09:03:04] <dcaro>	 hmm... so... shouldn't every project in openstack have a dns zone associated with it?
[09:03:49] <dcaro>	 https://usercontent.irccloud-cdn.com/file/vjNzqP8I/image.png
[09:04:58] <arturo>	 I think so
[09:05:11] <arturo>	 mmm wait, no
[09:05:19] <arturo>	 is under cloudinfra ownership
[09:05:56] <taavi>	 in theory projects should have '{project_id}.wmcloud.org.' and 'svc.{project_id}.eqiad1.wikimedia.cloud.' zones. the actual VM records are in a zone that's in the cloudinfra project
[09:06:25] <arturo>	 ok, yeah, that makes sense
[09:07:08] <dcaro>	 found them yep, under cloudinfra dns
[09:07:22] <dcaro>	 https://usercontent.irccloud-cdn.com/file/5KlR0ch7/image.png
[09:07:53] <dcaro>	 so that would be exposed as eqiad1.wikimedia.cloud zone I guess?
[09:08:50] <dcaro>	 okok, that makes sense
[09:08:51] <arturo>	 yes
[09:08:51] <dcaro>	 https://www.irccloud.com/pastebin/6ubop7iJ/
[09:08:55] <dcaro>	 pdns does not have them
[09:09:26] <dcaro>	 mdns does
[09:09:29] <dcaro>	 https://www.irccloud.com/pastebin/CsUrJOgP/
[09:09:55] <dcaro>	 step by step, xd, not to check how/when pdns pulls stuff for eqiad1.wikimedia.cloud from mdns
[09:11:13] <dcaro>	 *now to check
[09:12:06] <dcaro>	 hmm, I don't see it in the logs, though I see other checks for other zones
[09:12:18] <dcaro>	 ex: `Jun 12 09:11:08 cloudservices1006 pdns_server[1430]: Domain 'adiutor.wmcloud.org' is fresh (no DNSSEC), serial is 1709057136 (checked master 172.20.3.18:5354)`
[09:13:58] <dcaro>	 how does pdns what zones to pull? hmmm...
[09:16:11] <dcaro>	 it's in the db under the `domains` table, but who puts it there? mdns calling the web endpoint of pdns maybe?
[09:16:47] <taavi>	 designate talks to the powerdns api
[09:19:45] <dcaro>	 https://www.irccloud.com/pastebin/xb8jAqYH/
[09:19:50] <dcaro>	 ^ so the domain is there
[09:21:27] <dcaro>	 and doing a manual zone transfer from cloudservices to each of the servers for that domain works (and brings the fullstack entries)
[09:26:45] <dcaro>	 uuuhhh, there's a little cli
[09:28:09] <dcaro>	 Jun 12 09:27:58 cloudservices1006 pdns_server[1430]: XFR-in zone: 'eqiad1.wikimedia.cloud', primary: '172.20.3.18', initiating transfer
[09:28:16] <dcaro>	 https://www.irccloud.com/pastebin/FG0bJ7ZA/
[09:28:54] <dcaro>	 oh yep
[09:28:57] <dcaro>	 https://www.irccloud.com/pastebin/3AezNe7d/
[09:29:35] <dcaro>	 okok, so I know the mechanism xd, now let's see if that unblocked things or if it will still not update
[09:32:25] <dcaro>	 (I doubt it though)
[09:34:10] <dcaro>	 yep, still not pulling them by itself :/
[09:35:11] <dcaro>	 I'll try reloading the zones
[09:41:48] <dcaro>	 hmpf... I'll try restarting pdns
[09:45:21] <dcaro>	 hmm.... maybe a timezone issue?
[09:45:36] <dcaro>	 Jun 12 09:44:08 cloudservices1006 pdns_server[256943]: Domain 'eqiad1.wikimedia.cloud' is fresh (no DNSSEC), serial is 1718117983 (checked master 172.20.1.3:5354)
[09:45:54] <dcaro>	 but in the db the serial is `1718185448`
[09:50:33] <taavi>	 draining cloudvirt1031 to move that to OVS
[09:54:39] <dcaro>	 I think that the 'last_check' from the DB is not the serial :/, not sure where pdns stores the serial
[09:55:49] <dcaro>	 but the serial for the eqiad1.wikimedia.cloud zone is not changing, so it seems to be skipping pulling it every time
[09:58:16] <dcaro>	 hmm, from designate database, the serial of the zone there is `1718117983`
[09:59:06] <dcaro>	 that's yesterday
[09:59:07] <dcaro>	 https://www.irccloud.com/pastebin/AY5tWzif/
[09:59:20] <dcaro>	 so someone is not updating the serial there xd
[09:59:42] <dcaro>	 though it should
[09:59:44] <dcaro>	 https://www.irccloud.com/pastebin/UPDl2V1N/
[10:00:21] <dcaro>	 designate-worker is having issues connecting to the db
[10:00:25] <dcaro>	 2024-06-12 09:58:32.450 2417448 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
[10:00:49] <dcaro>	 Jun 12 09:40:08 cloudcontrol1006 designate-worker[2417448]: 2024-06-12 09:40:08.249 2417448 WARNING designate.worker.tasks.zone [None req-b737cb2f-1a30-4411-994e-45e0dede8b3d - - - - - -] Found 2 zones PENDING for more than 455 seconds
[10:00:53] <dcaro>	 Jun 12 09:38:08 cloudcontrol1006 designate-worker[2417448]: 2024-06-12 09:38:08.278 2417448 INFO designate.worker.tasks.zone [None req-1b1b3dd9-aa8d-4c98-b52d-cd23119df7ef - - - - - -] Attempting to UPDATE zone_name=eqiad1.wikimedia.cloud. zone_id=67603ef4-3d64-40d6-90d3-5b7776a99034
[10:00:56] <dcaro>	 okok, looking
[10:09:00] <dcaro>	 oh my, designate-central is also having db issues
[10:10:08] <dcaro>	 hmm, it was having rabbitmq issues yesterday
[10:10:57] <dcaro>	 new log: Jun 12 10:10:08 cloudcontrol1005 designate-worker[3714005]: 2024-06-12 10:10:08.377 3714005 INFO designate.worker.tasks.zone [None req-cf0e19cb-28cb-4c56-9e94-82e9ad4bcf2e - - - - - -] Attempting to UPDATE zone_name=eqiad1.wikimedia.cloud. zone_id=67603ef4-3d64-40d6-90d3-5b7776a99034
[10:11:02] <dcaro>	 but the serial stays the same
[10:11:10] <dcaro>	 https://www.irccloud.com/pastebin/f43iJgB9/
[10:20:34] * dcaro lunch
[11:15:10] <arturo>	 heads up, I will be deploying kyverno pod policy rules (mutation + validatio in _audit_ mode) soon
[11:15:10] <arturo>	 https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/18
[11:22:31] <dcaro>	 let me restart all the designate processes everywhere...
[11:22:35] <dcaro>	 just to make sure
[11:23:37] <dcaro>	 🤦‍♂️ that seems to have done the trick.... not sure which process is the one that was stuck though
[11:24:26] <arturo>	 did you use a cookbook for that?
[11:24:35] <taavi>	 cloudvirt1031 is now running on OVS, and at least the canary instance is able to talk to everything just fine
[11:24:47] <dcaro>	 cumin
[11:25:03] <dcaro>	 taavi: \o/ yay!
[11:25:03] <arturo>	 taavi: great
[11:26:19] <taavi>	 I will reimage one or two more cloudvirts and then publish the new flavors
[11:26:41] <taavi>	 I found out that the time you spend waiting for the reimages is really good time for writing the IR :D
[11:26:57] <arturo>	 do you have a link?
[11:27:16] <taavi>	 https://wikitech.wikimedia.org/wiki/Incidents/2024-06-11_WMCS_Ceph, still a work-in-progress
[11:27:28] <arturo>	 thanks
[11:31:12] <dcaro>	 quick review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042128
[11:31:32] <dcaro>	 for the designate-sink when cleaning up records
[11:32:45] <taavi>	 did some host key change recently? or why is that needed now?
[11:33:30] <dcaro>	 I think so, well a while ago I guess https://phabricator.wikimedia.org/T367235#9883242
[11:34:06] <dcaro>	 might have been broken for a bit
[11:39:31] <dcaro>	 hm... is that file even being installed?
[11:39:36] <dcaro>	 puppet showed no diff
[11:42:55] <dcaro>	 I think it's using a directory with recurse true
[11:45:16] <dcaro>	 hm, now I saw the diff 
[11:45:19] <dcaro>	 🤷
[11:45:24] <taavi>	 uhhhhhhh
[11:45:47] <taavi>	 how did openstack move some vms to 1031 when I drained 1032?
[11:46:46] <arturo>	 the flavor trick did not work as expected to prevent scheduling?
[11:46:53] <taavi>	 yes
[11:47:13] <arturo>	 maybe you need to leave the HV out of the schedling pool for now?
[11:47:46] <arturo>	 Kubernetes node tools-k8s-worker-nfs-52 is not ready
[11:47:57] <arturo>	 I assume this worker node was migrated to 1032?
[11:48:06] <taavi>	 yeah, that's one of the affected nodes
[11:48:16] <taavi>	 I'll drain 1031 to get those VMs back to linuxbridge hosts
[11:48:47] <arturo>	 aren't OVS VMs supposed to talk with linuxbridge VMs just fine?
[11:49:28] <taavi>	 the issue is the driver flip that needs to happen in the database
[11:49:45] <taavi>	 and now the VMs are not migrating off of 1031 properly
[11:50:04] <arturo>	 how many VMs are affected?
[11:50:16] <arturo>	 it may be better to flip the driver and move on?
[11:50:30] <taavi>	 6 AFAICS
[11:50:39] <taavi>	 yeah, I'll move these to OVS
[12:01:26] <arturo>	 heads, up, maintain-kubeusers is creating kyverno policies for each account in tools
[12:02:18] <taavi>	 arturo: hmm, now the neutron ports are not coming up even after a full cycle of stop VM, undefine virsh domain, restart VM
[12:03:07] <arturo>	 :-(
[12:04:23] <arturo>	 I'm not sure if that dance forces recreation of the ports
[12:04:29] <arturo>	 maybe try restarting nova-compute on the HV
[12:04:41] <arturo>	 and maybe it will re-check everything on startup
[12:06:06] <arturo>	 maintain-kubeusers is failing liveness probes and I can't tell why
[12:06:41] <taavi>	 Jun 12 12:05:59 cloudvirt1031 nova-compute[29604]: 2024-06-12 12:05:59.147 29604 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: {"details":"cannot delete QoS row fffc38df-c278-424f-a645-6baaf739df61 because of 1 remaining reference(s)","error":"referential integrity violation"}
[12:06:41] <taavi>	 Jun 12 12:05:59 cloudvirt1031 nova-compute[29604]: 2024-06-12 12:05:59.164 29604 INFO os_vif [None req-6d06f5c8-0f7d-4296-a326-3c7b51ef001e novaadmin admin - - default default] Successfully unplugged vif
[12:06:41] <taavi>	 VIFOpenVSwitch(active=False,address=fa:16:3e:b0:71:81,bridge_name='br-int',has_traffic_filtering=True,id=3db5c4b7-e172-4cee-a4ca-f05eb8b48451,network=Network(7425e328-560c-4f00-8e99-706f3fb90bb4),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tap3db5c4b7-e1')
[12:07:32] <arturo>	 taavi: we will need to investigate a bit more what that means
[12:07:55] <arturo>	 I suggest you pause the migration, if you had any other in-flight processes
[12:08:11] <taavi>	 I don't have anything else running at the same time
[12:08:17] <arturo>	 ok
[12:08:21] <taavi>	 that error was when trying to start a now-broken VM
[12:08:26] <arturo>	 ok
[12:08:38] <arturo>	 what was the error when trying to migrate them off the now-broken HV?
[12:09:09] <dcaro>	 My router just decided not to work... Rebooting 
[12:09:19] <taavi>	 wmcs-drain-hypervisor: 2024-06-12 11:48:50,857: WARNING: tools-k8s-worker-nfs-52 (9dce488f-fd23-4c26-bee9-d9e1fd0d8901) didn't actually migrate, got scheduled on the same hypervisor. Will try again!
[12:36:25] <taavi>	 ok, all VMs are back up
[12:37:41] <arturo>	 excellent
[12:37:50] <taavi>	 now the question is again how did this happen in the first place
[12:38:48] <taavi>	 the host (before I removed it again from the scheduling pools) was in the 'ceph' and 'network-ovs' aggregates 
[12:39:25] <taavi>	 the network-ovs aggregate has properties: network-agent='ovs' set
[12:39:33] <arturo>	 the original problem is the VMs could not migrate off the OVS-enabled HV
[12:39:37] <arturo>	 no?
[12:39:47] <taavi>	 so, if the flavor has `aggregate_instance_extra_specs:network-agent='linuxbridge'` set, how did that happen
[12:39:53] <taavi>	 no
[12:40:07] <taavi>	 the original problem is why did draining cloudvirt1032 (linuxbridge) move these VMs to 1031 (OVS)
[12:40:35] <arturo>	 I see
[12:46:07] <taavi>	 it'd be interesting to see the debug logs from https://github.com/openstack/nova/blob/7dc4b1ea627d864a0ee2745cc9de4336fc0ba7b5/nova/scheduler/filters/aggregate_instance_extra_specs.py
[12:50:09] <taavi>	 arturo: do you need help with the maintain-kubeusers issue?
[12:50:48] <arturo>	 taavi: I think I found the problem and I'm writing a patch. Will send it your way for review soon
[13:05:10] <arturo>	 taavi: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/41
[13:06:39] <taavi>	 doesn't moving `handle_k8s_liveness_probe` to the start again result in the fact that a crash during reconciling would go unnoticed? (unless we add an alert on the restart rate)
[13:08:01] <arturo>	 how would it go unnoticed? each liveness check deletes the file, so if the daemon crashes, the liveness probes will eventually fail (after the new threshold)
[13:08:53] <arturo>	 the diff wrt. before is that originally the file was created on the outer loop
[13:09:01] <arturo>	 now the file is created on each account 
[13:09:11] <arturo>	 (plus in the outer loop, in this patch)
[13:10:19] * arturo food time
[13:35:26] <andrewbogott>	 taavi: does ovs-vsctl del-port <port name> 4 cause the VM to get assigned a new IP? Or does that persist across ports somehow?
[13:36:33] <taavi>	 andrewbogott: no, IP assignment is handled by neutron. `ovs-vsctl del-port` deletes the virtual interface from the cloudvirt so that the neutron agent can re-create it with the correct settings
[13:37:01] <andrewbogott>	 Oh, I see, deleting the VM port not the neutron port. ok
[13:39:47] <taavi>	 andrewbogott: if you have any ideas why the VMs were migrated to 1031 despite the flavor and aggregate settings I'd be very happy to hear them
[13:42:45] <andrewbogott>	 taavi: have an example VM I can look at?
[13:43:21] <taavi>	 any of the non-canary ones now on 1031, for example 97644226-43df-4520-8e45-6ee426974402 (toolsbeta-test-k8s-worker-10)
[13:43:27] <andrewbogott>	 also, have you rearranged aggregate membership since that happened?
[13:44:29] <taavi>	 I moved 1031 from ['ceph', 'network-ovs'] to ['maintenance'].
[13:45:14] <taavi>	 the cookbook-generated backup of the aggregate config is at cloudvirt1031:/etc/wmcs_host_aggregates.yaml
[13:45:56] <andrewbogott>	 right, so it shouldn't be eligible for anything to be scheduled there, regardless of linuxbridge/ovs
[13:46:44] <taavi>	 that move happened as an attempt to drain the hypervisor to move the VMs back to linuxbridge nodes
[13:47:22] <andrewbogott>	 ah, so what were the aggregates for 1031 when you saw the bad scheduling?
[13:47:40] <andrewbogott>	 ceph + network-ovs ?
[13:48:27] <taavi>	 yes
[13:49:54] <andrewbogott>	 I'm thinking maybe we need aggregate_instance_extra_specs:ceph='true', network-agent='linuxbridge' instead of aggregate_instance_extra_specs:ceph='true', aggregate_instance_extra_specs:network-agent='linuxbridge'
[13:50:21] <andrewbogott>	 docs say 'Multiple values can be given, as a comma-separated list.'
[13:50:36] <andrewbogott>	 which is ambigious but /could/ mean the former
[13:50:56] <andrewbogott>	 of course there are no examples of multiple extra specs
[13:53:19] <taavi>	 i interpreted that to apply for multiple values of the same key
[13:54:03] <andrewbogott>	 yeah it could mean either
[13:54:10] <andrewbogott>	 I'm trying to find the source that parses that...
[14:00:32] <andrewbogott>	 https://www.irccloud.com/pastebin/lCC26T76/
[14:00:51] <andrewbogott>	 looks like it's expecting multiple extra_specs entries, unless there's some preprocessing I've missed
[14:01:15] <andrewbogott>	 oh hang on that code is in a deprecated function
[14:03:41] <andrewbogott>	 I'll just do an experiment in codfw1dev instead
[14:17:34] <andrewbogott>	 taavi: my tests say that the way you did it (multiple aggregate_instance_extra_specs options) works fine.
[14:17:40] <andrewbogott>	 So I have no explanation for the bad scheduling
[14:17:56] <taavi>	 hmmm. does migrating an instance somehow apply different rules than creating new instances does?
[14:19:26] <andrewbogott>	 well... maybe
[14:19:56] <andrewbogott>	 iirc the original requested specs are persisted someplace. So it's possible that when you migrated it it applied the /original/ set of options and not the current options
[14:20:03] <andrewbogott>	 Let's see if I can find that someplace
[14:20:12] <andrewbogott>	 ideally it would persist the flavor rather than the specs that the flavor had at scheduling time...
[14:20:54] <dcaro>	 andrewbogott: I'm trying to play with the new credentials (toolsbeta domain) that you created, but I'm having issues trying to do things there xd, do you have some time to help me find my way around?
[14:21:36] <dcaro>	 (I can wait until the meeting in a bit)
[14:21:42] <andrewbogott>	 dcaro: I think we're about to have a meeting and then another meeting
[14:21:55] <andrewbogott>	 but yeah we can talk about it a bit in the meeting
[14:21:59] <dcaro>	 ack
[14:24:29] <andrewbogott>	 taavi, big paste:
[14:24:31] <andrewbogott>	 https://www.irccloud.com/pastebin/kFRvtBLZ/
[14:24:49] <andrewbogott>	 If the migration is using those specs rather than referring back to the flavor it would explain what you saw
[14:25:09] <taavi>	 "extra_specs": {"aggregate_instance_extra_specs:ceph": "true", "quota:disk_read_iops_sec": "5000"
[14:25:09] <andrewbogott>	 My recollection is that it does, but we need to do more tests to be sure
[14:25:20] <taavi>	 hrmh
[14:25:33] <andrewbogott>	 right, which would mean that changing the flavor after the fact does nothing :(
[14:25:38] <taavi>	 yep
[14:25:52] <andrewbogott>	 But I'm not 100% sure that's what's happening, it's just a a good theory
[14:25:52] <taavi>	 that's also going to make removing the aggregates after the fact significantly more annoying
[14:25:59] <andrewbogott>	 yeah
[14:29:50] <dcaro>	 andrewbogott: I think I got something "working" :)
[14:30:13] <andrewbogott>	 that sounds ominous
[14:59:55] <dhinus>	 quarry is having issues since yesterday (T365374). I just merged https://github.com/toolforge/quarry/pull/46 that might or might not fix it
[14:59:56] <stashbot>	 T365374: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374
[15:00:32] <dhinus>	 I'm not sure how to deploy that change though, there are some instructions at https://github.com/toolforge/quarry/blob/main/README.md#deploying-to-production
[15:28:37] <arturo>	 FYI I detected kyverno getting OOM-killed because the flooding of new policies that I deployed earlier today, so I removed CPU and MEM limits with https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/329
[15:33:59] <taavi>	 should we bump those limits instead of removing them entirely?
[15:34:10] <taavi>	 or increment the replicas or something?
[15:36:27] <arturo>	 taavi: the upstream recommendation is to remove the limits
[15:37:20] <arturo>	 source: https://kyverno.io/docs/installation/scaling/
[16:12:48] <arturo>	 I need review for a couple of emergency patches
[16:13:21] <arturo>	 first is this one https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/42 for which I'm recording cassettes ATM
[16:13:48] <arturo>	 next would be https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/43
[16:18:44] <arturo>	 tools-k8s-control-9 was briefly down a few minutes ago
[16:19:31] <dhinus>	 andrewbogott: can you have a look at the quarry deployment? I left some notes in the comments at T365374
[16:19:32] <stashbot>	 T365374: [bug] Access denied for user 'quarry'@'172.16.2.72' (using password: NO) - https://phabricator.wikimedia.org/T365374
[16:19:34] <dcaro>	 arturo: I'm happy to stamp if you want, but I'll need some time/help to review propertly
[16:19:55] <dcaro>	 arturo: do you want me to stamp?
[16:20:02] <andrewbogott>	 dhinus: yeah, I was about to ask if you wanted to work late and sort that but if you have to go I can see what I can see
[16:20:41] <arturo>	 dcaro: yes please, unless you see an obvious problem
[16:21:08] <andrewbogott>	 dhinus: is the fix merged to master or is there a topic branch I should deploy?
[16:21:49] <dhinus>	 andrewbogott: I'm still around for a bit but I don't want to break things even more :) though it looks like quarry right now is up but not usable, so it can't get much wors :P
[16:22:33] <dhinus>	 I'm tempted to "git reset --hard origin/main" plus "deploy.sh" but I'm not too confident it will work
[16:23:22] <dcaro>	 arturo: done
[16:23:26] <arturo>	 thanks
[16:23:54] <Rook>	 Have you tried restarting the deployments?
[16:23:56] <dhinus>	 andrewbogott: it's all merged
[16:24:47] <andrewbogott>	 the source is in gitlab or github?
[16:24:50] <dhinus>	 Rook: no, that might fix it temporarily but I'd also like to get the latest patches deployed
[16:24:51] <andrewbogott>	 I guess I can figure that out
[16:24:59] <dhinus>	 andrewbogott: https://github.com/toolforge/quarry
[16:25:11] <andrewbogott>	 weird that github search didn't find that :/
[16:25:20] <Rook>	 Looks like the last patch to go out was 6 days ago
[16:25:36] <Rook>	 You're trying to deploy main?
[16:26:05] <andrewbogott>	 'Move toolsdb creds to the correct config file (#46)' is the fix?
[16:26:17] <arturo>	 tools-k8s-control-9 was briefly unavailable because an unknown load spike. It has 20 loadavg15 (vs 1 loadavg1)
[16:26:43] <dhinus>	 yes main
[16:26:52] <dhinus>	 andrewbogott: that one plus #47
[16:27:08] <Rook>	 Deploying...
[16:27:11] <dhinus>	 thanks
[16:27:30] <dhinus>	 I tried to "git pull" but /home/rook/quarry was in a different branch and that got me confused
[16:27:42] <dhinus>	 what's your workflow?
[16:28:00] <andrewbogott>	 OK, I have a checkout of master, is it really as simple as 'bash deploy.sh' or do I need to set up secrets first?
[16:28:17] <Rook>	 Workflow is "make change > open PR > checkout branch to bastion > deploy"
[16:28:29] <Rook>	 You'll need git-crypt decrypted
[16:28:51] <Rook>	 main is deployed. I just restarted redis for fun as well
[16:29:26] <Rook>	 I'm getting a test query back, is it working for others?
[16:29:54] <Rook>	 Sorry full workflow: "make change > open PR > checkout branch to bastion > deploy > merge PR"
[16:30:21] <andrewbogott>	 Rook: when you have 5 minutes can you add the missing steps to https://github.com/toolforge/quarry/blob/main/README.md#deploying-to-production ?
[16:30:27] <arturo>	 I need to step out, I'll deploy https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/43 tomorrow, I don't think is really that urgent
[16:30:31] <andrewbogott>	 (the missing steps being, I think, copy secrets file and decrypt)
[16:30:48] <Rook>	 Yeah I can do that
[16:30:51] <andrewbogott>	 thanks
[16:31:28] <andrewbogott>	 I feel like there's also a part that's, like, 'check to make sure the github action worked and there's a new container to deploy'
[16:33:04] <andrewbogott>	 dhinus: quarry looking good for now?
[16:33:06] <arturo>	 I have the feeling the Toolforge k8s API server is a bit overloaded. It may be because the kyverno policies
[16:33:10] <Rook>	 Yes, in my mind that was "open PR" I guess that is "Open PR and make sure all the actions pass" that felt wordy
[16:33:27] <dcaro>	 arturo: we can try to setup some graphs to keep track of that
[16:33:58] <arturo>	 now seeing errors
[16:34:29] <arturo>	 I'll drop the kyverno policies
[16:35:28] <arturo>	 I think we are in an outage
[16:35:35] <andrewbogott>	 arturo: do you think the overload was transitional or is kyverno really resource-intensive?
[16:35:40] <arturo>	 I can't barely interact with the k8s API
[16:36:21] <dhinus>	 Rook: thanks for the fix, quarry seems to be working fine :) the main thing I would add to the workflow is "checkout branch to bastion", because when I saw the branch was not "main" I had worries I was not in the right directory
[16:36:31] <andrewbogott>	 openstack-browser isn't loading, which is my standard toolforge canary
[16:36:35] <arturo>	 andrewbogott: I can't tell. I think kyverno might be eating too much resources. I removed CPU/MEM restrictions, but the webhook to evaluate policies may be eating API-server resources as well
[16:36:52] <arturo>	 please somebody declare an incident
[16:37:02] <andrewbogott>	 arturo: ok.  so my understanding is that you're trying to roll back but the API isn't responsive enough to do so?
[16:37:09] <arturo>	 exactly
[16:37:12] <andrewbogott>	 ok.
[16:37:25] <andrewbogott>	 So -- I will be IC just as soon as I can find that doc page again :)
[16:37:44] <Rook>	 It doesn't have to be in that order. Though I usually deploy the branch before merging to see it live, as sometimes that uncovers errors. Then I don't have to deal with reverting the commit. Though that is all more preference than necessity. At any rate, I can try to update the documentation some to that end.
[16:38:03] * andrewbogott curses wikitech search for the millionth time
[16:38:37] <andrewbogott>	 incident meeting: https://meet.google.com/qsb-ixrb-qby
[16:38:49] <andrewbogott>	 arturo, dhinus, dcaro, taavi, etc ^
[16:39:35] <andrewbogott>	 dhinus: link me to the incident commander guide?  I absolutely cannot find it
[16:39:48] <dhinus>	 https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Incident_Response_Process
[16:39:59] <andrewbogott>	 thx
[16:40:20] <dhinus>	 andrewbogott: quarry is looking fine
[16:40:51] <andrewbogott>	 great
[16:40:57] * andrewbogott drops everything for toolforge outage
[16:42:12] <arturo>	 https://www.irccloud.com/pastebin/gm1HBE7a/
[16:42:26] <arturo>	 kubectl scale deploy kyverno-admission-controller -n kyverno --replicas 0
[16:52:22] <dcaro>	 https://usercontent.irccloud-cdn.com/file/PF1DgO8C/image.png
[16:54:17] <dcaro>	 https://usercontent.irccloud-cdn.com/file/LNuUS7TJ/image.png
[16:56:08] <dhinus>	 sorry but I can't help with the incident right now
[16:57:27] <andrewbogott>	 dhinus: it's OK, we have plenty of hands at the moment
[16:57:28] <dcaro>	 kube-system has no suspicious resource peaks
[16:57:30] <dcaro>	 https://usercontent.irccloud-cdn.com/file/fcgFfBGM/image.png
[16:58:44] <dcaro>	 https://www.irccloud.com/pastebin/XkPuKFcU/
[17:05:13] <arturo>	 https://www.irccloud.com/pastebin/QWNp3zmB/
[17:30:05] <andrewbogott>	 Rook: thank you for appearing and deploying, btw :)
[17:32:38] * arturo offline
[17:32:52] * dcaro out
[17:33:39] <dcaro>	 thanks everyone for the quick debugging and fixing, quite an interesting week :)
[17:33:45] <Rook>	 Np
[17:33:59] * Rook returns to carpentry 
[17:37:05] <dcaro>	 andrewbogott: I updated https://phabricator.wikimedia.org/T363983 with some notes from today's meeting, feel free to add more stuff/change stuff if I got anything wrong or missed anything (same for anyone too xd)