[01:01:58] * bd808 off
[08:07:29] I'm upgrading mariadb on clouddb2002-dev (so that's for labtestwikitech)
[08:09:59] morning
[08:21:30] morning
[08:41:58] morning
[08:54:03] dcaro: in the harbor upgrade docs, we have "Create a backup of the /srv/ops/harbor/data/* folders inside the /srv/ops/harbor/data/ directory (it is a volume mount)." Do you remember if we meant creating a backup inside that same folder, or somewhere else?
[08:55:09] inside the same folder, there's probably no space outside to hold a full copy (in the OS volume)
[08:55:34] is there space in the same volume? (otherwise we can create a new volume temporarily and use that)
[08:56:34] another option is creating a snapshot on horizon
[08:57:13] there seems to be enough space `/dev/sdb 40G 6.1G 32G 17% /srv/ops/harbor/data`
[08:57:19] ack
[09:04:05] I just discovered that you can see a very detailed process tree of what's going on with tools jobs in k8s, inside the worker node
[09:04:07] https://usercontent.irccloud-cdn.com/file/nvyoThYb/image.png
[09:04:32] is that from within the pods? should be only available from the node
[09:04:54] as in, that tree is run inside a pod?
[09:06:28] that's from the node directly right? (there's even root processes there)
[09:06:38] yes, this is `htop` in the worker node
[09:06:51] you can strace and such also from there (that's how I debugged the connection issues to wikis and such)
[09:06:58] nice
[09:06:59] they are just processes
[09:07:13] you can nsenter and such too ;)
[09:14:34] are you playing with crictl? any helpful tips?
[09:17:44] no, I was just checking if the worker VM was healthy after the network change + reimage of the hypervisor
[09:17:46] hmm... we should add `CONTAINER_RUNTIME_ENDPOINT=unix:///run/containerd/containerd.sock` to the default bashrc, it's a bit annoying having to pass it every time
[09:27:52] this should help: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016719
[09:30:05] +1'd
[09:30:08] quick review here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016720
[09:30:31] +1d
[09:32:24] thanks
[09:51:35] huh, I'm seeing this warning on tools-k8s-worker-nfs-53: Warning: The current total number of facts: 3156 exceeds the number of facts limit: 2048
[09:51:37] from puppet
[09:54:25] it seems it's just a soft limit, it does not trim the facts or anything
[10:48:20] quick review here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016743
[10:57:20] +1
[10:57:30] please have a look at the 'The systemd unit wmf_auto_restart_virtlogd.service on node cloudvirt1036 has been failing for more than two hours.' alert
[10:58:03] 1036, arturo is that one of the ones you were working with?
[10:58:21] yes, I can take a look soon
[10:58:29] thanks
[11:11:11] Apr 03 00:52:00 cloudvirt1036 wmf-auto-restart[253589]: INFO: 2024-04-03 00:52:00,461 : Service virtlogd not present or not running
[11:11:12] weird
[11:12:34] https://www.irccloud.com/pastebin/YHokFdMJ/
[11:12:47] have there been some changes I missed?
[11:13:14] yes, puppet 7 upgrades. try toolsbeta-puppetserver-1.toolsbeta.eqiad1.wikimedia.cloud
[11:13:21] thanks
[11:17:52] is /srv/git/operations/puppet what /var/lib/git/operations/puppet was before?
[11:26:37] yes
[11:31:01] did someone do something with the cloudinfra-internal-puppetmaster-1 CA recently? I'm seeing this https://phabricator.wikimedia.org/P59316 when trying to do a 'puppet node clean'
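A minimal sketch of the backup step discussed at [08:54:03]–[08:57:13] above, keeping the copy inside the data volume as agreed there; the archive name is made up for illustration and the real procedure (including stopping Harbor first) is whatever the upgrade docs say:

```
# on the harbor host, with harbor stopped; keep the backup inside the data volume itself,
# excluding any previous backup archives so tar does not try to include them
cd /srv/ops/harbor
sudo tar --exclude='data/backup-*.tar.gz' -czf "data/backup-$(date +%F).tar.gz" data/
```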
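A small sketch tying together the crictl endpoint and the nsenter/strace debugging mentioned above ([09:06:51]–[09:17:46]); `<container-id>` is a placeholder and this assumes a root shell on the worker node:

```
# as root on a tools-k8s-worker node: point crictl at containerd (the endpoint from the chat)
export CONTAINER_RUNTIME_ENDPOINT=unix:///run/containerd/containerd.sock
crictl ps                                   # list the running containers
pid=$(crictl inspect -o go-template --template '{{.info.pid}}' <container-id>)
# container processes are plain host processes, so strace/nsenter work from here,
# e.g. enter the container's network namespace to look at its sockets
nsenter -t "$pid" -n ss -tnp
```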
[11:34:05] I'm upgrading toolsbeta harbor, it might be down for a bit
[11:36:49] we have discussed previously the possibility of having kind of a test suite that covers pretty much everything that can/can't be done on toolforge, to use on things like upgrades
[11:36:54] do we have a ticket opened for that?
[11:37:11] I mention this in the context of T279110
[11:37:12] T279110: Replace PodSecurityPolicy in Toolforge Kubernetes - https://phabricator.wikimedia.org/T279110
[11:37:47] and how I would like to have $something to help me verify that everything works after the PSP migration
[11:39:40] I guess the closest that we have is this:
[11:39:41] modules/toolforge/templates/automated-toolforge-tests.yaml.erb
[11:42:38] taavi: I fixed some issues with the cert yesterday, it was cached T361563
[11:42:39] T361563: [cloudinfra] puppet CA cert expired - https://phabricator.wikimedia.org/T361563
[11:43:31] arturo: T357977
[11:43:35] T357977: [toolforge.infra] create fullstack tests - https://phabricator.wikimedia.org/T357977
[11:45:31] dcaro: thanks
[11:46:44] toolsbeta harbor is up again
[11:48:58] \o/
[11:49:16] would it be too short notice to upgrade tools harbor tomorrow?
[11:49:38] I wonder where `puppet node clean` has the expired certificate then
[11:49:38] how much total downtime?
[11:50:29] if everything goes down as on toolsbeta, 1-2 mins
[11:53:29] taavi: strace?
[11:53:55] blancadesal: then I'd say tomorrow is just fine
[11:54:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016751 quick review here?
[11:54:13] dcaro: oh good idea. I was trying both --debug (which "logs debug information") and --verbose (which "log verbosely"), which were not very helpful
[11:54:47] arturo: +1d
[11:55:11] dcaro: thanks
[12:01:11] found it, /srv/puppet/server/ssl/certs/ca.pem
[12:02:07] ok, I've sent an email announcing the harbor upgrade for tomorrow 9 UTC
[12:07:04] 👍 to both xd
[12:07:39] also strace pro tip: make sure you're running the correct command you want to debug and not one with a required parameter missing
[12:11:27] hahahah, yep, had that happen to me too xd
[12:41:29] hmm, now there's a second person reporting T361519
[12:41:30] T361519: [buildservice] "failed to create fsnotify watcher: too many open files" and "unable to open destination: open /tekton/home/.docker/config.json: permission denied" - https://phabricator.wikimedia.org/T361519
[12:43:56] interesting, I have seen that issue only on lima-kilo when starting more than one build at a time
[12:44:32] it'd be very helpful if the `toolforge build start` output included the build name
[12:45:01] my guess is that there's some specific node having an issue, given that most builds are running on the few non-NFS nodes
[12:46:25] Is labtesthorizon login blocked for everyone, or just me?
[12:47:42] broken for me too. I'
[12:47:45] ll have a look
[12:51:10] Rook: try now?
[12:52:24] I'm in, thanks!
[12:55:36] taavi: hmm, indeed
[12:56:08] It does if you start detached https://www.irccloud.com/pastebin/QnTIhMzx/
[12:59:04] or you meant the node name? (that might be more helpful in this case)
[13:02:32] i think we keep enough old builds that we should be able to just look up the node from the build name
[13:07:31] heads up: don't refresh your existing lima-kilo envs yet. The harbor upgrade doesn't interfere with existing robot accounts, but something has changed in how new accounts are created, as we do in setup_harbor.py in builds-builder.
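A rough sketch of the strace approach that located the cached CA at /srv/puppet/server/ssl/certs/ca.pem above; the certname is a placeholder (per the pro tip, `puppet node clean` wants one or it just errors out on usage), and it assumes a root shell on the puppetserver:

```
# trace which certificate files the command actually opens
strace -f -e trace=openat puppet node clean some-host.some-project.eqiad1.wikimedia.cloud 2>&1 \
  | grep -E '\.pem'
```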
[13:14:38] ack
[13:22:14] ack
[13:23:13] taavi: sounds reasonable yes, then we can just output the same when using the non-detached build start, so you get the build name too, should be easy
[13:42:06] cloudcontrol1005 nova seems to be down
[13:42:44] Unexpected exception in API method: nova.exception.NovaException: Cell 1ee5b233-6b94-40f5-b3d2-fc1a89c13274 is not responding or returned an exception, hence instance info is not available.
[13:43:55] Rook: it seems that the request came from you? are you doing anything on openstack?
[13:44:24] `req-121bf4c3-8290-4498-b287-ddeeb34e7273 rook superset - - default default`
[13:45:19] I've been poking around in codfw though not eqiad1
[13:45:23] the inner error is `if listener is self.listeners[evtype][fileno]:` `TypeError: expected string or bytes-like object, got 'int'`, seems weird, like wrong python library versions weird
[13:46:56] Rook: with superset? Does superset have any integration with openstack/cinder/s3/etc.?
[13:47:11] It does, should make a volume
[13:49:14] It uses the cindercsi to do so https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Magnum#Add_cinder_csi_to_cluster
[13:49:21] it seems that after the exception, the process got stuck in `Exception during reset or similar: AssertionError: do not call blocking functions from the mainloop`, with sqlalchemy trying to rollback
[13:49:44] does it use its own app credentials? or does it use yours?
[13:50:00] (the api request was auth'd as rook)
[13:50:12] It uses an application credential. I'm unclear on if that belongs to the project or the user. From what you're seeing sounds like the user
[13:51:13] application credentials are kind of weird, they're mostly scoped to the project but there are a few user-wide things they can do. those few weird things are why our documentation says you only ever stick application credentials for specific service accounts inside cloud vps
[13:51:16] maybe yes
[13:51:58] I restarted the api and everything seems ok
[13:52:51] if it happens again I'll look more in depth into it (or if anyone has seen that error before)
[13:53:31] Superset was upgraded yesterday, so the new version might be treating the volume provision differently. I can roll back if it is an issue
[13:53:56] it happened once only, let's see if it reappears, on the superset side is everything running ok?
[13:54:10] dcaro: was that error in cinder api or nova api?
[13:54:20] Yeah superset seems happy
[13:54:40] andrewbogott: nova
[13:54:43] *nova-api
[13:54:55] OK. There's a bug in the threading library in antelope, seems to cause cinder-api to lock up now and then.
[13:55:06] So that's probably what you're seeing. It's supposedly fixed in B.
[13:55:25] `Cell 1ee5b233-6b94-40f5-b3d2-fc1a89c13274 is not responding` that's referring to cinder thingies?
[13:55:31] *is that referring...
[13:55:37] (/me curious)
[13:55:54] cells are kind of a broad concept xd
[13:56:23] that's for compute partitioning no?
[13:56:53] T354483 and T352635 are the issues I'm thinking of. Might not be the same issue though.
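For the "Cell ... is not responding" error above, a hedged sketch of how to look at the (single) cell and bounce the API on a cloudcontrol; the exact systemd unit name and any site-specific wrapping of nova-manage may differ on these hosts:

```
# as root on a cloudcontrol
nova-manage cell_v2 list_cells --verbose       # shows the cells and their DB/MQ connection info
systemctl restart nova-api                     # roughly what "I restarted the api" above amounts to
journalctl -u nova-api --since "15 minutes ago"  # check it came back without getting stuck again
```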
[13:56:54] T354483: nova-api seems to die after a while, complains of a full listen queue - https://phabricator.wikimedia.org/T354483
[13:56:54] T352635: [openstack] cloudcontrols getting out of space due to nova-api.log message 'XXX lineno: 104, opcode: 120' - https://phabricator.wikimedia.org/T352635
[13:57:06] anyhow, yes might be cinder misbehaving and nova getting stuck somewhat (cell not responding)
[13:57:20] 'cell' is a bit overloaded in nova but in theory we only have the one cell.
[13:57:44] each cell has its own scheduler, I believe.
[13:58:14] oh, so that sounds a bit more troublesome then, if the only cell we have is not responding
[13:59:10] Anytime I've seen it, it was just the api process that got stuck. Is the error still happening after a restart?
[13:59:29] it's not no, and only happened once
[13:59:39] (though it seemed to make haproxy find the service as down)
[14:00:05] haproxy depooling the service seems like a good thing :)
[14:01:52] yep, the weird part is that it did not recover on its own
[14:03:25] yeah, it gets stuck in a spin and fills up the log file, typically
[14:18:39] quick review here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1016795
[14:20:48] +1d
[14:21:40] thanks!
[14:35:47] quick review? https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/37
[14:35:47] this fixes the broken robot permissions in lima-kilo
[14:50:25] thanks for the review dcaro!
[14:50:36] does anyone remember how we deploy PSP into toolforge k8s currently?
[14:50:48] maintain-kubeusers
[14:50:51] it used to be yaml files in puppet
[14:50:52] ok
[14:51:20] the admin privileged psp used to be in puppet I think, but user PSPs have always been in maintain-kubeusers as they're user-specific
[14:51:58] lima-kilo should be unbroken now
[14:52:12] taavi: but I can't find the file definition itself
[14:53:07] oh wait, gotcha, it is per-ns and maintain-kubeusers does it
[14:53:18] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/blob/main/maintain_kubeusers/k8s_api.py?ref_type=heads#L480
[14:53:38] yeah, staring at that
[15:22:46] quick review here? https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/merge_requests/19
[16:07:31] * arturo offline
[16:22:17] * dhinus off
[16:26:36] * dcaro off
[17:25:10] I can't get into a new system I launched in codfw1dev as myself or root, I seem to be able to get into existing systems. Any thoughts on why?
[17:48:59] usually that's a sign that Puppet broke during the first run. can you see any logs via horizon?
[17:57:26] Looks like it is getting the same timeouts that the older systems are getting. The older systems don't appear to have puppet updated, perhaps the images are still running older puppet and not communicating well with the puppet master?
[18:02:15] Rook: what host are you trying to reach?
[18:05:03] test-remove.k8s-dev.codfw1dev.wikimedia.cloud or test-remove2.k8s-dev.codfw1dev.wikimedia.cloud with the underlying intent of getting puppet working on older systems probably by redeploy/updating them
[18:06:00] ok, looking
[18:07:07] * bd808 lunch
[18:11:16] Rook: sorry, I think this was my mistake; there was a security group missing from the new puppetserver that meant lots of projects couldn't reach it.
[18:11:45] I think puppet runs should start working again in lots of places. The new VMs that never came up might recover after 30 minutes or so, or you can delete and recreate them.
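Two hedged snippets for the new-VM situation above: checking a stuck instance's first boot via its console log, and the kind of rule the new puppetserver was apparently missing (puppet agents reach the server on TCP 8140). The instance name comes from the chat; the security group name and source range are placeholders, not the actual values used:

```
# with admin/project credentials: look at the first-boot output of a VM that never came up
openstack console log show test-remove | tail -n 50

# illustrative sketch only: allow agents to reach the puppetserver on 8140
openstack security group rule create --ingress --protocol tcp --dst-port 8140 \
  --remote-ip <agent-network-cidr> <puppetserver-security-group>
```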
[18:11:52] Sounds ideal, let me see
[18:14:11] Puppet seems to run fine now, thank you
[18:14:32] something seems to still be broken for new VMs, I'll ping when I have that fixed
[18:40:47] * andrewbogott grumbles
[19:36:48] Rook: new VMs should work in codfw1dev now.
[19:46:38] Thanks!
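Rather than waiting the ~30 minutes mentioned above, a quick sanity check on a VM that never came up is to trigger a puppet run by hand; a small sketch assuming root on the VM (the run-puppet-agent wrapper may or may not be present on the image):

```
# force an immediate run instead of waiting for the next scheduled one
puppet agent --test
# or, if the usual wrapper script is installed on the image:
run-puppet-agent
```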